Technical information about GVC Emotion Recognition


GVC NEURO - EMO ANALYSIS

Introduction

The GVC - Neuro & Emo Analysis Service detects emotions directly from the voice and from speech samples.

The Emotion Recognition software of Good Vibrations Company is web-based, which makes it straightforward to integrate into your own software. We use the standard WebSocket protocols (WS and WSS) and the JSON data format.



License

GVC provides its software on the basis of license agreements. After the license agreement has been concluded, you receive a Token that authorizes you to use the web service until the agreed expiration date.

Delivery of speech data

Speech samples must be delivered using the WS or WSS protocol. Currently only PCM is supported. Support for other audio formats, such as WAV and MP3, is planned for the second quarter of 2020. Each server connection carries a single mono channel. To handle stereo audio, either combine the two stereo channels into one mono channel or use two separate connections.
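
If the source is stereo, one way to combine the channels is to average them sample by sample. The sketch below assumes 16-bit signed little-endian PCM with the left and right samples interleaved (L, R, L, R, ...). Neither the byte order nor this input layout is specified here, and the function name stereo_to_mono is only illustrative.

```python
def stereo_to_mono(stereo: bytes) -> bytes:
    """Average interleaved 16-bit stereo PCM (L, R, L, R, ...) into mono.

    Assumption: samples are signed, little-endian, 2 bytes each.
    """
    if len(stereo) % 4 != 0:
        raise ValueError("expected whole stereo frames of 4 bytes each")
    mono = bytearray()
    for offset in range(0, len(stereo), 4):
        left = int.from_bytes(stereo[offset:offset + 2], "little", signed=True)
        right = int.from_bytes(stereo[offset + 2:offset + 4], "little", signed=True)
        # Floor division keeps the average inside the 16-bit range.
        mono += ((left + right) // 2).to_bytes(2, "little", signed=True)
    return bytes(mono)
```

Averaging avoids the clipping that a plain sum of both channels could cause. Using two separate connections instead keeps the speakers on each channel apart.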


Emotion probability responses

The analysis of the speech results in an assessment of the probability of specific emotions in the individual's voice. The emotion probabilities of the signal processed via the voice channel are reported at a fixed interval. The length of the interval in seconds can be configured with the URL key "ResponseTime". At the end of each interval the server sends a response in JSON format.


Processing of the speech sample

The speech sample is processed in two phases. In the first phase we calculate a number of essential indicators that characterise the voice. These indicators include, for example, pitch, intensity, spectral slope, dynamics and jitter.


Neural Network

In the second phase we predict the emotions from the indicators using a neural network. The neural network is trained on, among other sources, the scientifically annotated speech emotion databases of several leading universities.


Connection Setup

To set up a connection, open a standard WebSocket connection.


Command

wss://www.gvcemo.com/realtime?Frequency=44100&SampleSize=2&ResponseTime=2

The URL can have the following keys:

Key            Values       Default   Description
Format         PCM          PCM       format of the voice data
Frequency      8000-48000   44100     sampling frequency in Hz
SampleSize     2-3          2         byte size of samples
ResponseTime   0.5-900      2         response time in seconds (tenths of a second can be specified)

The initial WSS request has a special Request Header key:

Key              Value   Description
GVCemo-S2E-Key           the key to authorise your request

Note: the GVCemo-S2E-Key can be passed in the URL or in the Request Header. Passing it in the URL is unsafe, so the Request Header is preferred.
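
The keys above can be assembled and range-checked before connecting. The following is a minimal sketch using only the Python standard library. The helper names build_url and auth_headers are illustrative and not part of the service. How the header dictionary is passed on depends on the WebSocket client library you use.

```python
from urllib.parse import urlencode

BASE_URL = "wss://www.gvcemo.com/realtime"  # endpoint taken from the example URL

def build_url(frequency: int = 44100, sample_size: int = 2,
              response_time: float = 2.0) -> str:
    """Return the connection URL after checking each key against its documented range."""
    if not 8000 <= frequency <= 48000:
        raise ValueError("Frequency must be between 8000 and 48000 Hz")
    if sample_size not in (2, 3):
        raise ValueError("SampleSize must be 2 or 3 bytes")
    if not 0.5 <= response_time <= 900:
        raise ValueError("ResponseTime must be between 0.5 and 900 seconds")
    query = urlencode({
        "Frequency": frequency,
        "SampleSize": sample_size,
        "ResponseTime": f"{response_time:g}",  # 2.0 becomes "2", 0.5 stays "0.5"
    })
    return f"{BASE_URL}?{query}"

def auth_headers(key: str) -> dict:
    """Carry the key in the request header rather than in the URL."""
    return {"GVCemo-S2E-Key": key}
```

Calling build_url() with its defaults reproduces the example URL shown above.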

Streaming Commands

Once the connection is established, you must send commands to the server as text frames. The sound data must be sent as binary data frames. The server sends all its responses as text frames.

Command   Description
start     Start the GVC-emo service. When the connection has been successfully initialised, the server responds with the JSON command { "state": "listening" }. You can now send the audio. The server closes the connection when the initialisation fails.
analyse   Force analysis of the received data. This supersedes ResponseTime. The server returns the analysis results.
stop      Stop the GVC-emo service. The server returns the last analysis results and then sends the JSON command { "state": "stopped" }.
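
Because state messages and analysis results both arrive as JSON text frames, a client needs to tell them apart. A minimal sketch follows, assuming that any frame containing a "state" key is a state message and every other frame is an analysis result. The function name classify_frame is illustrative.

```python
import json

def classify_frame(text: str) -> tuple:
    """Split a server text frame into ("state", value) or ("result", dict)."""
    message = json.loads(text)
    if "state" in message:
        return ("state", message["state"])  # e.g. "listening" or "stopped"
    return ("result", message)              # analysis results
```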

Audio

The service currently supports only pure PCM, which must be sampled at a frequency between 8000 Hz and 48000 Hz. The sample size can be 2 or 3 bytes (16 or 24 bits). The samples cannot be interleaved.
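
The sampling frequency and the sample size together determine how many bytes make up a given stretch of audio. The sketch below computes the size of one 10 millisecond slice (the unit counted by the "total" response field) and cuts a PCM buffer into chunks for sending. The chunk length of 10 slices per frame is an arbitrary choice for illustration, not a requirement of the service, and both function names are hypothetical.

```python
def bytes_per_slice(frequency: int, sample_size: int) -> int:
    """Bytes of mono PCM in one 10 ms slice (1/100 of a second)."""
    return frequency * sample_size // 100

def split_into_chunks(pcm: bytes, frequency: int, sample_size: int,
                      slices_per_chunk: int = 10):
    """Yield consecutive chunks of `slices_per_chunk` slices; the last may be shorter."""
    step = bytes_per_slice(frequency, sample_size) * slices_per_chunk
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]
```

For example, at 44100 Hz with 2-byte samples one slice is 44100 * 2 / 100 = 882 bytes, so a chunk of 10 slices is 8820 bytes.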


Output response format

The service returns all output in JSON.

The response contains the following standard fields.


time       seconds since the sound started
happy      probability that the speaker is happy
sad        probability that the speaker is sad
angry      probability that the speaker is angry
afraid     probability that the speaker is afraid
balanced   probability that the speaker is balanced
mood       calculated mood of the speaker as a value between 0 (negative) and 1 (positive); mood is calculated using the other emotions
total      total number of 10 millisecond slices received
valid      number of valid slices of speech, i.e. a voice has been detected

Typical response

{"time": 0.5, "happy": 0.379, "sad": 0.055, "angry": 0, "afraid": 0.108, "balanced": 0.456, "mood": 0.836, "total": 50, "valid": 47}
{"time": 1, "happy": 0.125, "sad": 0.001, "angry": 0.169, "afraid": 0.074, "balanced": 0.627, "mood": 0.753, "total": 50, "valid": 28}
{"time": 1.39, "happy": 0, "sad": 0.255, "angry": 0.001, "afraid": 0, "balanced": 0.743, "mood": 0.743, "total": 39, "valid": 33}
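
A response like the ones above can be reduced to the most probable emotion and the share of slices in which a voice was detected. This is a minimal sketch using the documented field names. The summarise helper and the "voiced_ratio" key are illustrative additions, not part of the service output.

```python
import json

EMOTIONS = ("happy", "sad", "angry", "afraid", "balanced")

def summarise(text: str) -> dict:
    """Pick the most probable emotion and the fraction of slices containing speech."""
    response = json.loads(text)
    dominant = max(EMOTIONS, key=lambda name: response[name])
    total = response["total"]
    voiced = response["valid"] / total if total else 0.0
    return {"time": response["time"], "dominant": dominant,
            "probability": response[dominant], "voiced_ratio": voiced}

sample = ('{"time": 0.5,"happy": 0.379,"sad": 0.055,"angry": 0,"afraid": 0.108,'
          '"balanced": 0.456,"mood": 0.836,"total": 50,"valid":47}')
print(summarise(sample))
# prints {'time': 0.5, 'dominant': 'balanced', 'probability': 0.456, 'voiced_ratio': 0.94}
```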

Example of a conversation.

Client                                  GVCemoServer
Set up a websocket connection       →
                                    ←   Accept connection
Send start                          →
                                    ←   Send "state": "listening"
Send continuous sound samples       →
                                    ←   Send response every ResponseTime seconds
...                                     ...
                                    ←   Send response every ResponseTime seconds
Send stop                           →
                                    ←   Send last response
                                    ←   Send "state": "stopped"
Close the websocket connection      →
                                        The websocket connection is closed
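
The conversation above could be driven from client code roughly as follows. This is a sketch, not a reference client. It assumes an already opened WebSocket object ws whose send method transmits a text frame for str and a binary frame for bytes, and whose recv method returns the next text frame as str. Opening and closing the connection, including the GVCemo-S2E-Key header, is left to the WebSocket library of your choice. The name run_session is hypothetical.

```python
import asyncio
import json

async def run_session(ws, audio_chunks):
    """Drive one start / stream / stop conversation and return the analysis results.

    Assumption: `ws` offers async `send(str | bytes)` and async `recv() -> str`.
    """
    results = []

    await ws.send("start")
    reply = json.loads(await ws.recv())
    if reply.get("state") != "listening":
        raise RuntimeError(f"unexpected reply to start: {reply}")

    async def read_until_stopped():
        # Collect periodic responses until the server reports "stopped".
        while True:
            message = json.loads(await ws.recv())
            if message.get("state") == "stopped":
                return
            results.append(message)

    reader = asyncio.create_task(read_until_stopped())
    for chunk in audio_chunks:
        await ws.send(chunk)   # bytes, so sent as a binary frame
    await ws.send("stop")
    await reader               # the last response arrives before "stopped"
    return results
```

Reading in a separate task lets the periodic responses be collected while audio is still being sent, which matches the interleaved arrows in the diagram.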