Reference13r3:Concept Text

From innovaphone wiki


Applies To

  • innovaphone PBX from version 13r3
  • innovaphone Gateways and IPVA

Requirements

  • innovaphone PBX
  • One Channel license per transcription channel used in parallel

Overview

The TEXT-Interface transcribes speech to text via an external transcription service. The transcribed speech is returned to the caller as UUI (User-to-User Information). Currently, the TEXT-Interface can only be used with a separate app, which has to be developed for this purpose. See the SDK documentation for details.

Concept

The TEXT-Interface does not contain a built-in transcription service. Instead, it uses an external transcription service, for example IBM Watson. To get the audio to the transcription service, a call is established to the TEXT-Interface. Within this call, the configuration data is sent to the TEXT-Interface via UUI. The TEXT-Interface uses the configuration data to connect to the transcription service. Once the connection to the transcription service is established, the audio stream is transmitted to the service and the transcribed text is sent back to the TEXT-Interface. There, the text is packed into a UUI message and sent to the original caller.

Figure: TEXT-Interface example flow (Text-Interface-example-flow.png)

Configuration

The TEXT-Interface is configured via UUI JSON messages sent during a call to the TEXT-Interface. The message format allows for different transcription providers, although currently only IBM-Watson is supported. The JSON message therefore selects a provider and passes its parameters:

{"textservice": "SERVICE-NAME", "params": { SERVICE-PARAMETER }}

IBM-Watson

The API documentation of the service can be found here: IBM-Watson SpeechToText

textservice
"ibm-watson"

There are some additional parameters for authentication and language selection:

api-key
API-Key as provided by IBM
location
Location of the used server as provided by IBM
instance_id
Instance ID of the Server as provided by IBM
language
Language of the audio stream that should be transcribed


Sample-Config message

{
  "textservice":"ibm-watson",
  "params":{
    "api-key":"xxxxxxx-xxxx",
    "location":"eu-de",
    "instance_id":"xxxx-yyyy-zzzz-qqqq",
    "language":"de-DE"
  }
}
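
As an illustration, a configuration message like the sample above could be assembled in an app as follows. This is only a minimal sketch: the function name build_config_message is hypothetical and not part of the innovaphone SDK, and sending the resulting string as UUI within the call is not shown.

```python
import json

def build_config_message(api_key, location, instance_id, language):
    """Return the JSON configuration string for the "ibm-watson" text service.

    Hypothetical helper for illustration; the key names follow the
    sample config message in this article.
    """
    message = {
        "textservice": "ibm-watson",
        "params": {
            "api-key": api_key,
            "location": location,
            "instance_id": instance_id,
            "language": language,
        },
    }
    # Compact separators keep the UUI payload small.
    return json.dumps(message, separators=(",", ":"))

# Placeholder credentials, as in the sample above.
config = build_config_message("xxxxxxx-xxxx", "eu-de", "xxxx-yyyy-zzzz-qqqq", "de-DE")
print(config)
```

Generating the message with a JSON library rather than by string concatenation avoids quoting errors in the API-Key or Instance ID.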

Authentication is done via an HTTP request with the API-Key against the configured instance. The audio stream is then sent via a WebSocket connection to the transcription service, and the transcribed text is sent back via the same WebSocket connection.
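
The following sketch shows how the location and instance_id parameters could map to the endpoints involved. The IAM token endpoint and the URL pattern are assumptions based on IBM's public Speech to Text documentation; this article does not confirm that the TEXT-Interface uses exactly these endpoints. The function names are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# Assumption: IBM Cloud IAM endpoint that exchanges an API-Key for a bearer token.
IAM_TOKEN_URL = "https://iam.cloud.ibm.com/identity/token"

def token_request_body(api_key):
    """Form-encoded body of the HTTP POST that requests an access token."""
    return urlencode({
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": api_key,
    })

def recognize_url(location, instance_id):
    """WebSocket URL of the recognize endpoint for a given instance (assumed pattern)."""
    return (
        f"wss://api.{location}.speech-to-text.watson.cloud.ibm.com"
        f"/instances/{instance_id}/v1/recognize"
    )

print(IAM_TOKEN_URL)
print(token_request_body("xxxxxxx-xxxx"))
print(recognize_url("eu-de", "xxxx-yyyy-zzzz-qqqq"))
```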

The audio codecs currently supported by the TEXT-Interface with IBM-Watson are G711A (alaw) and G711U (mulaw) with a sampling rate of 8000 Hz (8 kHz). The audio formats supported by the IBM Speech to Text service are described here: Speech to Text Audio Formats

The audio is sent in chunks of at least 1000 bytes. If the PBX is configured with a frame size of 20 ms, each frame contains 160 bytes. Six frames (960 bytes) are not enough, so the 1000-byte limit is reached with 7 frames (1120 bytes). The interval between two transmissions is therefore 140 ms.
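
The arithmetic can be reproduced for other frame sizes with a short calculation. This is a sketch that assumes G.711 audio (8000 samples per second, one byte per sample); the function name chunking is hypothetical.

```python
import math

SAMPLE_RATE_HZ = 8000    # G.711 sampling rate
BYTES_PER_SAMPLE = 1     # G.711 encodes each sample in one byte
MIN_CHUNK_BYTES = 1000   # minimum size of one transmission

def chunking(frame_ms):
    """Return (bytes per frame, frames per transmission, interval in ms)."""
    bytes_per_frame = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * frame_ms // 1000
    # Smallest number of whole frames that reaches the minimum chunk size.
    frames = math.ceil(MIN_CHUNK_BYTES / bytes_per_frame)
    return bytes_per_frame, frames, frames * frame_ms

print(chunking(20))  # (160, 7, 140)
print(chunking(30))  # (240, 5, 150)
```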

Sample Application

To run the STT-App, you have to register for the IBM-Watson services in the IBM-Cloud. After registration, a "Speech to Text-nn" resource has to be enabled. The API-Key, the location (e.g. eu-de in the URL) and the Instance-ID (last part of the URL) need to be copied into the STT-App configuration in the PBX-Manager. The same has to be done for the translation with a "Language Translator-1u" resource. The STT-App can then be provided to the users via a template.

The TEXT-Interface has to be registered to a User object (the name is required as an additional Hardware-ID), and the number of this User object has to be entered in the "Text E164" field of the STT configuration in the PBX-Manager. The name of the PBX from the User object has to be entered in the "Text PBX" field.
