Course12:Advanced - VoIP Protocols

From innovaphone-wiki

Jump to: navigation, search

This book describes the different protocol classes used in VoIP communications.



The idea behind Voice over IP is to transport voice (or video) streams between two or more IP endpoints. Since the transported data is real-time media, special conditions must be fulfilled. To construct and synchronize the voice streams from the received IP packets, each packet must be tagged with a timestamp. Since none of the existing transport - protocols offered this stamping mechanism, a new protocol called RTP (Real-time Transport Protocol) was introduced.

The services provided by RTP include:
  • Payload-type identification - Indication of what kind of content is being carried
  • Sequence numbering - PDU sequence number
  • Time stamping - allow synchronization and jitter calculations
  • Delivery monitoring
However before RTP packets are sent between the communication partners, the parameters for the RTP session must be negotiated. This negotiation process is done using a signalling protocol.

screenshot.png VoIP-protocols

As already mentioned, VoIP protocols can be divided into two categories, signalling and media protocols. While RTP is currently the only media protocol in use, the wikipedia.ico list of signalling protocols is rather long. The most important signalling protocols, and also the only implemented by innovaphone, are H.323 and SIP.


A codec converts analog audio signals into digital signals for transmission. A receiving device then converts the digital signals back to analog using an audio decompressor, for playback. The term codec is a combination of 'coder-decoder'.

The wikipedia.ico list of existing codecs is rather long, however there are some standards released by the ITU-T that are supported by most manufacturers. These are also the only codecs used by innovaphone devices:
  • G711 A/u
  • G729
  • G.723
  • G.722
  • OPUS-W
Ok, so we have 6 codecs but what is the difference between them?

Each of them uses an own compression algorithm, resulting in a different audio quality and bandwidth requirements.


Sample Rate
Codec net Bandwith
approx. Wire Bandwidth (incl. Overhead)
G.711 A/u
8 kHz
64 kbit/s 85 kbit/s
8 kHz 8 kbit/s 32 kbit/s low
8 kHz ~6 kbit/s 22 kbit/s
16 kHz 64 kbit/s
85 kbit/s
OPUS-WB 16 kHz 19 kbits/s
45 kbit/s
8 kHz 11 kbit/s
35 kbit/s

The discrepancy between both bandwidth requirements comes from the use of a packet switched network. The encoded voice samples are encapsulated in RTP packets, RTP in UDP, UDP in IP packets and so on. This creates substantial overhead.
screenshot.png Bandwith-requirements

The effective bandwidth used also depends on the length of the encoded sample that is sent in a single packet. The most popular length used is 20ms. That is, a single RTP packet will carry 20ms worth of voice data. However, if you configure a larger packet size (say, 40ms), then more data is sent in a packet but the header overhead remains the same (as it depends on the number of packets sent, not the size of the voice data sent). In effect, using large packet sizes will greatly reduce overhead. However, at the same time it will of course increase the transmission delay.

(Further Hints) All bandwidths are "per direction". Each call will thus consume two times this bandwidth.


Fax transmission (as all other modem transmission signals) is destroyed when a codec compresses the voice data (as such codecs are "lossy", that is, the decompressed audio is not identical to the audio before compression).

This is why such data must be transferred using G711 (also known as 8kHz transparent), as this is a non-compressing codec. There are 2 issues with this:
  • it consumes a lot of bandwidth and
  • it is still badly affected by packet loss and jitter

A technology known as T.38 has been developed to address both issues. T.38 defines the digital transmission of T.30 fax data over UDP. Fax modem signals are decoded on the sending end and the pure T.30 fax data is sent via UDP. For this to work, the sender needs to employ a full working fax modem. On the receiving end, the T.30 data received is either processed directly (the innovaphone fax server works this way), or it is re-encoded as fax modem data and send to a conventional fax device (usually connected via an ATA).

This technology consumes much less bandwidth and also is less vulnerable to packet loss. However, it requires expensive resources for modem/T.38 conversion (usually DSPs). It is therefore often not implemented in cheaper ATAs or line boards.


Beside audio, it's as well possible to negotiate the use of video via H.323/SIP. Before transmission via RTP, the video data captured by any video device (e.g. camera), is de-/compressed using special video codecs such as H.264 or VP8.

H.264 is the successor of H.263 and was designed to cover application scenarios like HDTV, Portable video, multimedia, video conferencing and many more.
Depending on the resolution and image refresh rate (fps) of the capturing device, bandwidth requirements are starting from ca. 300 kbit/s for a one-way communication.
innovaphone supports video over IP in the UC client myPBX.


H.323 is a system specification that describes the use of several ITU-T and IETF protocols. The protocols that comprise the core of almost any H.323 system are:

  • H.225.0 Registration, Admission and Status (RAS), which is used between an H.323 endpoint and a Gatekeeper to provide address resolution and admission control services.
  • H.225.0 Call Signalling, which is used between any two H.323 entities in order to establish communication.
  • H.245 control protocol for multimedia communication, which describes the messages and procedures used for capability exchange, opening and closing logical channels for audio, video and data, control and indications.
  • Real-time Transport Protocol (RTP), which is used for sending or receiving multimedia information (voice, video, or text) between any two entities.

Many H.323 systems also implement other protocols that are defined in various ITU-T recommendations to provide supplementary services support or deliver other functionality to the user. Some of those recommendations are:

  • H.235 series describes security within H.323, used by innovaphone for password and SRTP key encryption.
  • H.450 series describes various supplementary services (e.g. Call Pickup, MWI).

In addition to those ITU-T recommendations, H.323 utilizes various IETF Request for Comments (RFCs) for media transport and media packetization, including the Real-time Transport Protocol (RTP).

H.323 Architecture

The H.323 system defines several network elements that work together in order to deliver rich multimedia communication capabilities. Those elements are Terminals, Multipoint Control Units (MCUs), Gateways, and Gatekeepers. Collectively, terminals, multipoint control units and gateways are often referred to as endpoints.

While not all elements are required, at least two terminals are required in order to enable communication between two people. In most H.323 deployments, a gatekeeper is employed in order to, among other things, facilitate address resolution.

H.323 Network Elements


Terminals in an H.323 network are the most fundamental elements in any H.323 system, as those are the devices that users would normally encounter. They normally exist in the form of an IP phone.

Multipoint Control Units (MCU)

A Multipoint Control Unit (MCU) is responsible for managing multipoint conferences. In more practical terms, an MCU is a conference bridge not unlike the conference bridges used in the PSTN today. The most significant difference, however, is that H.323 MCUs might be capable of mixing or switching video, in addition to the normal audio mixing done by a traditional conference bridge.


Gateways are devices that enable communication between H.323 networks and other networks, such as PSTN or ISDN networks. If one party in a conversation is utilizing a terminal that is not an H.323 terminal, then the call must pass through a gateway in order to enable both parties to communicate.


A Gatekeeper is an optional component in the H.323 network that provides a number of services to terminals, gateways, and MCU devices. Those services include endpoint registration, address resolution, admission control, user authentication, and so forth. Of the various functions performed by the gatekeeper, address resolution is the most important as it enables two endpoints to contact each other without either endpoint having to know the IP address of the other endpoint.

Gatekeepers may be designed to operate in one of two signalling modes, namely "direct routed" and "gatekeeper routed" mode. Direct routed mode is the most efficient and most widely deployed mode. In this mode, endpoints utilize the RAS protocol in order to learn the IP address of the remote endpoint and a call is established directly with the remote device. In the gatekeeper routed mode, call signalling always passes through the gatekeeper. While the latter requires the gatekeeper to have more processing power, it also gives the gatekeeper complete control over the call and the ability to provide supplementary services on behalf of the endpoints. innovaphone device work in "gatekeeper routed" mode.

Mapped to innovaphone devices these H.323 elements would correspond to:

  • Terminal -> IP phone
  • MCU -> innovaphone gateway working as conference server
  • Gateway -> all innovaphone gateways.
  • Gatekeeper -> all devices running the innovaphone PBX

H.323 - Signaling

As mentioned in the previous chapter, H.323 defines three main protocols for call signalling: RAS, H.225 and H.245.


RAS (Registration, Admission and Status) is a communication protocol between a H.323 Terminal and a Gatekeeper. Unlike the other H.323 signalling protocols, RAS uses UDP as underlying transport protocol.

The main functions of RAS are:
  • Gatekeeper Discovery
  • Registration of Terminals at the Gatekeeper
  • Call admission and address resolution

screenshot.png RAS

When an endpoint is powered on, it will generally send either a gatekeeper request (GRQ) message to "discover" gatekeepers that are willing to provide service or will send a registration request (RRQ) to a gatekeeper that is predefined in the system’s administrative setup. Gatekeepers will then respond with a gatekeeper confirm (GCF). If a GRQ has been sent the endpoint will then select a gatekeeper with which to register by sending a registration request (RRQ), to which the gatekeeper responds with a registration confirm (RCF). At this point, the endpoint is known to the network and can make and place calls.

When an endpoint wishes to place a call, it will send an admission request (ARQ) to the gatekeeper. The gatekeeper will then resolve the address and return the address of the remote endpoint in the admission confirm message (ACF). The endpoint can then place the call.


Once the address of the remote endpoint is resolved using RAS, the terminal will use H.225 in order to establish, control and end a H.323 call. The H.225 call signalling is based on the call setup procedures for ISDN, described in the Q.931 / Q.930 standards. Simplified one can say that the H.225 represent an IP implementation of the ISDN D - channel methods.

screenshot.png H225

In the example above, we will discuss the basic signalling methods in H.225. Also we will concentrate on the "gatekeeper - routed" mode, since this is the common method used with innovaphone devices.
The call is started by Alice sending a SETUP message (1) to the gatekeeper. A SETUP ACKNOWLEDGE message (2) notifies the caller that the request is being processed. The gatekeeper will forward the SETUP message (3) to Bob's terminal, normally resulting into a ringing tone being played on the phone. This is indicated by the Alerting message (4). If Bob picks up the call, a Connect message (5) is sent to the gatekeeper and then gets forwarded to Alice.

The Call Termination is signalled by the Release Complete message (6).


While H.225 is used to signal the remote terminal a call request, it lacks the methods for opening RTP channels needed for the transport of voice/video data. This task is performed by the H.245 protocol.
The main functions of H.245 are:
  • exchange of terminal capabilities (e.g. supported audio codecs)
  • master/slave determination
  • establish, control and terminate logical channels (RTP/RTCP)
Capability Negotiation

Of the functionality provided by H.245, capability negotiation is arguably the most important, as it enables devices to communicate without having prior knowledge of the capabilities of the remote entity. H.245 enables rich multimedia capabilities, including audio, video, text, and data communication. For transmission of audio, video, or text, H.323 devices utilize both ITU-defined codecs and codecs defined outside the ITU. Codecs that are widely implemented by H.323 equipment include:

* Video codecs: H.261, H.263, H.264
* Audio codecs: G.711, G.729, G.729a, G.723.1, G.726, G.722, OPUS

When an H.323 device initiates communication with a remote H.323 device and when H.245 communication is established between the two entities, the Terminal Capability Set (TCS) message is the first message transmitted to the other side.

Master/Slave Determination

After sending a TCS message, H.323 entities (through H.245 exchanges) will attempt to determine which device is the "master" and which is the "slave." This process, referred to as Master/Slave Determination (MSD), is important, as the master in a call settles all negotiation conflicts between the two devices.

Logical Channel Signaling

Once capabilities are exchanged and master/slave determination steps have completed, devices may then open "logical channels" or media flows. This is done by simply sending an Open Logical Channel (OLC) message and receiving an acknowledgement message. Upon receipt of the acknowledgement message, an endpoint may then transmit audio or video to the remote endpoint.

H.323 Fast Connect and H.245 Tunneling

The original H.323 signalling protocols underwent many changes in order to shorten the time needed for the RTP session establishment.

Fast Connect (FC)

The first thing to improve was the rather long H.245 message handshake and the support of "early media". Early media is a term used for the setup of RTP channels between the communication partners, before the call has been accepted (Connect) by both endpoints. This feature is used to play announcements or dialtones to the waiting caller.

screenshot.png h.323-simple

As shown in the upper right picture, the OLC (Open Logical Channel) message, a H.245 message, is sent encapsulated in a H.225 message (Connect, Alerting). The drawback of Fast Connect is that it can be used only in homogeneous (all device are compatible) environment. By sending the OLC without initially checking the capabilities (TCS) of the remote terminal, it is assumed that both terminal support the same set of capabilities.

As shown in the picture above, the RTP Stream (red) goes directly from endpoint to endpoint. However some customer scenarios require the voice stream to pass through the PBX (e.g. firewall traversal). innovaphone defined the term 'Media Relay' for gateways working in this mode. The redirection of voice data through the PBX has one main disadvantage, it creates a high CPU load on the 'relaying' gateway. Therefore this option is usually off by default and must be enabled manually.

H.245 Tunnelling

This method is the logical enhancement of the Fast connect procedure, since it encapsulates not only the OLC messages but every H.245 message in H.225 messages. As a result the separate H.245 TCP connection between the conversation partners is not needed. This saves processing power as well as TCP sockets on the innovaphone hardware, and also eases firewall traversal.

Extended Fast Connect(EFC)

EFC fastens the renegotiation of logical channel attributes during a conversation (e.g. a terminal is put on hold and receives MoH). Instead of running through the complete H.245 handshake process, a change of RTP attributes is done by sending a single OLC message to the remote endpoint. Upon it's receipt, the terminal will close the old logical channel and open a new one using the newly obtained parameters.

Each of these enhancements to the original H.323 protocol is implemented by innovaphone. To improve interoperability with 3rd party vendors, it is possible to disable FC and H.245 tunnelling at the GW - interface.

H.323 - TCP/TLS

H.323 over TCP (defined in H.460.17) continues the idea of H.245 tunnelling and sends also the RAS messages over the H.225 TCP-connection. As a result, only one TCP connection is created to the gatekeeper(PBX). This connection is used to transport registration information, as well as all information needed to establish a call.
An additional advantage of using one TCP connection, is that it can be encrypted using TLS.

The drawback of using H.323 over TCP/TLS is that the RAS Gatekeeper Discovery cannot be used. Since the phone/endpoint cannot discover its gatekeeper(PBX) in the LAN, each endpoint must be configured with its gatekeeper IP-address. This can be distributed by DHCP to innovaphone phones, if a larger number of endpoints must be supplied with the gatekeeper IP-address.


SIP clients typically use TCP or UDP (typically on port 5060 and/or 5061) to connect to SIP servers and other SIP endpoints. SIP is primarily used in setting up and tearing down voice or video calls. However, it can be used in any application where session initiation is a requirement. These include Event Subscription and Notification, Terminal mobility and so on. There are a large number of SIP-related RFCs that define behaviour for such applications. All voice/video communications are done over separate session protocols, typically RTP.

SIP works in concert with several other protocols and is only involved in the signalling portion of a communication session. SIP is a carrier for the Session Description Protocol (SDP), which describes the media content of the session, e.g. what IP ports to use, the codec being used etc. In typical use, SIP "sessions" are simply packet streams of the Real-time Transport Protocol (RTP). RTP is the carrier for the actual voice or video content itself.

screenshot.png Typical_SIP_Stack

SIP is similar to HTTP and shares some of its design principles: It is human readable and request-response structured. SIP shares many HTTP status codes, including the familiar '404 not found'.

SIP Architecture

SIP User Agents (UAs) are the end-user devices, used to create and manage a SIP session. A SIP UA has two main components:
  • the User Agent Client (UAC), which sends messages and answers with SIP responses,
  • the User Agent Server (UAS), which responds to SIP requests sent by the peer.
SIP UAs may work in point to point mode. Typical implementations of a UA are SIP softphones, SIP hardphones and SIP-enabled ATAs.

SIP also defines server network elements. Although two SIP endpoints can communicate without any intervening SIP infrastructure, which is why the protocol is described as peer-to-peer, this approach is impractical for a public service. There are various implementations that can act as SIP servers:

Proxy Server:

A Proxy Server is responsible to route incoming call requests to the intended recipient.
screenshot.png SIP-Proxy

Upon receiving of a call (msg. 1) from one UA (Alice), the SIP Proxy looks up the address of the callee (Bob) at the registrar responsible for this domain (msg. 2). Then the server will create a new SIP session to Bob and forward signalling messages between both endpoints. This corresponds to the "Gatekeeper - routed" mode in H.323, as the Proxy remains always in the communication path.

Registrar Server:

The role of this server type is to accept and check the credentials of incoming UA registrations. If the client is allowed to use the SIP service, the Registrar will store it's IP-address and by this to make a SIP username <-> IP - address mapping. For this reason a Proxy or Redirect server are able to contact the Registrar, in order to find out the IP-address under which a UA can be reached.

Innovaphone combined the SIP Proxy server as well as the Registrar server functionality in the PBX software. As the PBX is also H.323 capable (Gatekeeper), it can communicate simultaneously with both H.323 and SIP clients.

SIP - Signalling

SIP is a request - response structured protocol, very similar to HTTP. The SIP requests are used to initiate an action, for example a phone conversation. The SIP Responses indicate whether a request succeeded or failed, and in the latter case, why it failed.

The most important request are INVITE and ACK used for call establishment, respectively BYE for call termination. The number of responses is rather vast, however the most used is 200 OK to successfully confirm a request.

Have a look at these wikipedia articles for a complete list of the wikipedia.ico SIP requests methods and wikipedia.ico SIP response codes.

SIP Call Establishment and Call Termination

screenshot.png SIP_-_Signaling

Call Establishment

As shown in the picture above, Alice initiates the call by sending an Invite message (1) to the Proxy Server of the domain The server answers Alice Invite using a 100 Trying message(2), indicating that the requests is being processed. In order to route the call to the correct IP - endpoint, the call server must request Bob's current location (3) at the Registrar responsible for the domain. After successfully completing the IP - address lookup, the proxy forwards the Invite message (4) to Bob's phone.
The called UA will respond with a 180 Ringing message(5), indicating that the call request was accepted and the phone is ringing. Upon receiving the 180 Ringing message, the proxy server will forward it to Alice UA. Finally a 200 OK message (6) is sent to the server, when Bob picks up the phone. As with the previous 180 Ringing, this response is also forwarded to Alice. The receipt of the 200 OK is confirmed using an ACK message (7) by Alice.

As shown in the graphic above, there are certain messages (Invite, 180 Ringing & 200 OK) which have an additional encapsulated SDP packet. The SDP part is used to negotiate the RTP parameters for this session.
The Invite SDP message contains a list of the UA's supported codecs and the IP-address and port number used to receive RTP packets. It's up to the called UA to choose the codec to use for the session. The codec selection is done by comparing the received codec list with the own list, and selecting a preferred codec from the subset of both lists.

Using the exchanged SDP information, two RTP channels (one for each direction) are set up between the UAs. The RTP packets do not pass through the SIP proxy but go directly between the communicating endpoints.

Call Termination

Eventually one of the conversation participants (in our example Alice) will hang up, resulting in a BYE message (8) being sent to the SIP proxy. The message is forwarded to Bob and confirmed using the 200 OK (9) response. Both UAs will now clear their RTP channels and return to idle mode.

H.323 vs. SIP

As we know by now SIP and H.323 are two concurrent signalling protocols. You will find lengthy debates in the Internet, discussing which protocol is better and will prevail over the other .

The H.323 specification was written by the ITU -T (International Telecommunication Union), a group of telecommunications specialists who also developed the ISDN standard. Therefore H.323 heavily relies on message formats and structures used in ISDN (i.e. Q.931) and offers a good interoperability to PSTN networks.

The SIP standard was introduced by the IETF (Internet Engineering Task Force), a group of network protocol specialist responsible also for other famous network protocols like HTTP. This is also the reason for the similarity between HTTP and SIP request and response messages.

As it is now, both protocols have reached a mature state and are constantly developed and improved. It is very probable that both protocols will also coexists in the future.

For a complete comparison of H.323 and SIP, please have a look at this www.png article.

Within the innovaphone PBX, H.323 should be used always.

SIP should be used in these cases:
  • connection to a SIP provider
  • external federation to non-innovaphone systems
  • connection to a 3rd party devices (even if the device supports H.323, as SIP interoperability is usually better than H.323 interoperability)
Personal tools