Course12:Basic - VoIP Protocols
This book describes the different protocol classes used in VoIP communications.
Concept
The services provided by RTP include:
- Payload-type identification - Indication of what kind of content is being carried
- Sequence numbering - PDU sequence number
- Time stamping - allows synchronization and jitter calculations
- Delivery monitoring
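To make these services concrete, here is a minimal sketch that decodes the fixed 12-byte RTP header and exposes the payload type, sequence number and timestamp. It assumes the header layout from RFC 3550 and ignores header extensions and CSRC entries. The names `RtpHeader` and `parse_rtp_header` and the sample packet are made up for illustration.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    payload_type: int
    marker: bool
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the 12-byte fixed RTP header (layout as in RFC 3550)."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the fixed header")
    first, second, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=first >> 6,            # upper two bits of the first byte
        payload_type=second & 0x7F,    # lower seven bits of the second byte
        marker=bool(second & 0x80),    # top bit of the second byte
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


# Hand-built sample packet (assumption): version 2, payload type 8 (PCMA,
# i.e. G.711 A-law), followed by 160 bytes of dummy payload.
sample = struct.pack("!BBHII", 0x80, 8, 4711, 160, 0x1234ABCD) + b"\x00" * 160
header = parse_rtp_header(sample)
print(header)
```

A receiver can detect lost or reordered packets from gaps in `sequence_number` and estimate jitter from the spacing of `timestamp` values relative to arrival times.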
As already mentioned, VoIP protocols can be divided into two categories: signalling protocols and media protocols. While RTP is currently the only media protocol in use, the list of signalling protocols is rather long. The most important signalling protocols, and also the only ones implemented by innovaphone, are H.323 and SIP.
Codecs
The list of existing codecs is rather long, however there are some standards released by the ITU-T that are supported by most manufacturers. These are also the only codecs used by innovaphone devices:
- G.711 A/u
- G.729
- G.723
- G.722
- OPUS-NB
- OPUS-WB
Each of them uses its own compression algorithm, resulting in different audio quality and bandwidth requirements.
Bandwidth
Codec | Sample Rate | Codec net Bandwidth | approx. Wire Bandwidth (incl. Overhead) | Quality |
G.711 A/u | 8 kHz | 64 kbit/s | 85 kbit/s | medium |
G.729 | 8 kHz | 8 kbit/s | 32 kbit/s | low |
G.723 | 8 kHz | ~6 kbit/s | 22 kbit/s | bad |
G.722 | 16 kHz | 64 kbit/s | 85 kbit/s | high |
OPUS-WB | 16 kHz | 19 kbit/s | 45 kbit/s | high |
OPUS-NB | 8 kHz | 11 kbit/s | 35 kbit/s | medium |
The discrepancy between the two bandwidth figures comes from the use of a packet switched network. The encoded voice samples are encapsulated in RTP packets, RTP in UDP, UDP in IP packets and so on. This creates substantial overhead.
The effective bandwidth used also depends on the length of the encoded sample that is sent in a single packet. The most popular length used is 20ms. That is, a single RTP packet will carry 20ms worth of voice data. However, if you configure a larger packet size (say, 40ms), then more data is sent in a packet but the header overhead remains the same (as it depends on the number of packets sent, not the size of the voice data sent). In effect, using large packet sizes will greatly reduce overhead. However, at the same time it will of course increase the transmission delay.
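The effect of the packet size can be checked with a little arithmetic. The sketch below is a rough estimate, not the method used to produce the table: it assumes 58 bytes of per-packet overhead (12 RTP + 8 UDP + 20 IPv4 + 18 Ethernet including the frame check sequence) and ignores VLAN tags, SRTP and the Ethernet preamble. The function name `wire_bandwidth_kbps` is made up for illustration.

```python
def wire_bandwidth_kbps(codec_kbps: float, ptime_ms: float,
                        overhead_bytes: int = 58) -> float:
    """Estimate the per-direction wire bandwidth of one RTP stream.

    overhead_bytes is an assumption: RTP (12) + UDP (8) + IPv4 (20)
    + Ethernet header and FCS (18).
    """
    payload_bytes = codec_kbps * ptime_ms / 8   # kbit/s * ms = bits
    packets_per_second = 1000 / ptime_ms
    return (payload_bytes + overhead_bytes) * 8 * packets_per_second / 1000


for name, rate in (("G.711", 64), ("G.729", 8)):
    for ptime in (20, 40):
        estimate = wire_bandwidth_kbps(rate, ptime)
        print(f"{name} at {ptime} ms: {estimate:.1f} kbit/s")
```

With these assumptions the 20ms figures land close to the approximate values in the table, and doubling the packet size to 40ms halves the header overhead because only half as many packets are sent.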
All bandwidths are "per direction". Each call will thus consume two times this bandwidth.
Fax
Fax transmission (as all other modem transmission signals) is destroyed when a codec compresses the voice data (as such codecs are "lossy", that is, the decompressed audio is not identical to the audio before compression).
This is why such data must be transferred using G.711 (also known as 8kHz transparent), as this is a non-compressing codec. There are two issues with this:
- it consumes a lot of bandwidth and
- it is still badly affected by packet loss and jitter
A technology known as T.38 has been developed to address both issues. T.38 defines the digital transmission of T.30 fax data over UDP. Fax modem signals are decoded on the sending end and the pure T.30 fax data is sent via UDP. For this to work, the sender needs to employ a fully working fax modem. On the receiving end, the T.30 data received is either processed directly (the innovaphone fax server works this way), or it is re-encoded as fax modem data and sent to a conventional fax device (usually connected via an ATA).
This technology consumes much less bandwidth and also is less vulnerable to packet loss. However, it requires expensive resources for modem/T.38 conversion (usually DSPs). It is therefore often not implemented in cheaper ATAs or line boards.
Video
H.323
- H.225.0 Registration, Admission and Status (RAS), which is used between an H.323 endpoint and a Gatekeeper to provide address resolution and admission control services.
- H.225.0 Call Signalling, which is used between any two H.323 entities in order to establish communication.
- H.245 control protocol for multimedia communication, which describes the messages and procedures used for capability exchange, opening and closing logical channels for audio, video and data, control and indications.
- Real-time Transport Protocol (RTP), which is used for sending or receiving multimedia information (voice, video, or text) between any two entities.
Many H.323 systems also implement other protocols that are defined in various ITU-T recommendations to provide supplementary services support or deliver other functionality to the user. Some of those recommendations are:
- H.235 series describes security within H.323, used by innovaphone for password and SRTP key encryption.
- H.450 series describes various supplementary services (e.g. Call Pickup, MWI).
In addition to those ITU-T recommendations, H.323 utilizes various IETF Request for Comments (RFCs) for media transport and media packetization, including the Real-time Transport Protocol (RTP).
H.323 Architecture
While not all elements are required, at least two terminals are required in order to enable communication between two people. In most H.323 deployments, a gatekeeper is employed in order to, among other things, facilitate address resolution.
H.323 Network Elements
Terminals
Terminals in an H.323 network are the most fundamental elements in any H.323 system, as those are the devices that users would normally encounter. They normally exist in the form of an IP phone.
Multipoint Control Units (MCU)
A Multipoint Control Unit (MCU) is responsible for managing multipoint conferences. In more practical terms, an MCU is a conference bridge not unlike the conference bridges used in the PSTN today. The most significant difference, however, is that H.323 MCUs might be capable of mixing or switching video, in addition to the normal audio mixing done by a traditional conference bridge.
Gateways
Gateways are devices that enable communication between H.323 networks and other networks, such as PSTN or ISDN networks. If one party in a conversation is utilizing a terminal that is not an H.323 terminal, then the call must pass through a gateway in order to enable both parties to communicate.
Gatekeepers
A Gatekeeper is an optional component in the H.323 network that provides a number of services to terminals, gateways, and MCU devices. Those services include endpoint registration, address resolution, admission control, user authentication, and so forth. Of the various functions performed by the gatekeeper, address resolution is the most important as it enables two endpoints to contact each other without either endpoint having to know the IP address of the other endpoint.
Gatekeepers may be designed to operate in one of two signalling modes, namely "direct routed" and "gatekeeper routed" mode. Direct routed mode is the most efficient and most widely deployed mode. In this mode, endpoints utilize the RAS protocol in order to learn the IP address of the remote endpoint and a call is established directly with the remote device. In the gatekeeper routed mode, call signalling always passes through the gatekeeper. While the latter requires the gatekeeper to have more processing power, it also gives the gatekeeper complete control over the call and the ability to provide supplementary services on behalf of the endpoints. innovaphone devices work in "gatekeeper routed" mode.
Mapped to innovaphone devices, these H.323 elements would correspond to:
- Terminal -> IP phone
- MCU -> innovaphone gateway working as conference server
- Gateway -> all innovaphone gateways
- Gatekeeper -> all devices running the innovaphone PBX
H.323 - Signaling
RAS
RAS (Registration, Admission and Status) is a communication protocol between an H.323 terminal and a gatekeeper. Unlike the other H.323 signalling protocols, RAS uses UDP as its underlying transport protocol.
The main functions of RAS are:
- Gatekeeper Discovery
- Registration of Terminals at the Gatekeeper
- Call admission and address resolution
When an endpoint is powered on, it generally either sends a gatekeeper request (GRQ) message to "discover" gatekeepers that are willing to provide service, or sends a registration request (RRQ) directly to a gatekeeper that is predefined in the system's administrative setup. Gatekeepers answer a GRQ with a gatekeeper confirm (GCF); the endpoint then selects a gatekeeper and registers with it by sending an RRQ. In both cases, the gatekeeper responds to the RRQ with a registration confirm (RCF). At this point, the endpoint is known to the network and can place and receive calls.
When an endpoint wishes to place a call, it will send an admission request (ARQ) to the gatekeeper. The gatekeeper will then resolve the address and return the address of the remote endpoint in the admission confirm message (ACF). The endpoint can then place the call.
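The request/confirm pairs described above can be modelled as a small lookup service. This is only a toy sketch: real RAS messages are ASN.1-encoded and sent over UDP, and the class `ToyGatekeeper`, its `handle` method and the sample addresses are hypothetical. The admission reject (ARJ) answer for an unknown destination is part of RAS but is not covered in the text above.

```python
class ToyGatekeeper:
    """Toy model of the RAS exchanges (no wire format, no timers)."""

    def __init__(self):
        self.registrations = {}  # alias -> call signalling address

    def handle(self, message, **fields):
        if message == "GRQ":    # gatekeeper discovery
            return "GCF", {}
        if message == "RRQ":    # registration of a terminal
            self.registrations[fields["alias"]] = fields["address"]
            return "RCF", {}
        if message == "ARQ":    # call admission and address resolution
            address = self.registrations.get(fields["destination"])
            if address is None:
                return "ARJ", {}  # admission reject: destination unknown
            return "ACF", {"address": address}
        raise ValueError(f"unsupported RAS message: {message}")


gk = ToyGatekeeper()
print(gk.handle("GRQ"))
print(gk.handle("RRQ", alias="bob", address=("192.168.1.21", 1720)))
print(gk.handle("ARQ", destination="bob"))
print(gk.handle("ARQ", destination="carol"))
```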
H.225
Once the address of the remote endpoint has been resolved using RAS, the terminal uses H.225 to establish, control and end an H.323 call. H.225 call signalling is based on the ISDN call setup procedures described in the Q.931 / Q.930 standards. Put simply, H.225 is an IP implementation of the ISDN D-channel procedures.
Using the example above, we will discuss the basic signalling methods in H.225. We will concentrate on the "gatekeeper routed" mode, since this is the common mode used with innovaphone devices.
The call is started by Alice sending a SETUP message (1) to the gatekeeper. A SETUP ACKNOWLEDGE message (2) notifies the caller that the request is being processed. The gatekeeper forwards the SETUP message (3) to Bob's terminal, which normally results in a ringing tone being played on the phone. This is indicated by the Alerting message (4). If Bob picks up the call, a Connect message (5) is sent to the gatekeeper and then forwarded to Alice.
The Call Termination is signalled by the Release Complete message (6).
H.245
While H.225 is used to signal a call request to the remote terminal, it lacks the methods for opening the RTP channels needed to transport voice/video data. This task is performed by the H.245 protocol.
The main functions of H.245 are:
- exchange of terminal capabilities (e.g. supported audio codecs)
- master/slave determination
- establish, control and terminate logical channels (RTP/RTCP)
Of the functionality provided by H.245, capability negotiation is arguably the most important, as it enables devices to communicate without having prior knowledge of the capabilities of the remote entity. H.245 enables rich multimedia capabilities, including audio, video, text, and data communication. For transmission of audio, video, or text, H.323 devices utilize both ITU-defined codecs and codecs defined outside the ITU. Codecs that are widely implemented by H.323 equipment include:
- Video codecs: H.261, H.263, H.264
- Audio codecs: G.711, G.729, G.729a, G.723.1, G.726, G.722, OPUS
When an H.323 device initiates communication with a remote H.323 device and when H.245 communication is established between the two entities, the Terminal Capability Set (TCS) message is the first message transmitted to the other side.
Master/Slave Determination
After sending a TCS message, H.323 entities (through H.245 exchanges) will attempt to determine which device is the "master" and which is the "slave." This process, referred to as Master/Slave Determination (MSD), is important, as the master in a call settles all negotiation conflicts between the two devices.
Logical Channel Signaling
Once capabilities are exchanged and master/slave determination steps have completed, devices may then open "logical channels" or media flows. This is done by simply sending an Open Logical Channel (OLC) message and receiving an acknowledgement message. Upon receipt of the acknowledgement message, an endpoint may then transmit audio or video to the remote endpoint.
H.323 Fast Connect and H.245 Tunneling
Fast Connect (FC)
The first thing to improve was the rather long H.245 message handshake and the support of "early media". Early media is a term used for the setup of RTP channels between the communication partners before the call has been accepted (Connect) by both endpoints. This feature is used to play announcements or dial tones to the waiting caller.
As shown in the upper right picture, the OLC (Open Logical Channel) message, an H.245 message, is sent encapsulated in an H.225 message (Connect, Alerting). The drawback of Fast Connect is that it can be used only in a homogeneous environment (all devices are compatible). By sending the OLC without initially checking the capabilities (TCS) of the remote terminal, it is assumed that both terminals support the same set of capabilities.
As shown in the picture above, the RTP stream (red) goes directly from endpoint to endpoint. However, some customer scenarios require the voice stream to pass through the PBX (e.g. firewall traversal). innovaphone defined the term 'Media Relay' for gateways working in this mode. The redirection of voice data through the PBX has one main disadvantage: it creates a high CPU load on the 'relaying' gateway. Therefore this option is usually off by default and must be enabled manually.
H.245 Tunnelling
This method is the logical enhancement of the Fast Connect procedure, since it encapsulates not only the OLC messages but every H.245 message in H.225 messages. As a result, the separate H.245 TCP connection between the conversation partners is not needed. This saves processing power as well as TCP sockets on the innovaphone hardware, and also eases firewall traversal.
Extended Fast Connect (EFC)
EFC speeds up the renegotiation of logical channel attributes during a conversation (e.g. a terminal is put on hold and receives MoH). Instead of running through the complete H.245 handshake process, a change of RTP attributes is done by sending a single OLC message to the remote endpoint. Upon its receipt, the terminal will close the old logical channel and open a new one using the newly obtained parameters.
Each of these enhancements to the original H.323 protocol is implemented by innovaphone. To improve interoperability with 3rd party vendors, it is possible to disable FC and H.245 tunnelling at the GW interface.
H.323 - TCP/TLS
An additional advantage of using one TCP connection is that it can be encrypted using TLS.
The drawback of using H.323 over TCP/TLS is that RAS Gatekeeper Discovery cannot be used. Since the phone/endpoint cannot discover its gatekeeper (PBX) in the LAN, each endpoint must be configured with the IP address of its gatekeeper. If a larger number of endpoints must be supplied with the gatekeeper IP address, it can be distributed to innovaphone phones by DHCP.
SIP
SIP works in concert with several other protocols and is only involved in the signalling portion of a communication session. SIP is a carrier for the Session Description Protocol (SDP), which describes the media content of the session, e.g. what IP ports to use, the codec being used etc. In typical use, SIP "sessions" are simply packet streams of the Real-time Transport Protocol (RTP). RTP is the carrier for the actual voice or video content itself.
SIP is similar to HTTP and shares some of its design principles: It is human readable and request-response structured. SIP shares many HTTP status codes, including the familiar '404 not found'.
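Because SIP is plain text, the first line of a response can be taken apart with ordinary string handling. The following is a minimal sketch; the helper names `parse_status_line` and `response_class` are made up for illustration, and the class names follow the usual 1xx to 6xx grouping of SIP responses.

```python
def parse_status_line(line: str):
    """Split a SIP status line such as 'SIP/2.0 404 Not Found'."""
    version, code, reason = line.strip().split(" ", 2)
    if version != "SIP/2.0":
        raise ValueError(f"not a SIP/2.0 status line: {line!r}")
    return int(code), reason


def response_class(code: int) -> str:
    """Map a status code to its response class by its first digit."""
    classes = {
        1: "provisional",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
        6: "global failure",
    }
    return classes[code // 100]


for raw in ("SIP/2.0 180 Ringing", "SIP/2.0 200 OK", "SIP/2.0 404 Not Found"):
    code, reason = parse_status_line(raw)
    print(code, reason, "->", response_class(code))
```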
SIP Architecture
- the User Agent Client (UAC), which sends SIP requests and receives the responses,
- the User Agent Server (UAS), which responds to SIP requests sent by the peer.
SIP also defines server network elements. Although two SIP endpoints can communicate without any intervening SIP infrastructure, which is why the protocol is described as peer-to-peer, this approach is impractical for a public service. There are various implementations that can act as SIP servers:
Proxy Server:
A Proxy Server is responsible for routing incoming call requests to the intended recipient.
Upon receiving a call (msg. 1) from one UA (Alice), the SIP Proxy looks up the address of the callee (Bob) at the registrar responsible for this domain (msg. 2). The server then creates a new SIP session to Bob and forwards signalling messages between both endpoints. This corresponds to the "gatekeeper routed" mode in H.323, as the proxy always remains in the communication path.
Registrar Server:
innovaphone has combined the SIP Proxy server and the Registrar server functionality in the PBX software. As the PBX is also H.323 capable (Gatekeeper), it can communicate simultaneously with both H.323 and SIP clients.
SIP - Signalling
The most important requests are INVITE and ACK, used for call establishment, and BYE, used for call termination. The number of responses is rather large; the most frequently used one is 200 OK, which confirms that a request succeeded.
Have a look at these Wikipedia articles for a complete list of the SIP request methods and SIP response codes.
SIP Call Establishment and Call Termination
Call Establishment
As shown in the picture above, Alice initiates the call by sending an INVITE message (1) to the proxy server of the domain abc.com. The server answers Alice's INVITE with a 100 Trying message (2), indicating that the request is being processed. In order to route the call to the correct IP endpoint, the proxy server must request Bob's current location (3) from the registrar responsible for the abc.com domain. After successfully completing the IP address lookup, the proxy forwards the INVITE message (4) to Bob's phone.
The called UA responds with a 180 Ringing message (5), indicating that the call request was accepted and the phone is ringing. Upon receiving the 180 Ringing message, the proxy server forwards it to Alice's UA. Finally, a 200 OK message (6) is sent to the server when Bob picks up the phone. As with the previous 180 Ringing, this response is also forwarded to Alice. Alice confirms receipt of the 200 OK with an ACK message (7).
As shown in the graphic above, certain messages (INVITE, 180 Ringing and 200 OK) carry an additional encapsulated SDP packet. The SDP part is used to negotiate the RTP parameters for this session.
The SDP in the INVITE contains a list of the UA's supported codecs and the IP address and port number used to receive RTP packets. It is up to the called UA to choose the codec to use for the session. The codec selection is done by comparing the received codec list with its own list and selecting a preferred codec from those common to both lists.
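The codec selection can be sketched in a few lines. The example below is a simplified illustration, not a complete SDP parser: the sample offer, the addresses and the helper names `parse_audio_offer` and `choose_codec` are assumptions, only a single IPv4 audio stream is handled, and the called UA simply picks the first entry of its own preference list that the caller also offered.

```python
SAMPLE_OFFER = """\
v=0
o=alice 2890844526 2890844526 IN IP4 192.168.1.10
s=-
c=IN IP4 192.168.1.10
t=0 0
m=audio 16384 RTP/AVP 9 8 0 18
a=rtpmap:9 G722/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:18 G729/8000
"""


def parse_audio_offer(sdp: str):
    """Return (RTP address, RTP port, offered codec names in order)."""
    address, port, payload_types, names = None, None, [], {}
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("c=IN IP4 "):
            address = line.split()[2]
        elif line.startswith("m=audio "):
            parts = line.split()
            port = int(parts[1])
            payload_types = parts[3:]       # payload type numbers, in offer order
        elif line.startswith("a=rtpmap:"):
            pt, encoding = line[len("a=rtpmap:"):].split(" ", 1)
            names[pt] = encoding.split("/")[0]
    codecs = [names[pt] for pt in payload_types if pt in names]
    return address, port, codecs


def choose_codec(offered, local_preference):
    """Pick the called UA's most preferred codec that was also offered."""
    for codec in local_preference:
        if codec in offered:
            return codec
    return None  # no common codec: the call cannot carry media


address, port, offered = parse_audio_offer(SAMPLE_OFFER)
print(address, port, offered)
print(choose_codec(offered, ["PCMA", "G729"]))
```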
Using the exchanged SDP information, two RTP channels (one for each direction) are set up between the UAs. The RTP packets do not pass through the SIP proxy but go directly between the communicating endpoints.
Call Termination
Eventually one of the conversation participants (in our example Alice) will hang up, resulting in a BYE message (8) being sent to the SIP proxy. The message is forwarded to Bob and confirmed using the 200 OK (9) response. Both UAs will now clear their RTP channels and return to idle mode.
H.323 vs. SIP
The H.323 specification was written by the ITU-T (International Telecommunication Union), a group of telecommunications specialists who also developed the ISDN standard. H.323 therefore relies heavily on message formats and structures used in ISDN (i.e. Q.931) and offers good interoperability with PSTN networks.
The SIP standard was introduced by the IETF (Internet Engineering Task Force), a group of network protocol specialists also responsible for other well-known network protocols like HTTP. This is also the reason for the similarity between HTTP and SIP request and response messages.
As it is now, both protocols have reached a mature state and are constantly developed and improved. It is very probable that both protocols will continue to coexist in the future.
For a complete comparison of H.323 and SIP, please have a look at this article.
Within the innovaphone PBX, H.323 should always be used.
SIP should be used in these cases:
- connection to a SIP provider
- external federation to non-innovaphone systems
- connection to 3rd party devices (even if the device supports H.323, as SIP interoperability is usually better than H.323 interoperability)
Media and NAT
Media data (voice) however is exchanged directly between the involved endpoints. One of the big issues with VoIP communication across private network boundaries is the successful transmission of media data. In many scenarios, RTP data is not transmitted correctly - resulting in so-called "one-way" audio issues.
To understand this problem and the approaches to fix it, let us have a look at a simple scenario. We have a PBX with 2 phones in a LAN.
During call setup, Phone A will send its own IP address to the PBX, which will in turn send it to Phone B. Phone B will answer with its own IP address. In the end, both phones know their respective peer's IP address and can send RTP data.
Fair enough. Now suppose Phone A is calling a destination in the PSTN instead. Again, Phone A will send its own IP address to the PBX, which will in turn send it to the SIP Provider Call Server, which will send it to the SIP Provider Media Gateway. The SIP Provider Media Gateway will answer with its own IP address. Unfortunately, although the phone and the media gateway again know their respective peer's IP address, RTP data cannot be exchanged in both directions.
Why that?
Both the Provider Call Server and the SIP Provider Media Gateway obviously are meant to be publicly accessible. The private PBX and phones of course are not. For this reason, they live in a private network, with private IP addresses not available from the internet. Although Phone A sends its own (private) IP address to the SIP Provider Media Gateway, and this gateway then sends its RTP data to this IP address, the data will never arrive at Phone A. Phone A however will send its RTP data to the public IP address of the SIP Provider Media Gateway and this will succeed. So we end up in one-way audio.
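Whether a signalled address has any chance of being reached from the internet can be checked with Python's standard `ipaddress` module. This is a small sketch; the helper name `is_private_address` and the sample addresses are chosen for illustration only (8.8.8.8 stands in for an arbitrary public address).

```python
import ipaddress


def is_private_address(address: str) -> bool:
    """True if the address is not routable on the public internet."""
    return ipaddress.ip_address(address).is_private


# Phone A would signal something like 192.168.1.10; a media gateway in the
# internet cannot route RTP to it.
for candidate in ("192.168.1.10", "10.0.0.5", "8.8.8.8"):
    print(candidate, "private" if is_private_address(candidate) else "public")
```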
We will look at some approaches to fix this issue in the next chapters.
NAT
The answer is twofold.
NAT
First of all, the Router in our scenario employs a technique known as Network Address Translation (NAT). Put simply, here is how it works:
- the PC sends an IP packet (an HTTP GET REQUEST in this case) to the Server
- the packet is sent to the Router which forwards it to the internet. However, it replaces the IP packets source address (which is the private IP address of the PC) with its own public IP address
- the router keeps track of this process (it remembers a NAT mapping)
- the IP packet arrives at the server
- the server sends its response to the source address received in the previous request packet (note that the PC does not even try to send its own IP address to the server)
- the response IP packet returns back to the router
- the router sees a packet sent to itself (remember, the server has sent the packet to the source address of the request packet, and that had been replaced by the router with its own public IP address)
- of course, the router has no real use for the packet. Fortunately, as it had kept a record of the previously allocated NAT mapping, it knows where the original request for this server came from. So it replaces the destination address of the response (which is currently its own public IP address) with the remembered private IP address of the PC and forwards it to the private network
- the response arrives at the PC
So this is how NAT works.
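The steps above can be sketched as a small simulation. It is a simplified model under stated assumptions: the class `NatRouter` and all addresses are hypothetical, and the sketch also translates port numbers (as most real routers do) so that several internal devices can share the one public address, whereas the description above only mentions the IP address.

```python
class NatRouter:
    """Toy NAT: rewrites source addresses outbound and restores them inbound."""

    def __init__(self, public_ip: str, first_port: int = 40000):
        self.public_ip = public_ip
        self.next_port = first_port
        self.mappings = {}  # public port -> (private ip, private port)

    def outbound(self, src, dst):
        """Replace the private source with the router's public address."""
        for public_port, private in self.mappings.items():
            if private == src:              # reuse an existing NAT mapping
                return (self.public_ip, public_port), dst
        public_port = self.next_port
        self.next_port += 1
        self.mappings[public_port] = src    # remember the NAT mapping
        return (self.public_ip, public_port), dst

    def inbound(self, src, dst):
        """Restore the private destination, or return None if no mapping exists."""
        private = self.mappings.get(dst[1])
        if dst[0] != self.public_ip or private is None:
            return None                     # unsolicited packet is dropped
        return src, private


router = NatRouter("203.0.113.1")
pc = ("192.168.1.20", 51000)
server = ("198.51.100.7", 80)

request = router.outbound(pc, server)       # the server sees the router's address
response = router.inbound(server, request[0])
print(request)
print(response)
```

A packet arriving for a port without a mapping is dropped, which is exactly what happens to RTP sent to an address the router has never seen in an outgoing packet.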
And why is the answer twofold? Well, it works because the server does not send the response to an IP address it has been told by the PC as part of the original request. It simply returns the response to the address where the request came from (as seen by the server).
NAT Detection
This is known as NAT Detection and works quite simply. Whenever the SIP Provider Media Gateway receives an RTP packet from the phone, it examines the IP source address of this packet and adjusts the IP address it sends its own RTP packets to accordingly. Effectively, as soon as the first RTP packet arrives, the gateway starts sending its own RTP to the public IP address of the router (remember that the router had replaced the RTP packet's IP source address with its own public IP address).
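The behaviour of the media gateway can be sketched as follows. This is an illustrative model only; the class `MediaGatewayLeg`, its method names and the addresses are assumptions, and a real implementation would typically also validate the packet before trusting its source address.

```python
class MediaGatewayLeg:
    """One RTP leg that corrects its send target from received packets."""

    def __init__(self, signalled_address):
        # Address taken from signalling (possibly a private, unreachable one).
        self.send_to = signalled_address

    def on_rtp_received(self, source_address):
        """Adopt the observed source address if it differs from the signalled one."""
        if source_address != self.send_to:
            self.send_to = source_address


leg = MediaGatewayLeg(("192.168.1.10", 16384))   # what the phone signalled
leg.on_rtp_received(("203.0.113.1", 40000))      # what actually arrived via NAT
print(leg.send_to)
```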
Private to Private Communication
Unfortunately, this only works if one end of the conversation has a public IP address. Let us have a look at a slightly more complicated scenario: an IP phone calls another IP phone in a different private network (this is known as an on-net call).
In this scenario, Phone A and Phone C will again send their own private IP addresses to their peers. Neither Phone A nor Phone C is now able to send an RTP packet to the other end. Therefore, the NAT Detection technique does not work. We end up with no media.
STUN
For this to work, they need to find out the public IP address of their own internet router. You may think: well, then let's simply configure it on the phones. Putting the administrative effort aside, this is not an option. You may want to review how NAT works. Somewhere in the process, the NAT router needs to keep track of the outgoing, mapped IP packet so it can forward the returning response to the proper internal device. If an RTP packet were simply sent to the remote internet router, this router wouldn't know where to forward it.
This is fixed by use of a STUN server.
The STUN server has a public IP address (it is usually run by your internet provider) and can be reached easily by both phones (in fact, both phones could have their own STUN server, doesn't make a difference).
The STUN client (i.e. the phone) sends a request to the STUN server asking "which source IP address do you see in my request?". The server simply reports back the IP source address it has seen in the request packet (which is the public IP address of the calling client's internet router). This way, the phones learn their own external IP address, which they can then send to their communication peers. At the same time, the STUN request has created a NAT map on the internet router. This map, which is intended by the router to map responses from the STUN server, is subsequently used by the remote peer as the destination for its RTP packets, which are then forwarded to the correct phone.
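On the wire, this question and answer is a small binary exchange. The sketch below builds a STUN Binding Request and decodes the XOR-MAPPED-ADDRESS attribute of a Binding Success Response, assuming the message layout of RFC 5389 and IPv4 only. No network traffic is sent: the response is fabricated locally for the demonstration, and the function names and addresses are made up for illustration.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(transaction_id: bytes = b"") -> bytes:
    """20-byte STUN header: type, length 0, magic cookie, transaction ID."""
    transaction_id = transaction_id or os.urandom(12)
    return struct.pack("!HHI12s", BINDING_REQUEST, 0, MAGIC_COOKIE, transaction_id)


def encode_xor_mapped_address(ip: str, port: int) -> bytes:
    """Attribute as a server would send it (used here to fake a response)."""
    x_port = port ^ (MAGIC_COOKIE >> 16)
    x_addr = struct.unpack("!I", socket.inet_aton(ip))[0] ^ MAGIC_COOKIE
    return struct.pack("!HHBBHI", XOR_MAPPED_ADDRESS, 8, 0, 0x01, x_port, x_addr)


def parse_binding_response(message: bytes):
    """Return the (ip, port) the server saw as the source of our request."""
    msg_type, length, cookie, _tid = struct.unpack("!HHI12s", message[:20])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        raise ValueError("not a STUN Binding Success Response")
    offset, end = 20, 20 + length
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", message[offset:offset + 4])
        value = message[offset + 4:offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS:
            _reserved, family, x_port, x_addr = struct.unpack("!BBHI", value[:8])
            if family != 0x01:
                raise ValueError("only IPv4 is handled in this sketch")
            port = x_port ^ (MAGIC_COOKIE >> 16)
            ip = socket.inet_ntoa(struct.pack("!I", x_addr ^ MAGIC_COOKIE))
            return ip, port
        offset += 4 + attr_len + (-attr_len % 4)   # attributes are 32-bit aligned
    raise ValueError("no XOR-MAPPED-ADDRESS attribute found")


request = build_binding_request()
attribute = encode_xor_mapped_address("203.0.113.1", 40000)
fake_response = struct.pack("!HHI12s", BINDING_SUCCESS, len(attribute),
                            MAGIC_COOKIE, request[8:20]) + attribute
print(parse_binding_response(fake_response))
```

In a real client, `request` would be sent over UDP from the same socket that later carries RTP, so that the NAT map created by the request is the one the peer's media arrives on.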
NAT Types
You might have sensed it: this STUN trick is rather an abuse of the NAT mechanism. After all, the router creates a NAT map so that the internal client (the phone) can communicate with the STUN server. However, the map is later used to facilitate communication with somebody else: the remote phone.
You guessed it, this is asking for trouble. In fact, there are various more or less restrictive NAT implementations. Some don't bother and let the two phones communicate. Others, however, detect the abuse and block it. The phones again end up in a no-audio or one-way audio situation.
TURN
As the full name Traversal Using Relays around NAT suggests, the idea here is to relay media data through an entity (the TURN server) such that the media safely can traverse NAT boundaries.
One of the issues with the STUN solution we discussed before was that the NAT map created when querying the STUN server was later misused to relay RTP media through the NAT router (which may or may not work, depending on the router implementation).
The solution is that a TURN server is installed. All clients would
- contact the TURN server prior to the call
- thereby open a NAT map towards the TURN server
- send and receive their own media to/from the TURN server (tunnelled within the connection to the TURN server)
The TURN server acts like a normal RTP endpoint on behalf of the respective client phone.
Of course, both phones can use the same TURN server.
Note that SIP providers usually do NOT provide a TURN server (as this is a costly resource that requires both CPU and bandwidth and it is not needed for SIP trunk operation as we have seen). You will need to provide it yourself most probably. However, hosted PBX providers usually will provide it, as their service requires it.
Internal TURN Server
You may now ask yourself: how could I provide TURN services? This is asking too much of my local IT! Surprisingly enough, it is rather simple. Even if you have no possibility to provide a server located in the internet, it is possible.
The key is that the TURN protocol is available over TCP. This way, you can provide a TURN server that actually sits in your local network, just like you would provide a web server. You simply configure a static NAT map in your internet router that points towards your TURN server.
The innovaphone PBX itself includes a TURN server implementation. So we can simplify the setup even more.
ICE
- the local (private) IP address of the phone which is good for conversations to another phone in the same private network
- the external address of the internet router (the address detected by a STUN server) which is good for conversations to peers in the internet
- the address of a TURN server which is good for conversations to peers in foreign private networks
So when an endpoint establishes a call, it has to decide which IP address to expose to the remote peer. Unfortunately, the endpoint usually cannot make this decision correctly. In order to make a good decision, the endpoint would need to know the details of the network path to the remote peer, and these are usually not known to the endpoint.
This issue is addressed by Interactive Connectivity Establishment, or ICE. The idea behind ICE is that instead of publishing a single IP address, the endpoint announces a whole set of addresses which may or may not be useful for media transmission. These addresses are known as candidates. The peer therefore receives a set of possible candidates and in turn responds with a set of its own candidates. The endpoints then exercise all possible combinations of these candidates and try to send and receive media data (don't worry, the choices are tried in parallel, so it won't take too long).
Moreover, the candidates are tagged with their type, so that the peers can prefer more desirable candidates. Just think of an internal call between two phones on the same LAN. For sure, using the TURN server would work. But it would be clearly undesirable, as a simple intra-LAN communication (hence using the phones' respective private IP addresses) would do.
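How the candidate type translates into a preference can be illustrated with the priority formula from the ICE specification (RFC 8445). The sketch below uses the type preference values recommended there (host 126, server reflexive 100, relayed 0); the local preference of 65535, the component ID 1 for RTP, the function name and the sample addresses are assumptions for illustration.

```python
TYPE_PREFERENCE = {
    "host": 126,    # the phone's own (private) address
    "srflx": 100,   # server reflexive: public address learned via STUN
    "relay": 0,     # relayed address allocated on a TURN server
}


def candidate_priority(candidate_type: str, local_preference: int = 65535,
                       component_id: int = 1) -> int:
    """priority = 2^24 * type pref + 2^8 * local pref + (256 - component ID)."""
    return ((2 ** 24) * TYPE_PREFERENCE[candidate_type]
            + (2 ** 8) * local_preference
            + (256 - component_id))


candidates = [
    ("relay", "198.51.100.50", 50000),
    ("host", "192.168.1.10", 16384),
    ("srflx", "203.0.113.1", 40000),
]
ranked = sorted(candidates, key=lambda c: candidate_priority(c[0]), reverse=True)
for kind, ip, port in ranked:
    print(candidate_priority(kind), kind, ip, port)
```

The host candidate ranks highest, so two phones on the same LAN end up talking directly, while the TURN relay is only chosen when nothing better passes the connectivity checks.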
We end up with a scenario, where all techniques (STUN, TURN) are available and the right choice is done with ICE:
The ICE information (candidates) is exchanged between the endpoints using the signalling connections (H.323/H.245 and SIP/SDP) in this case.
Summary
Signalling protocols:
- H.323
- SIP
Media protocols:
- RTP
The recommended signalling protocol in most innovaphone environments is H.323.
Recommended web links:
H323 vs SIP