Course12:Advanced - Redundancy

There are also other versions of this article available: Course11 | Course10 | Course12 (this version)

This book covers various PBX redundancy issues.

Overview

Redundancy is about making sure that users can always call, even if part of the infrastructure fails. Sounds easy, but can get pretty complicated at times!

So let's see what it takes to make a call.

First of all, for a user to call, he needs a phone. If the phone fails, it can be replaced, or the user can use someone else's phone. That was easy.

Now let us get a little more serious and have a look at a more complete picture. Obviously, in addition to the phone, it takes a PBX where the phone is registered and either another IP phone registered with this PBX or - if external users are called - a trunk line.

In addition to what we have seen before, we need to cater for the event that a PBX breaks down as well as for the event that a trunk line breaks down.

To make sure there is always a PBX the phones can register with, we can duplicate the PBX using a so-called standby-PBX. We have actually discussed this in the basic training already. Looking at the picture, we see that both the PBX and the standby-PBX somehow need to be connected to the trunk line. In many cases, there is only one trunk line, so this one line must somehow be connected to both. And even if you have a trunk line bundle, you probably want to be able to use all of the lines, no matter whether the PBX or its standby is running. We will see how this works in the next chapter. Btw: As far as the trunk line is concerned, it does not matter if this is a traditional TDM line or one of these fancy new fully-virtual SIP trunks. The issues are more or less similar.

Then again, one beauty of using VoIP for telephony is that it allows us to deal flexibly with remote locations. In other words, most often, a remote location is involved and thus a connection through a WAN. This leads to an even more complicated scenario.

Having a standby-WAN is usually not an option, and even if so, you would learn how to do it in a course of your WAN-gear vendor, not here. However, in many cases, WAN issues can be overcome by routing calls over PSTN temporarily. Also, the other way round is true! PSTN issues can often be dealt with by routing calls destined to the PSTN via WAN to a secondary trunk line.

Of course, in all these scenarios, the LAN is critical too. innovaphone gear with 2 Ethernet interfaces features redundant Ethernet access with RSTP, but this will not be discussed in this book.

Now let us see how things can go wrong and how to fix them.

Using a local Trunk

Let us start with the simplest scenario. A user calls out through the local trunk line.

When the local PSTN trunk fails, it is often desirable to re-route calls to a central trunk line. This is usually connected to the master PBX. The user's call would thus go to the master and out through the remote trunk line.


Using a remote Trunk

Some customers, however, request it just the other way round. Users should always call out through a central trunk.

Only if this fails shall the user's local trunk be used to call out.





Calling a remote Slave

Calling a remote slave involves three PBXs. As there is no direct connection between the two slaves, the call will flow from the calling slave through the master to the called slave.

In this scenario, there are two possible failures. First of all, the local WAN access may have broken down. In this case, the call should be re-routed through the local PSTN trunk.

However, even if the local WAN access is OK, the master's connection to the called slave may be down. In this case, there are two reasonable options:

More on WAN Outage

You may think that a WAN outage is something that should be handled by the WAN (e.g. through redundancy or otherwise ensuring high availability).

While this is certainly a nice idea, WAN links always have limited bandwidth. Sometimes, bandwidth is large enough that you simply disregard it (after all, you do not care for the case where your LAN is oversubscribed). But most times, WAN only allows a limited number of calls to be sent in parallel. So WAN outage also happens (as seen from a call's perspective), when the WAN can not provide the bandwidth for a particular call.

Assuring Access to the Trunk Line

The standby PBX concept for scenarios having only one PBX (master) is described in the previous Standby PBX book. If not yet done, please take some time and read the book.

Here comes a short summary of the previously mentioned book. The role of the standby PBX is to handle incoming registrations in case the master PBX is not working. Such an installation is pretty simple, since you have only one master and one standby PBX. The standby PBX is an exact copy of the master PBX and takes over if and only if the master PBX is not available (although the gateway level of the box running the standby PBX is always active).

However, in most cases you will still have to ensure that your PBX users are able to receive and make calls to the PSTN. In other words, you must provide for redundancy of your PSTN trunk line. Depending on whether you are using gateways with the loop-in feature (i.e. IP6010, IP810, IP3010), you can use all of your ISDN lines in the standby situation or just a part of them.

Assume a scenario where a customer has 2 physical trunk lines and asks for a redundant solution. In this case, the 2 trunk lines can be connected to both gateways, one on each. In normal operation, both trunks are available. In the standby case, only one (PRI#2) is still available.

When using gateways with support for looped-in configurations as PBX platforms, a solution can be implemented that keeps both trunk lines available when the master PBX fails. In this case, when the box the trunk is physically attached to fails, the wires are physically looped through to the standby device.

If you have a single trunk line, you may use a combination of both types to make sure the trunk line is always usable.

Assuring Registration in a Master/Slave scenario

Let us look closer at PBX-redundancy in a master/slave installation.

Each registration takes place at the PBX defined in the PBX property of the PBX object the device uses for registration. Let us have a look at the Master/Slave Scenario.

Phone P2 will send its registration request to the local PBX S1. If this is in fact the registration PBX defined for P2, then we're done. However, if P2 has a different registration PBX set (say, the slave PBX S2), then the request would be routed to the master PBX M and further to S2 (thereby setting P2's physical location to S1; you may want to read further in chapter Physical Location of the Distributed PBX).

The basic principle of how registration redundancy is implemented is very similar to the Master/Standby solution. Just like an endpoint registers with the PBX, the slaves register at their corresponding PBX object at the master PBX. As with all H.323 registrations, this registration is refreshed every 2 minutes. If the re-registration fails and there is a secondary gatekeeper (e.g. the master's standby PBX) defined and available, the slave will register there. However, if not, the slave assumes that the master is down. In this case it checks if it has the proper standby license. If this is true, it will accept registrations for all known objects (i.e. users).
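The slave's decision logic described above can be summarised in a short Python sketch. This is an illustration only, not innovaphone firmware code: the `SlaveState` fields, the function name and the returned mode strings are hypothetical names chosen for this example, and the "no-takeover" result is merely a label, as the text does not describe that case any further.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlaveState:
    master_reachable: bool               # did the last re-registration at the master succeed?
    standby_reachable: Optional[bool]    # None: no secondary gatekeeper defined
    has_standby_license: bool


def slave_mode(state: SlaveState) -> str:
    """Return what the slave does after a (re-)registration attempt."""
    if state.master_reachable:
        return "registered-at-master"
    if state.standby_reachable:
        # Secondary gatekeeper (e.g. the master's standby PBX) is defined and available.
        return "registered-at-standby"
    # Neither master nor standby answered: the slave assumes the master is down.
    if state.has_standby_license:
        return "accepting-registrations-for-known-objects"
    return "no-takeover"  # hypothetical label, not described in the text


print(slave_mode(SlaveState(False, None, True)))
```

Note that "known objects" matters here: a partially replicated slave can only take over for the objects it actually holds.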

Looking at above scenario (P2 tries to register with S2), P2's registration will actually take place at S1 whenever the master and its standby are not available.

Likewise, when the master M knows that S2 is unavailable, P2's registration redirected from S1 to M is accepted by the master and thus P2 is registered with M (again, subject to proper licensing).

The number of PBX objects known to a slave PBX is the major difference to the master/standby scheme. While a standby is always a full replica of its primary, a slave PBX often only wants to know its own users and is thus only partially replicated.

Care must be taken how the registering device's Primary Gatekeeper Address and Secondary Gatekeeper Address are configured. When both PBXs designated as primary and secondary registration target are local to the device's LAN segment, no configuration is required, as both will be found using gatekeeper discovery. If this is not the case, you need to configure both options (usually the local slave and the master) properly.

One further remark about gatekeeper discovery: discovery is used when no gatekeeper is configured. As there are 2 properties to define the gatekeeper (Primary Gatekeeper Address and Secondary Gatekeeper Address), there are actually the following options:
       
  • both properties are left empty: discovery will be used
  • Primary Gatekeeper Address is left empty and a Secondary Gatekeeper Address is set: discovery is done and if - after a timeout - no GK has been discovered, the secondary is tried

Gatekeeper Discovery does not work with TCP based registration methods (H.323/TCP, H.323/TLS, TSIP, SIPS)!

Also, care must be taken that the designated secondary PBX in fact can handle both the number of PBX object definitions required and the number of registrations required in the standby-situation.

Redundancy for the master PBX


The most common method of securing the master PBX is a dedicated standby PBX.

In case the additional costs for a dedicated standby PBX are too high, you can use one of your slave PBXs as standby PBX. If the slave PBX is equipped with standby licenses, it will work as a standby PBX in case the master PBX is down. However, you must consider the number of users that will register at your slave PBX in the standby case.

Figure: standby-using-slave


In the example above, the slave PBX is perfectly able to work in normal operation mode. However, in the standby case, the slave PBX, i.e. the IP411LEFT, is undersized and cannot handle the additional 1000 registrations. You have to consider this point when designing redundancy scenarios using slaves. You could instead use an IP6010 as slave PBX: it would be oversized for normal operation but could handle the additional registrations.

Redundancy for the slave PBX

Ok, now you have secured the master PBX. But what if one of the slave PBXs malfunctions? In this case you will want to have another PBX accepting the registrations of the slave PBX users. This could be any PBX in your setup, in most cases it is probably the master PBX. In innovaphone speak this scenario is called n+1 redundancy. The reason is that the master PBX must be able to handle its own users and the users of the biggest slave in the installation.

Figure: nplus1-master


The n+1 installation example above shows that the master PBX has been chosen too small. It can handle its 500 users during normal operation but will break down if it has to accept the standby registrations from the slave IP810. As a result the master PBX must be able to handle 1000 users, i.e. a more performant gateway, like an IP6010, is needed as master in this case.
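The n+1 sizing rule (own users plus the users of the biggest slave) is simple arithmetic. The following Python sketch uses hypothetical function names and assumes, as the figure of 1000 users implies, that the slave in the example carries 500 users.

```python
from typing import Sequence


def required_master_registrations(master_users: int, slave_users: Sequence[int]) -> int:
    """Registrations the master must handle if its biggest slave fails."""
    return master_users + max(slave_users, default=0)


def is_undersized(capacity: int, master_users: int, slave_users: Sequence[int]) -> bool:
    """True if a master with the given capacity cannot cover the n+1 case."""
    return capacity < required_master_registrations(master_users, slave_users)


# Example from the text: 500 users on the master, one slave with (assumed) 500 users.
print(required_master_registrations(500, [500]))  # 1000
print(is_undersized(500, 500, [500]))             # True
```

With several slaves, only the biggest one counts, since n+1 redundancy covers the failure of a single slave at a time.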

Perhaps you can take some time and have a look at the different PBX modes. You will discover a fourth mode called Standby-Slave. This mode is used for configuring a dedicated standby PBX for a slave. In case the slave goes down, the standby-slave will take over and register itself at the master PBX. The standby-slave is mostly used in scenarios with large slave PBXs.

Figure: standby-slave



By using a dedicated standby PBX for your slave, you prevent the registrations of 1000 IP phones from going over the WAN connection, which could create bandwidth problems. Secondly, you preserve the 2 PRI lines in the slave location and thereby prevent a bottleneck on the master's ISDN lines.

Redundancy for WAN breakdown


In case the WAN connection between the master and the slave PBX is interrupted, you will want the slave PBX to handle all registrations of phones in its remote location. As stated at the beginning of this book, a slave PBX will accept registrations for known users if the master is not reachable. The problem is that in some cases it might not know about its local users. This might sound confusing at this point, so let's look at two common scenarios in order to understand the problem and its solution.

User normally registers at slave PBX


If the users register at the slave PBX during normal operation, the standby operation won't create problems. The PBX objects are assigned to the slave PBX and thus are replicated from the master PBX. If the WAN connection is interrupted, the local phones will remain registered at the local slave PBX. The local users will be able to talk to each other and use the available trunk lines to the ISDN. Communication with other remote locations is only possible via ISDN.

User normally registers at master PBX


Sometimes it is necessary that some or all users register at the master PBX (groups for waiting queues could be one such reason). Consider an example where we have a master and a slave PBX. The slave PBX, as it is smaller and less capable than the master PBX, is only partially replicated.

Figure: slave user registers at master

User2 is configured to register with the slave, so the PBX property is set to the slave. Because of this, user2's PBX user object is replicated to the slave (if not, registration would not be possible of course). When the slave fails, its secondary gatekeeper is set to the master so registration takes place there. Pretty straightforward.

User1 is different. It is configured to register with the master. Still, it sends its registration to the slave, which re-routes it to the master, adding physical location information. The master of course takes the registration. Normally, user1's PBX user object is not replicated to the slave (as it does not have the slave set as PBX property in the user object settings). However, due to the fact that the master receives a registration for user1 from the slave, it initiates a replication of the PBX user object to the slave. When the master fails, the slave now has the user object for user1 at hand and can take the registration on behalf of the master.

You could be tempted to simplify the setup by configuring the master directly as primary gatekeeper and the slave as secondary gatekeeper for user1. This however will leave the slave without the user object of user1 when the master fails (as no replication has been triggered) and the slave would be unable to take user1's registration in the standby-case! Also, doing so would leave user1 with no known physical location in normal operation, which might cause routing issues (more on routing calls can be found in the book on distributed PBXs).
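The three cases discussed above (user2, user1 registering via the slave, and user1 registering directly at the master) differ only in whether the slave ends up holding the user object. A minimal Python sketch of that rule follows; the class, field and function names are hypothetical, and the model deliberately ignores everything except the two replication triggers named in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserObject:
    name: str
    pbx: str               # PBX property of the user object
    registration_via: str  # PBX the phone sends its registration request to


def known_to_slave(user: UserObject, slave: str = "slave") -> bool:
    """True if the partially replicated slave holds the user object."""
    assigned = user.pbx == slave                # replicated because of the PBX property
    triggered = user.registration_via == slave  # replicated because the master saw a
                                                # registration forwarded by the slave
    return assigned or triggered


user2 = UserObject("user2", pbx="slave", registration_via="slave")
user1 = UserObject("user1", pbx="master", registration_via="slave")
user1_direct = UserObject("user1", pbx="master", registration_via="master")

for u in (user2, user1, user1_direct):
    print(u.name, u.registration_via, known_to_slave(u))
```

Only in the last case is the slave left without the object, which is why it cannot take user1's registration when the master fails.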

WAN Overflow

If you have a limited WAN bandwidth, you have to think about overflow situations for VoIP calls going from and to a location.

Let's have a look at a situation with master and slave and a limited WAN bandwidth. In this scenario, there is limited bandwidth available between master and slave and you have calculated that the bandwidth is good for 5 parallel calls.

What happens now with a sixth call?

If no overflow mechanism is configured, it will simply be established, and all six calls will compete for the available bandwidth. So not only will this call suffer from insufficient bandwidth, all others will too. Of course, you may consider using a low-bandwidth codec. However, this reduces quality and also does not really solve the problem (as even then, after some more calls, the same problem will occur).

Limiting extraneous Calls

It is thus important to block any excess calls, as it is better to have 5 good calls than 6 bad ones. To do this, you need to configure the maximum number of allowable calls.
With this configuration, the extraneous call will be blocked.



You think this is not a big deal? Remember that with this configuration, only one user may be frustrated, but the 5 happy callers are not. Without the configuration, all 6 will likely be frustrated.
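The effect of such a limit can be shown with a minimal Python sketch. This is not how the PBX is implemented; `WanCallLimiter` and its methods are hypothetical names, and the limit of 5 is the figure from the scenario above.

```python
class WanCallLimiter:
    """Admits calls over the WAN until a fixed maximum is reached."""

    def __init__(self, max_calls: int) -> None:
        self.max_calls = max_calls
        self.active = 0

    def try_setup(self) -> bool:
        """Return True if the call is admitted, False if it is blocked."""
        if self.active >= self.max_calls:
            return False
        self.active += 1
        return True

    def release(self) -> None:
        """Free one slot when a call ends."""
        if self.active > 0:
            self.active -= 1


limiter = WanCallLimiter(max_calls=5)
results = [limiter.try_setup() for _ in range(6)]
print(results)  # [True, True, True, True, True, False]
```

The first five calls keep their bandwidth; only the sixth is refused, and it is admitted again as soon as one of the others is released.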

Rerouting calls through PSTN

We have seen that blocking extraneous calls makes sense and is thus recommended. However, smart customers may say: "so when all WAN resources are used, why not re-route the call through the PSTN?".

Right so!

Let us first look at the situation on the slave. In the Slave PBX area of the slave's PBX/Config page, there is a Route Master calls if no Master to property. This tells the PBX to re-route calls that cannot be sent to the master due to WAN issues to the given object. Please note that this is one of the occasions where you must enter the Long Name of a PBX object, instead of its Name (a.k.a. H.323-id). These calls are then routed to the active registration on the object - which implies that there must be a registration on it.

For the other direction (master to slave), the overflow is configured as a Call Forward on Busy (CFB) set on the PBX object representing the slave on the master.

With this configuration, when the number of allowable calls is reached, extra calls will be re-routed.



WAN Outage vs. Overflow

Although from a call's point of view, it doesn't matter whether it can't go through the WAN because it is overloaded or because the WAN is somehow broken, the PBX differentiates these 2 situations. So generally, WAN outage is not WAN overflow!

When configuring the re-routing object in the slave (for slave-to-master re-routing), it will normally be used both on WAN overflow and on WAN outage. However, you can disable it by checking the No Reroute check-mark next to the Max Calls to Master property. If so, when the maximum number of calls is reached, extra calls are rejected, not re-routed.

For the other direction (master to slave), a CFB configured on the slave object is only used for WAN overflow. To re-route in a WAN outage situation, a CFNR must be configured (alone or in addition to the CFB). You may recall that in a WAN outage scenario, the unavailable slave will lose its registration with the master. At this point, the master will know that the slave is not reachable and thus will trigger the CFNR without waiting for the no-response timeout.
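The difference between outage and overflow in both directions can be condensed into two small decision functions. The Python sketch below is an interpretation of the text, not PBX code: all names are hypothetical, it assumes that No Reroute only affects the overflow case (which is how the paragraph above reads), and it simply returns "reject" wherever no re-routing target is configured.

```python
from typing import Optional


def slave_to_master_action(wan_up: bool, active_calls: int, max_calls_to_master: int,
                           reroute_object: Optional[str], no_reroute: bool) -> str:
    """Decide how the slave treats a new call towards the master."""
    if not wan_up:  # WAN outage
        return "reroute" if reroute_object else "reject"
    if active_calls < max_calls_to_master:
        return "wan"
    # WAN overflow. Assumption: No Reroute disables re-routing for this case only.
    if reroute_object and not no_reroute:
        return "reroute"
    return "reject"


def master_to_slave_action(slave_registered: bool, active_calls: int, busy_on: int,
                           cfb: Optional[str], cfnr: Optional[str]) -> str:
    """Decide how the master treats a new call towards the slave."""
    if not slave_registered:  # WAN outage: slave lost its registration, CFNR applies
        return f"forward:{cfnr}" if cfnr else "reject"
    if active_calls < busy_on:
        return "wan"
    return f"forward:{cfb}" if cfb else "reject"  # WAN overflow: CFB applies


print(master_to_slave_action(True, 5, 5, cfb="00304711", cfnr=None))   # forward:00304711
print(master_to_slave_action(False, 0, 5, cfb="00304711", cfnr=None))  # reject
```

The second call shows why a CFB alone is not enough: on an outage only the CFNR is considered.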

Number-Handling when Re-Routing

The basic idea of the re-routing mechanism is that in case an internal extension in a remote location cannot be reached via VoIP, the PBXs are configured to route the call via the PSTN. In order for this to work, the PBX initiating the PSTN call must know the external PSTN number corresponding to the internal called extension.

This is not as trivial as it appears to be!

For the re-routing, we need to know the full subscriber number of the receiving PSTN line. If a slave does the re-routing, how could it know the numbers of the target PSTN line? Suppose a scenario with 4 locations. All of them have a PSTN trunk line. Slave 1 attempts to re-route a call to extension 40:




When re-routing a call on Slave 1, the PBX will in fact not even know the target location, let alone the target location's full PSTN subscriber number! This is because only the master knows all the objects in the whole system. Also, only the master PBX knows the PSTN numbers associated with each location (configured as CFB/CFNR on the slave objects).

The bottom line: all calls must be re-routed from slaves to the master's PSTN (with the called extension simply appended to the master's trunk subscriber number, +4970317300940 in our sample). The master will then re-route to the appropriate slave.

Still, there are some pitfalls.

In an ideal world, all locations have DDI-capable trunk lines that support as many extension digits as the overall numbering plan requires. In our above sample setup, the numbering plan has 2 digits (-0 to -99). Master and Slave 2 both have DDI trunks which support 2-digit extensions. However, Slave 1 only supports a single digit. Even worse, Slave 3 does not support extensions at all, as it has a point-to-multipoint trunk with 3 MSNs only. For both Slave 1 and Slave 3, it is thus not possible to call a specific extension directly through the PSTN. Usually, in such locations, this is handled by some gateway level routes, which map MSNs or DDIs of incoming calls to some local extensions. In such a case, you need to reserve a specific MSN or DDI number on the affected slaves which is mapped to a kind of (IVR or manual) switchboard service. This can then be used for re-routing purposes.

While this is fixable, things can get even worse!

Mapping Extensions to available external Numbers



Assume we have a structured (that is, multi-node) numbering plan (for numbering plans, have a look at the Distributed PBX book). In this case, when a user in one node calls an extension and this call has to be re-routed, it obviously does not make sense to just append the called extension to the target's trunk subscriber number. This is because on the target, which may reside in a different node of the node-tree, this extension may be interpreted differently and the wrong extension called.

To fix this, you have to make sure that for re-routing, the called party's full number from the root (including all node prefixes) is sent. This way, the right destination can be called on the remote side.

OK great, but how do we get at the called party's full number from the root when the calling user just calls the relative called party number seen from his own node?

The answer is: the (called and calling) party numbers sent to the re-routing object are the numbers as seen from the re-routing object. So if the re-routing object resides in the root node, then these numbers are exactly what we asked for: the called party's full number from the root. Of course, if we have a flat numbering plan and everything is thus in the root node, then it doesn't matter (as then the re-routing object obviously is in the root node too). However, it is a good habit to always have those re-routing objects reside in the root node, so you get at the full numbers.


How Re-Routing Numbers are used


You might have guessed it, we have a similar issue on the master's side when we use the CFx to handle re-routing to slaves.

There are two issues here.
  • How is the number set as CFB or CFNR interpreted?
  • How is the called extension number created which is appended to the CFB/CFNR?

The number set as call forwarding (CFB or CFNR) at the slave's PBX object is interpreted in the node context of the called node.

In fact, this is not very surprising, as call forwards set on any object are always interpreted in the context of the object (i.e. user) that has set the CFx. After all, if you set a call forwarding on your phone (that is, for your user object at the PBX), you would certainly expect the number given to be used just as it would be when you simply dial it from your phone.

So if the slave PBX's PBX object is in the root node (which is very often, but not necessarily the case), then the forwarding number is interpreted in the root node context.

So then, how is the called extension number (which is appended to the call forwarding number) created?

The answer is: appended is the number that needs to be dialled to reach the original destination from the node the PBX object defines.

Phew, what does that mean?

All root node example



Let us have a look at a scenario with a master, 2 slaves and all users in the root node.



There are 3 users (one in each location/PBX) and we have set the re-routing information at the PBX objects (please note that the CFx set on the master PBX object are just informational, they will never be executed (as the master will never try to call itself)). Also, there are trunks in all locations.

For some reason, both slaves are not reachable.

Now user calls user1.
When user (who is registered with master) calls user1 (who is registered with slave1), user1's extension (11) is appended to the CFNR configured at slave1 (00304711).

This is because the failing slave1 defines (as all PBX objects do) its own node slave1 which is (as per Parent Node property of slave1) in the root node. Seen from this node (slave1), to call user1 (which is in the root node) 11 is exactly what you need to call to reach user1.

Something very similar happens when user calls user2.

The only difference is that now the CFNR for slave2 is used (as user2 is registered with slave2).

As we can see, when all objects are in the root node, things are pretty straightforward - at the end of the day, the called extension is appended to the number given as CFx.

Multi-Node Example


Of course, things can get more complicated. Suppose a slightly modified scenario, where each location has its own node, that is, a non-flat numbering plan.

When user calls user1, then he obviously has to call *211 (as opposed to merely 11). This is because user and user1 reside in different nodes, so the node prefix must be dialled.

Surprisingly enough, this node prefix is not passed to the trunk during re-routing! Why this?

As we stated above, "appended is the number that needs to be dialled to reach the original destination from the node the PBX object defines". The "node the PBX object defines" in this case is slave1 (as we are calling user1 which is registered with slave1, so the re-route takes place at slave1). To call from node slave1 to another object in node slave1 (which is the node the called user1 is in) obviously does not require a node-prefix!

This is exactly what we like to see: user1's extension is sent to the slave1 location right away. So when the call finally gets to slave1-trunk at the remote side, the pure extension can (after having stripped the trunk's subscriber number as usual) be thrown directly into the PBX (as slave1-trunk is in the slave1 node, which is the node of the called user1). Again, very straightforward!

A more interesting case is when user calls user2. Both user and user2 are in the same node (master), so user simply calls 12 to reach user2. However, for some reason, user2 has been set to register with slave1. As this slave is unavailable, again the CFNR set at slave1 is used to re-route and the call is sent to 00304711.

The interesting part is that the extension appended now is *112, which is not what the user had dialled. The reason is that, despite the fact that user2 is registered with slave1, it is not in the slave1 node. Instead, it is in the master node. To reach an object in node master from node slave1, we need to use the node-prefix of master, which is *1. So we end up with *112 being received as extension when the call appears at the remote slave1-trunk.

This is the right thing to do, as just passing the called extension 12 would create a conflict with a possible object with number 12 in node slave1.
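The rule "appended is the number that needs to be dialled to reach the original destination from the node the PBX object defines" can be checked against the examples with a small Python model. It is a deliberately simplified sketch: the function names are hypothetical, every node is assumed to hang directly below the root node, and an object in the root node is assumed to be reachable from any node without a prefix. The prefixes *1 (node master) and *2 (node slave1) and the CFNR 00304711 are taken from the examples.

```python
from typing import Dict

# Node prefixes from the example: *1 for node master, *2 for node slave1.
NODE_PREFIX: Dict[str, str] = {"master": "*1", "slave1": "*2"}


def number_seen_from(node: str, target_node: str, extension: str) -> str:
    """Number to dial from `node` to reach `extension` in `target_node`."""
    if target_node in (node, "root"):
        return extension  # same node, or object in the root node: no prefix needed
    return NODE_PREFIX[target_node] + extension


def rerouted_number(cfx: str, pbx_object_node: str, target_node: str, extension: str) -> str:
    """Forwarding number followed by the extension as seen from the PBX object's node."""
    return cfx + number_seen_from(pbx_object_node, target_node, extension)


print(number_seen_from("master", "slave1", "11"))             # *211 (what user dials)
print(rerouted_number("00304711", "slave1", "slave1", "11"))  # 0030471111
print(rerouted_number("00304711", "slave1", "master", "12"))  # 00304711*112
```

The last line reproduces the surprising case: the number sent to the PSTN contains the node prefix of master, although the caller only dialled 12.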

Still, this scenario may create problems:
  • the PSTN may refuse calls with * or # in the called number
  • the PSTN may truncate the called number as it gets too long (depending on your provider)

WARNING V9 Upgrade

With later V9 and V10, the behaviour when re-routing as described in this book subtly changed (actually, before it was not as described here).

It is now more straightforward and easier to implement. However, multi-site installations can thus not easily be upgraded!

See Changed Semantics of Slave-PBX Rerouting in V9 HF24

WAN Overflow and E.164 Setups

In E.164 setups, WAN overflow is handled differently. See the book on E164 PBX Setups for details.

How Call Counters work

To implement the Busy On property in PBX/Objects/PBX and the Max Calls to Master property in PBX/Config/General, the PBX obviously needs to count calls.

While this sounds trivial (Hey, just count 'em!), it is not.

Counting Scheme



If there is a call from a slave to a master site (or vice versa), then we obviously have one call between these two sites. However, suppose a scenario where there is a master and two slaves. If a user from one slave location calls a user in the other slave location, then what calls do we have now? You could be tempted to think, well, we obviously have one call between the two slaves. But not so!

While this is true for the media stream between the two endpoints, it is not true for the signalling path between the two endpoints. Signalling goes from the calling slave to the master to the called slave. The PBX in fact is not aware of the details of the media stream, as this is negotiated autonomously between the calling endpoints. The only option for the PBX to count the calls is to count the signalling connections!

In our little scenario, we thus have
  • No call between the two slaves
  • One call between the calling slave and the master
  • One call between the master and the called slave
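Counting signalling connections rather than media streams can be illustrated in a few lines of Python. The function name and the PBX names are hypothetical; the sketch just counts every hop of a signalling path as one call on the link between the two PBXs involved.

```python
from collections import Counter
from typing import Iterable, List


def count_signalling_legs(paths: Iterable[List[str]]) -> Counter:
    """Count one call per PBX-to-PBX hop of each signalling path."""
    counts: Counter = Counter()
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[tuple(sorted((a, b)))] += 1  # link key is independent of direction
    return counts


# One call from a user at slave1 to a user at slave2, signalled via the master.
counts = count_signalling_legs([["slave1", "master", "slave2"]])
print(counts[("master", "slave1")])  # 1
print(counts[("master", "slave2")])  # 1
print(counts[("slave1", "slave2")])  # 0
```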

Of course you could think this is a problem, because what you in fact intend to limit is the RTP (that is, media stream) data between sites (as signalling data is just so little that it may as well be neglected). You may think of a network design where, in our scenario, no RTP data flows to/from the master at all, but a WAN link with narrow bandwidth is used between the 2 slaves.

While this in theory is true, in real life the simplified model (media stream follows signalling path) is good enough.

Call Limits


Ok, now that we have understood what actually is counted as a call, the question is: are all calls equal? And the answer is: yes. A call is a call is a call.

The PBX does not account for thin or thick media streams. As said before, media stream properties are negotiated by the calling endpoints and the PBX just does not know. Even worse, media streams may change during a call. For these reasons, the PBX again uses a somewhat simplified model: a call is a call.

So when you calculate the call limits, then you have to consider the worst case media stream requirements.
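As a worked example of the worst-case calculation, the following Python sketch derives a call limit from the bandwidth reserved for voice. The function name, the stream labels and all figures are illustrative assumptions, not values from this book; use the per-call bandwidth of the codecs you actually allow, including packet overhead.

```python
from typing import Mapping


def call_limit(wan_kbit_s: float, per_call_kbit_s: Mapping[str, float]) -> int:
    """Number of parallel calls that fit if every call uses the thickest stream."""
    if not per_call_kbit_s:
        raise ValueError("at least one media stream type is required")
    worst_case = max(per_call_kbit_s.values())
    return int(wan_kbit_s // worst_case)


# Illustrative figures only: 512 kbit/s reserved for voice, two possible stream sizes.
limit = call_limit(512, {"thick": 100, "thin": 30})
print(limit)  # 5
```

Dividing by the thin stream instead would suggest 17 calls, which the link could not carry once calls negotiate the thick one.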

NB: you can find a discussion of the various codecs and their bandwidth requirements in the VoIP Protocols book of your basic training.

Asymmetric Counting


The Max Calls to Master property in the PBX config works for calls from the slave towards the master. The Busy on property in the PBX object works for calls from master to slave but also for calls from slave to master. This allows for asymmetric configurations. Since the Busy on property is checked also for calls from slave to master, its value should always be greater-than-or-equal to the value of the Max Calls to Master property.

Currently there is no practical benefit from asymmetric counting, so all installations should have equal values for both counters.
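A configuration check for the two counters could look like the following Python sketch. It is a hypothetical helper, not part of any innovaphone tool; it only encodes the two statements above (Busy on should not be lower than Max Calls to Master, and equal values are recommended).

```python
from typing import List


def check_call_limits(max_calls_to_master: int, busy_on: int) -> List[str]:
    """Return findings for a pair of call limits; an empty list means fine."""
    findings: List[str] = []
    if busy_on < max_calls_to_master:
        # Busy on is also checked for slave-to-master calls, so it would cut in first.
        findings.append("Busy on is lower than Max Calls to Master")
    elif busy_on > max_calls_to_master:
        findings.append("asymmetric limits: no practical benefit, consider equal values")
    return findings


print(check_call_limits(5, 5))  # []
print(check_call_limits(5, 3))
```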

Exceptional Scenarios

Sometimes, strange things happen which make call counting even more adventurous.

Calls "slipping through"


We said above that the PBX is counting signalling connections. While this is true, care must be taken. More precisely, the PBX is counting signalling connections between PBXs. A signalling connection that does not flow through a PBX is never counted.

In our little scenario, suppose a user who uses a software-phone which is registered with the master PBX. Now if this user visits the slave location and uses the software-phone, then it would actually be registered with the master PBX. Even if the phone is using gatekeeper discovery, so it finds the local slave PBX, this PBX would redirect the registration to the master (as this is what is configured for this user).

When the user now calls, then this call bypasses the local slave PBX and goes directly to master PBX. This call is thus not counted as a call between the slave and the master!

If the call goes to a destination that is registered with the master too, then the call is not counted at all. If the call goes to a destination that is registered with the other slave, then the call is not counted between the calling slave and the master, but it is counted between the master and the called slave. To make things completely ridiculous, if the call goes to a destination that is registered with the same slave the software-phone user visits, then the call is counted between the master and this slave - although media data will never leave the site!

Weird scenarios


Remember the scenario where the master's link to a called slave failed? What about call counting here?

If we apply what we have learned so far, then we end up with 2 calls between master and slave, although in fact we have zero. This is because we have 2 signalling connections between the 2 PBXs, but ultimately the media stream will stay in the slave location (between phone and PSTN gateway).

There are a number of such scenarios where, for one reason or another, calls need to be pushed through the PBX tree before the final destination can be determined. To accommodate this, a PBX will try to detect these situations and, instead of counting the returning call twice, will lower the call counter.

This is done by inspecting the unique call identifier that comes with a call and thereby associating call legs belonging to the same call. This will work most of the time, but some scenarios may not be detectable, especially if 3rd party devices play a role in the scenario.

Also, the "correction" of the call counter is done only once the call returns. Until then, the counter stays incremented. If it reached the limit due to this increment and another call is tried before the first call has returned, the new call will be rejected!
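The following sketch models this behaviour on one PBX-to-PBX link. It is a simplified illustration under assumptions, not the PBX's real algorithm: the class and method names are invented, and a plain string stands in for the unique call identifier.

```python
class LinkCounter:
    """Call counter for one PBX-to-PBX link with returning-call correction."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0
        self.outbound_ids = set()

    def send(self, call_id):
        """A call leg leaves towards the other PBX."""
        if self.count >= self.limit:
            return False  # limit reached, call rejected
        self.count += 1
        self.outbound_ids.add(call_id)
        return True

    def receive(self, call_id):
        """A call leg arrives from the other PBX."""
        if call_id in self.outbound_ids:
            # Same call coming back: lower the counter instead of
            # counting a second leg.
            self.outbound_ids.discard(call_id)
            self.count -= 1
            return True
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


link = LinkCounter(limit=1)
first = link.send("call-a")       # counter goes to 1
second = link.send("call-b")      # rejected, call-a has not returned yet
returned = link.receive("call-a") # correction: counter back to 0
retry = link.send("call-b")       # accepted now
print(first, second, returned, retry)
```

The rejected `call-b` shows the window described above: between the increment and the correction, the link looks busier than it really is.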

Handling PSTN Overflow

In some cases the number of available channels on the PSTN trunk line might not be sufficient for the user demand in a location. For users in slave locations, there are - depending on the trunk line strategy - two major scenarios:
  • users primarily use their own local trunk line. A central (master) trunk line is used for overflow handling
  • users use a central (master) trunk line. The local trunk line is used for overflow handling

Of course, a number of other strategies are possible. Some installations, for example, have 2 major sites with big trunk pipes, each using its own trunk first and the respective other major site's trunk for overflow. However, the two major scenarios are sufficient to discuss the issues.

Prefer Local Trunk

Suppose a scenario where a site (Slave) has its own trunk and the master too.

Handle local Overflow


The site has 50 users and 2 BRI lines (thus 4 channels) and during main traffic hours, there may not be enough channels. So what can we do to handle such an overflow situation?

The solution is to handle overflow conditions by re-routing calls which fail due to missing local resources (e.g. no channel available) to another ISDN line (e.g. the 2 PRI lines at the master). This can be done by configuring CFNR at the slave-trunk object, pointing to master-trunk. The CFNR will be activated when the Trunk Line object receives a No channel available message from all registered ISDN interfaces. So now we can continue to make calls into the PSTN even when our local PSTN connection is overloaded.

Reserve Channels for incoming Calls


Still, when all local channels are busy, external callers from the PSTN will not be able to reach anybody in the slave location. Some customers enforce an incoming callers take precedence policy, so they want to reserve a certain number of channels for incoming calls.

This can be done using a call counter in the routing engine. It lets you specify a maximum number of allowed calls for a number of routes.

In our example, let us assume that 2 channels should be reserved for incoming calls. We thus put a call counter with the same name on all outbound routes from slave-trunk to the ISDN interfaces (e.g. BRI1 and BRI2), setting max to 2. The third outgoing call will then trigger the overflow condition and the already configured overflow handling will take place.
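The effect of such a shared counter can be sketched as follows. This is only a model of the idea, with invented class and function names; it does not reproduce the routing engine's configuration syntax.

```python
class CallCounter:
    """Counter shared by all outbound routes that carry the same name."""

    def __init__(self, max_calls):
        self.max_calls = max_calls
        self.active = 0

    def try_acquire(self):
        if self.active >= self.max_calls:
            return False
        self.active += 1
        return True

    def release(self):
        """Called when an outgoing call ends."""
        if self.active > 0:
            self.active -= 1


def route_outgoing(counter):
    """Local trunk while the counter allows it, overflow to the master otherwise."""
    return "slave-trunk" if counter.try_acquire() else "master-trunk"


outbound = CallCounter(max_calls=2)  # 4 channels, 2 kept free for incoming calls
targets = [route_outgoing(outbound) for _ in range(3)]
print(targets)
```

Because incoming calls do not pass through the counted routes, they can still use the two channels that outgoing calls are no longer allowed to take.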

Prefer Main Trunk

In this case all slave users primarily use the main trunk lines (master-trunk) and use the local trunk line (slave-trunk) on overflow only.

You may think: great, so we use a CFNR similar to the one we set on slave-trunk in the previous example, but now set on master-trunk. Nice try, but this does not work in general.

The problem with this solution is that the main line is potentially used by more than one slave. So the CFNR set at master-trunk would need to depend on the calling slave. You may be tempted to think this could be made dependent on the calling number, but this is unreliable (extensions can be spread over locations with no apparent scheme; moreover, mobile endpoints may move from one location to the other, calling the respective local trunk).

To solve this issue, we can use the slave's local routing table re-routing mechanism. The idea is to have an extra registration (referred to as GWx in the drawing), that registers to the PBX and is used to bounce calls to slave-trunk back to the PBX first.

When doing so, you need to make sure that this bounced-back call addresses the master-trunk, so the PBX will re-route the call to the master's trunk. In our case, the way to go is to put the slave-trunk-access object in the root node (so it will address master-trunk instead of slave-trunk when calling 0).

Should the call fail at the master's trunk, the cause is reported back to the slave's routing engine and it will now re-route to the second route which is destined to the local trunk.
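The re-routing logic of the slave's routing table can be pictured as an ordered list of routes that is tried one after the other. The sketch below is a simplification with invented names: it treats every failure as a reason to try the next route, whereas a real routing engine re-routes only on certain causes.

```python
def place_call(routes, attempt):
    """Try routes in order and return (route, None) for the first success.

    attempt(route) must return (ok, cause). If every route fails,
    (None, last_cause) is returned.
    """
    cause = None
    for route in routes:
        ok, cause = attempt(route)
        if ok:
            return route, None
    return None, cause


def attempt_with_busy_master(route):
    """Hypothetical outcome: the master's trunk has no free channel."""
    if route == "GWx":  # bounced back to the PBX, ends at master-trunk
        return False, "no channel available"
    return True, None   # local ISDN interface accepts the call


routes = ["GWx", "local-trunk"]  # first route: master, second: local trunk
result = place_call(routes, attempt_with_busy_master)
print(result)
```

The order of the list is what expresses the "prefer main trunk" policy; swapping the two entries would give the "prefer local trunk" behaviour.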

CLIP no Screening

Although it is not directly related to the topic of this book, you will likely run into this issue when configuring overflow scenarios: CLIP no screening.

When re-routing calls through various trunk lines, the calls' calling line id (CLI) will vary, depending on which trunk is used. This may be OK, but some customers insist on sending the "right" CLI always, where "right" depends on the location the original caller is in (e.g. use the local trunk number even if the call is in fact routed through a remote central trunk).

Two things must be done for this to work:
  • the provider must be convinced to let you send CLIs which do not match the trunk line's own subscriber number. In many cases, convincing here is related to paying money
  • CLIs must be sent in the internal PBX network so that they can be presented correctly when the call is sent out to the trunk

Usually, this is accomplished by prefixing the CLI somehow when re-routing. For example, a slave with a local trunk subscriber number +49703173009 could prefix a re-routed call's CLI with 004970317300. At the trunk end, this can then be converted into an appropriate CLI.

Interdependency with WAN Re-Routing

When both WAN and ISDN redundancy are desired, care must be taken to configure both correctly. To understand the issue, have a look at the Prefer Main Trunk scenario again.

Suppose the WAN connection breaks down or has reached its capacity. If a user local to the slave (e.g. User10) calls the trunk, the call will be sent through the Trunk Line object slave-trunk to the gateway's routing engine. The first routing option will now try to send the call to the master-trunk.

If the slave-trunk-access object has the slave PBX as its PBX property (that is, GWx registers with the slave PBX), then this call would flow from the slave's routing level to the slave's PBX and the slave PBX will then route it to the master PBX. As this is not possible, the slave PBX will use the WAN overflow configuration (as discussed before) and re-route the call to the PSTN (via slave-trunk). So the call will be sent through the Trunk Line object slave-trunk to the gateway's routing engine. The first routing option will now try to send ... See the problem?

To avoid this endless loop, we can set the PBX property of slave-trunk-access to the master PBX. This should make sure the attempt to send a call to the central trunk fails without triggering the WAN re-route mechanism. However, it does not necessarily do so! How come?

Recall for a moment the registration redundancy schemes discussed previously. If GWx sends its registration request to the slave PBX (which would happen if you set the gatekeeper Address to 127.0.0.1, for example), then the slave would re-direct this registration to the master (which is what we want). However, if the master is not available (and standby licenses are available), the slave would take the registration anyway and accept it on behalf of the master. While this is great for registration redundancy, it causes the same endless-loop issue as before.

The solution is to configure the gatekeeper Address property of GWx to the master explicitly (and possibly the secondary gatekeeper address to the master-standby, if there is one). If the master is unavailable (and there is no standby), then GWx's registration to slave-trunk-access will fail with no fall-back, which is exactly what we want.

Call Counting


This solution is clean and neat. However, it still has a drawback: for successful calls to the master via GWx, no call counter is incremented. The slave's call counting mechanism is effectively by-passed. So if call-counting is essential, the solution doesn't work.

When call counting (as opposed to merely handling a WAN outage) is essential, the scheme needs a slight variation. GWx has to register with the slave PBX (so that calls do not by-pass the slave's call counting). However, when a call is re-routed back to the local trunk due to WAN overflow or outage and the loop happens, it needs to be detected in the gateway level. This can be done by prefixing calls sent through GWx and detecting the prefix both in the slave's gateway level (where such a call needs to be rejected with a re-routing cause) and in the master's gateway level (where the prefix simply needs to be removed).
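The prefix scheme can be sketched as three small functions, one per place where the called number is touched. The prefix value and all function names are hypothetical and chosen for illustration only; in a real installation the prefix must not collide with any number users can actually dial.

```python
LOOP_PREFIX = "*99"  # hypothetical marker, not a value from the source


def via_gwx(called_number):
    """Route through GWx: mark the call before bouncing it to the PBX."""
    return LOOP_PREFIX + called_number


def slave_gateway(called_number):
    """Slave gateway level: a marked call arriving here has looped back."""
    if called_number.startswith(LOOP_PREFIX):
        return ("reject", None)  # reject so that the next route is tried
    return ("route", called_number)


def master_gateway(called_number):
    """Master gateway level: the call arrived as intended, strip the marker."""
    if called_number.startswith(LOOP_PREFIX):
        return ("route", called_number[len(LOOP_PREFIX):])
    return ("route", called_number)


marked = via_gwx("0123456")
print(master_gateway(marked))  # WAN fine: the master sends the plain number out
print(slave_gateway(marked))   # WAN down: the call came back and is rejected
```

The rejection at the slave's gateway level is what breaks the loop: the routing engine then falls through to its second route, the local trunk.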

WebDav Server on the SSD

Special care needs to be taken if the internal WebDav server on the SSD drive is used in a master/standby redundancy scenario.

Using the "right" IP Address


You need to be aware that when the standby system takes over, all the PBX objects which access the WebDav (e.g. for announcements) still need to have access to the local drive, which then is the SSD of the standby.

When no LinuxAP is used, this can be achieved fairly simply by using 127.0.0.1 (local loopback) in the URLs (e.g. http://127.0.0.1/DRIVE/CF0/play/dingdong.g711).

However, when the LinuxAP is used, the SSD drive will be handled by the LinuxAP, not the gateway, and thus it cannot be reached using local loopback. The solution here is described in Use LinuxAP Webdav Server in Redundancy Scenarios.

Synchronization of WebDav Folders


Please be aware that the PBX synchronization between master and slave does not synchronize files on the WebDav server. For static files like announcements, this is not an issue (just have a copy stored on the slave). For dynamic files like voice-mail, this may be a problem. If so, you need to employ one of the synchronization tools available on the market (e.g. Microsoft's robocopy or the freeware tool GM - UniversalSync).

Application Redundancy

When planning redundancy, keep in mind that users usually expect everything to work all the time, no matter what!

More and more features and services implemented by innovaphone or 3rd-party applications are considered "essential" by users. It is important thus to design redundancy schemes for those too, if required.

However, this is not part of this book.

Calls in the Event of Failure

Let us look at what happens to the calls in the event of a failure. Generally, for a call, we have (at least) two endpoints which talk to each other. These may be phones or gateways or combinations of them.
In addition to that, we have some signalling entities (that is, master or slave PBXs) where the call signalling is routed through.

It should be obvious that a call breaks when one of the participating endpoints fails. However, it is important to understand that a call will also break if one of the signalling entities the call signalling is routed through breaks. In this case, the signalling connection associated with the call will be broken and - possibly after some timeout - the endpoint will detect this and close the conversation.