Reference:Device Health Check

Revision as of 11:31, 27 July 2010

Applies To

This information applies to

  • all innovaphone devices


More Information

This article describes how to perform a quick health check on an innovaphone device. It may serve as guidance for routine checking of a PBX system's state as well as a starting point for debugging when a device malfunctions.


Overview

To perform a device health check, the following steps are recommended:

  • Inspect the device alarm table
  • Inspect the device event list
  • Inspect performance counters
  • Check volatile memory (RAM) usage
  • Check CPU usage
  • Check persistent memory (Flash) usage
  • Check CF card usage
  • Examine system log
  • Check for efficient PBX configuration

Depending on the device type, some of the steps may or may not apply.

The Alarm Table

The system's alarm table, available under Administration/Diagnostics/Alarms, should always be empty. If there are entries in it, examine them carefully and fix the problem so that the alarms disappear. The reason for this is simple: if the alarm table is filled with entries that you have already checked and considered acceptable, it will not take long until a severe problem hides among the harmless entries. So make it a habit to keep your alarm table empty.

The alarms have a details page available in the Code column. It sometimes shows useful further information. Also, many of the error codes have a dedicated wiki help page, available through the details page's Help button.

The Event List

The system's event list, available under Administration/Diagnostics/Events, may contain a fair number of entry types. As opposed to the alarm table, where an entry is removed when the problem is fixed, entries in the event list are not removed (unless the list exceeds its allowable size, in which case the oldest entries are removed). As a result, your event list will rarely be empty. You need to work through the list and analyse each entry to determine whether or not it is still relevant. To avoid analysing entries again and again, you can clear the whole list when finished, or (from V8 on) you can declare an entry as already taken care of using the Mark button in the detail page (which will render the entry's Code column in green instead of red).

As for alarms, it is important to take care of problems which frequently create entries in the event list, as otherwise you will most likely overlook severe problems in a crowded event list.

The Performance Counters

The system's performance counters, available under Administration/Diagnostics/Counters, are another valuable tool to determine its health status. These are graphs which show the status of certain resources over the last 24 hours (8 hours are shown; you can scroll through time using the arrow buttons). Each individual value on the x-axis is a 2-minute average. Let's have a look at the resource counters currently available.

CPU

This counter shows the total CPU usage. A system that runs near 100% all the time is obviously a candidate for unpleasant behaviour. However, running at 100% simply means that things are getting slow. It does not necessarily mean that things don't work. But as a rule of life, demand is always increasing, not decreasing, so there are two options in such a situation:

  • be prepared for an upgrade
  • eliminate the reason for excessive CPU usage

We will discuss in a later section of this article how to determine who the CPU hog is.

We generally recommend making sure that CPU load is not more than two thirds on average. This of course is just a rule of thumb. You may well run a system with higher load, or you may experience problems with a system with lower load. Still, if your system continuously runs with more than 2/3 of its CPU power, you should ask yourself if this is expected behaviour.

Also, watch out for unusual peaks. Even when a system has a healthy looking CPU load graph, check whether there are thin peaks. These would indicate short time frames with high load. They do not look impressive in the graph, yet they may create a severely bad user experience. Suppose for example there is a CPU usage pattern which creates 100% CPU load for a minute every once in a while. This would look like a comfortable 50% value in the 2-minute graph slice. The end user though may experience service outages, signalling time-outs, frustrating response times etc. during this period of time. Such patterns usually look like a sharp needle in the graph and you should pay attention if you observe one.
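The averaging effect described above can be illustrated with a few lines of Python. This is only a sketch: the per-second samples are hypothetical (the device exposes only the 2-minute averages), and the function names are chosen for illustration.

```python
# Hypothetical per-second CPU load samples (percent) inside one 2-minute slice:
# one minute at full load followed by one idle minute.
samples = [100.0] * 60 + [0.0] * 60


def slice_average(values):
    """Average load over one graph slice, in percent."""
    return sum(values) / len(values)


def slice_peak(values):
    """Highest load seen inside the slice, in percent."""
    return max(values)


# The graph would plot the average, which hides the one-minute overload.
print(f"average {slice_average(samples):.0f}%, peak {slice_peak(samples):.0f}%")
```

The plotted value stays well below the two-thirds rule of thumb although the device was saturated for half of the slice.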

CPU-R

This counter shows the amount of reserved CPU time. As opposed to the CPU counter, this is not CPU load that is actually used, but CPU performance which needs to be reserved for real-time applications. Sending/receiving RTP data is considered real-time, for example. As opposed to, say, HTTP access, where missing CPU performance merely makes things run slower, missing CPU performance for the transmission of RTP data results in voice drop-outs and thus is considered a failure. This is why calls are rejected right away when no CPU performance can be reserved. When your CPU-R counter often runs near 100%, your system is overloaded and needs to be upgraded. Note that real-time applications do not suffer from 100% CPU values though, as the innovaphone operating system features a very efficient real-time prioritization.


MEM

This counter shows the total volatile memory (RAM) usage. Memory usage is in fact more critical than CPU usage, as a system with low or no CPU resources left will still run, albeit slowly. A system with no memory resources left will stop working and re-boot instead! It is thus crucial to keep an eye on memory usage. You may observe that the memory usage graph usually is flat, that is, the value never decreases and rarely increases, especially once the device has run for a while after a re-boot. This is because the memory allocation strategy claims memory for a specific purpose when needed, but never de-allocates it later on. Instead, objects allocated but no longer used are marked as free and re-used when an object of the same type is needed later on. The memory allocation graph shows all allocated objects, including those which are used and those which are currently free (as they are not available for use by objects of other types).

If memory usage grows steadily, there is most likely a memory leak. That is, a function in the device allocates objects, which claims memory, but fails to mark those objects as free when done. If this happens, each function invocation will result in lost memory and the only way to recover from this situation is to re-boot the system. We will discuss how to track down memory hogs in a later section.
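As a rough illustration of the difference between a healthy (flat) graph and a steadily growing one, the following Python sketch classifies a series of MEM readings. The readings, the function name and the 80% threshold are assumptions made for this example; the device itself offers no such check.

```python
def grows_steadily(samples, min_rise_ratio=0.8):
    """Return True if usage never drops and rises in most intervals.

    samples: memory usage readings (e.g. percent), oldest first.
    min_rise_ratio: assumed threshold for "most intervals".
    """
    if len(samples) < 2:
        return False
    deltas = [later - earlier for earlier, later in zip(samples, samples[1:])]
    if any(delta < 0 for delta in deltas):
        return False  # usage dropped, not the leak pattern described above
    rises = sum(1 for delta in deltas if delta > 0)
    return rises / len(deltas) >= min_rise_ratio


# Hypothetical readings taken from the MEM graph at regular intervals.
healthy = [41, 41, 41, 42, 42, 42, 42]   # flat, rarely increases
leaking = [41, 43, 44, 46, 47, 49, 50]   # grows in every interval

print(grows_steadily(healthy), grows_steadily(leaking))
```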

PRIx / BRIx

These counters show the number of B-channels used on the respective ISDN interfaces. Their interpretation should be fairly straightforward.

Finding the CPU Hog

In the event that you observe excessive CPU load, you need to find out which function actually consumes it. Once you have done so, you can either try to re-configure the system so that the causing function is used less, or you can add an extra device and move this function to the new device. For example, if you find that your PBX runs out of CPU resources because of heavy load on the system's CF-card, you may want to move the CF card to a different device which has more CPU cycles left.

Unfortunately, you cannot easily determine which function takes your CPU cycles away. However, you can determine which module does. In our previous example, you won't be able to determine that it's excessive config save operations on the CF card that produce the load, but you will be able to determine that it's the CF card that does.

A module is a piece of code in the device's firmware that performs a certain task. You can think of it as a thread in an operating system such as Windows or Linux. Modules are named, and you can retrieve a list of modules currently running on your device using the !mod command. Assuming your device has the IP address 172.16.0.10, you would open http://172.16.0.10/!mod with your browser. You will receive a list of modules such as this one:



In this table, the first column (modules) and the second column (ticks) are of interest. The second column shows the number of ticks used by the module since the counters were last reset. The system never resets the counters on its own. They will be reset if you add the clr option to the mod command (http://172.16.0.10/!mod clr). A tick, by the way, is 125 microseconds, so be prepared for large values.
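To get a feeling for the magnitude of these values, tick counts can be converted to seconds. The sketch below uses only the 125 microsecond tick length stated above; the function name is chosen for illustration.

```python
TICK_MICROSECONDS = 125  # length of one tick, as stated in the text


def ticks_to_seconds(ticks):
    """Convert a tick count from the !mod table into seconds of CPU time."""
    return ticks * TICK_MICROSECONDS / 1_000_000


# 8000 ticks make up one second, so a module that was busy for a full hour
# has accumulated 28.8 million ticks.
print(ticks_to_seconds(8000))        # 1.0
print(ticks_to_seconds(28_800_000))  # 3600.0
```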

For easier analysis, you may want to import this table into your favourite spreadsheet application. In Microsoft Excel, for example, you would highlight the whole table, copy it to the clipboard, then use Inhalte einfügen / Paste Content with Quelle: Text / Source: Text so that each column ends up in a separate Excel column (you may need to use the Import Assistant; later V8 and V9 versions output a more Excel-friendly table format). After some cleanup (removing unused columns), your spreadsheet should look something like this:


Excel Import

You can then add a column (column E in our example) which relates the number of ticks used by a module to the total number of ticks used (=B96/($D$98)). This is the relative load imposed by a certain module. Finally, sort the table by this column and you get a list of CPU hog candidates:


Excel Analysis
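If you prefer a script over a spreadsheet, the same calculation can be sketched in Python. The tick values below are hypothetical, and the sketch assumes the module names and tick counts have already been extracted from the !mod table by hand or by other means, since the exact output format is not described here.

```python
def relative_load(ticks_by_module):
    """Return (module, share) pairs sorted by share of total ticks, highest first."""
    total = sum(ticks_by_module.values())
    if total == 0:
        return []  # no ticks counted yet, e.g. right after a counter reset
    shares = {name: ticks / total for name, ticks in ticks_by_module.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


# Hypothetical tick counts for a handful of modules.
sample = {"PBX": 200, "H323": 2000, "IP0": 4000, "CFLASH": 800, "ETH0": 3000}

for name, share in relative_load(sample):
    print(f"{name:8s} {share:6.1%}")
```

The sorted list corresponds to the sorted column E in the spreadsheet: the modules at the top are the CPU hog candidates.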


Unfortunately, there is no list of which module has which name, but a look at the module name will give a good indication of what it is. The typical scenario is that IP0 (the IP stack), ETHx (the Ethernet drivers) and H323 (the H.323 signalling stack) will be on top. If the device is a gateway too, the DSP drivers (AC-DSPx) and SRTP encryption drivers (MV78X00_CRYPT) will be in the top 10 too. In our example, you can see that the compact flash driver (CFLASH) is also prominent. Compared to that, the PBX itself - accounting for 2% of the cycles consumed - is only a side-note in our example.

Be aware that these figures are counted from the last counter reset. To identify the source of a CPU resource problem, you would wait for the situation to happen (i.e. you would wait for a time when the CPU usage is high) and then do the math. However, you should start the analysis by resetting the counters (!mod clr), then wait a significant amount of time (half a minute may be a good start), then get the stats (!mod) and analyse. If you fail to reset the counters, your picture will be distorted, as it is influenced by previous (and probably unknown) activity.
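The reset, wait and read sequence can be scripted. The following Python sketch makes several assumptions that are not confirmed by this article: that the space in "!mod clr" can be written as "+" in the URL, that the device answers without further authentication, and that the reply is plain text. The function names are hypothetical.

```python
import time
import urllib.request


def http_get(url):
    """Fetch a URL and return the body as text (assumes no login is required)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def sample_modules(base_url, wait_seconds=30, fetch=http_get, sleep=time.sleep):
    """Reset the tick counters, wait, then return the raw !mod output."""
    fetch(f"{base_url}/!mod+clr")    # assumption: "clr" is appended with "+"
    sleep(wait_seconds)              # half a minute is the suggested starting point
    return fetch(f"{base_url}/!mod")


# Example call (commented out, as it needs a reachable device):
# table = sample_modules("http://172.16.0.10")
```

Passing fetch and sleep as parameters keeps the sequence testable without a device.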


  • If the device trapped, you should get a trace file from the device. The trace can be obtained by clicking on trace(buffer) in the Administration/Diagnostics/Tracing menu.

Writing to the trace buffer is disabled when the device traps. Also, the trace buffer is not cleared on a re-start. Thus, after a re-start, you can obtain the trace from before the re-start, showing the trap situation! This trace will contain the reason for the restart.


  • Have a look at the Administration/Diagnostics/Counters menu. The counters show the current usage of different components. A view of their previous usage (up to 12 hours in the past) is also possible.
    • CPU - shows the CPU load over time. Use the scroll function and look for unexpectedly high CPU load.
    • MEM - shows memory usage. In a running system the memory usage should stay constant. A linear increase of memory usage over time indicates a memory leak.
    • TEL/PRI/BRI - shows the usage of B-channels on the interface.



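  • use the mod command, e.g. http://172.16.3.63/!mod. This will give you an output listing the modules and their CPU time.

Using the Total CPU time you can calculate the CPU usage of each module and look for possible problems. E.g. if the HTTP module has the highest usage, you might want to look for problems related to SOAP (HTTP), Webmedia (HTTP) etc.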


  • use the mem command, e.g. http://172.16.3.63/!mem. This will give an output that shows the memory usage of each module.


and memory used by LDAP (PBX User DB) and VARS (other settings). Make sure the max value for the current platform is not reached.

bottom 0xb0060000 base 0xb0060000 top 0xb0800000 segsize 0x20000 segments 61
LDAP - used 12k avail 114k owned 128k (max 3200k)
VARS - used 30k avail 67k owned 128k (max 128k)

0  0xb0060000 free(0x00) owner FIRM(0x08) magic 0x666d order 0x00140001 usage 0x00000006
1  0xb0080000 free(0x00) owner FIRM(0x08) magic 0x666d order 0x00140002 usage 0x00000006
2  0xb00a0000 free(0x00) owner FIRM(0x08) magic 0x666d order 0x00140003 usage 0x00000006
...
18  0xb02a0000 used(0x80) owner MINI(0x09) magic 0x666d order 0x00030001 usage 0x00000001
19  0xb02c0000 used(0x80) owner MINI(0x09) magic 0x666d order 0x00030002 usage 0x00000001
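The LDAP and VARS summary lines of such an output can be extracted with a short script. This Python sketch assumes the line format shown in the example above and does not interpret which of the values has to stay below the max; it only makes the figures easy to compare. The function name is hypothetical.

```python
import re

# Matches lines such as "LDAP - used 12k avail 114k owned 128k (max 3200k)".
USAGE_LINE = re.compile(
    r"^(?P<name>\w+) - used (?P<used>\d+)k avail (?P<avail>\d+)k "
    r"owned (?P<owned>\d+)k \(max (?P<max>\d+)k\)$"
)


def parse_usage(text):
    """Return {name: {"used", "avail", "owned", "max"}} with sizes in kilobytes."""
    result = {}
    for line in text.splitlines():
        match = USAGE_LINE.match(line.strip())
        if match:
            result[match["name"]] = {
                key: int(match[key]) for key in ("used", "avail", "owned", "max")
            }
    return result


output = """LDAP - used 12k avail 114k owned 128k (max 3200k)
VARS - used 30k avail 67k owned 128k (max 128k)"""

for name, sizes in parse_usage(output).items():
    print(f"{name}: used {sizes['used']}k, owned {sizes['owned']}k, max {sizes['max']}k")
```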