Reference:Device Health Check: Difference between revisions

From innovaphone wiki
Revision as of 19:29, 26 July 2010

Device Health Check

Applies To

This information applies to

  • all innovaphone devices


More Information

This article explains how to perform a quick health check on an innovaphone device. It may serve as guidance for routinely checking a PBX system's state, as well as a starting point for debugging when a device malfunctions.


Overview

To perform a device health check, the following steps are recommended:

  • Inspect the device alarm table
  • Inspect the device event list
  • Inspect performance counters
  • Check volatile memory (RAM) usage
  • Check CPU usage
  • Check persistent memory (Flash) usage
  • Check CF card usage
  • Examine system log
  • Check for efficient PBX configuration

Depending on the device type, some of the steps may or may not apply.

The Alarm Table

The system's alarm table, available under Administration/Diagnostics/Alarms, should always be empty. If there are entries in it, examine them carefully and fix the underlying problem so that the alarms disappear. The reason for this is simple: if the alarm table is filled with entries that you have already checked and considered acceptable, it will not take long until a severe problem hides among the harmless entries. So make it a habit to keep your alarm table empty.

Each alarm has a details page, available through the Code column. These pages sometimes show useful further information. Also, many of the error codes have a dedicated wiki help page, available through the Help button on the details page.

The Event List

The system's event list, available under Administration/Diagnostics/Events, may contain a fair number of entry types. As opposed to the alarm table, where an entry is removed when the problem is fixed, entries in the event list are not removed (unless the list exceeds its allowable size, in which case the oldest entries are removed). As a result, your event list will rarely be empty. You need to work through the list and analyse each entry to determine whether or not it is still relevant. To avoid analysing entries again and again, you can clear the whole list when finished or (from V8 on) you can declare an entry as already taken care of using the Mark button on the details page (which renders the entry's Code column in green instead of red).

As with alarms, it is important to take care of problems which frequently create entries in the event list, as otherwise you will most likely overlook severe problems in a crowded event list.

The Performance Counters

The system's performance counters, available under Administration/Diagnostics/Counters, are another valuable tool for determining its health status. These are graphs which show the status of certain resources over the last 24 hours (8 hours are shown at a time; you can scroll through time using the arrow buttons). Each individual value on the x-axis is a 2-minute average. Let's have a look at the resource counters currently available.

CPU

This counter shows the total CPU usage. A system that runs near 100% all the time is an obvious candidate for unpleasant behaviour. However, running at 100% simply means that things are getting slow. It does not necessarily mean that things don't work. But as a rule, demand is always increasing, not decreasing, so there are two options in such a situation:

  • be prepared for an upgrade
  • eliminate the reason for excessive CPU usage

We will discuss in a later section of this article how to determine who the CPU hog is.

We generally recommend making sure that CPU load does not exceed two thirds on average. This is of course just a rule of thumb. You may well run a system with higher load without trouble, or you may experience problems on a system with lower load. Still, if your system continuously uses more than 2/3 of its CPU power, you should ask yourself whether this is expected behaviour.

Also, watch out for unusual peaks. Even when a system has a healthy looking CPU load graph, check whether there are thin peaks. These indicate short time frames with high load. They do not look impressive in the graph, but they may still create a severely bad user experience. Suppose, for example, there is a CPU usage pattern which creates 100% CPU load for a minute every once in a while. This would look like a comfortable 50% value in the 2-minute graph slice. The end user, though, may experience service outages, signalling time-outs, frustrating response times etc. during this period of time. Such patterns usually look like a sharp needle in the graph, and you should pay attention if you observe one.
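
The averaging effect described above can be illustrated with a small calculation. This is only a sketch: the per-second samples are hypothetical, and it assumes the device is otherwise idle during the remainder of the 2-minute slice.

```python
def window_average(samples):
    """Return the mean of CPU load samples (in percent)."""
    return sum(samples) / len(samples)


# Hypothetical 2-minute slice, one sample per second:
# 60 s at full load followed by 60 s idle.
window = [100.0] * 60 + [0.0] * 60

average = window_average(window)
peak = max(window)
print(f"average={average:.0f}% peak={peak:.0f}%")
```

The graph would only show the average, while the peak is what the end user experiences during the loaded minute.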

CPU-R

This counter shows the amount of reserved CPU time. As opposed to the CPU counter, this is not CPU load that is actually used, but CPU performance which needs to be reserved for real-time applications. Sending and receiving RTP data is considered real-time, for example. As opposed to, say, HTTP access, where missing CPU performance merely makes things run slower, missing CPU performance for the transmission of RTP data results in voice drop-outs and is thus considered a failure. This is why calls are rejected right away when no CPU performance can be reserved. If your CPU-R counter often runs near 100%, your system is overloaded and needs to be upgraded. Note, however, that real-time applications do not suffer from 100% values of the CPU counter, as the innovaphone operating system features a very efficient real-time prioritization.


  • If the device trapped, you should get a trace file from the device. The trace can be obtained by clicking on trace(buffer) in the Administration/Diagnostics/Tracing menu.

Writing to the trace buffer is disabled when the device traps. Also, the trace buffer is not cleared on a re-start. Thus, after a re-start, you can obtain the trace from before the re-start, showing the trap situation! This trace will contain the reason for the restart.


  • Have a look at the Administration/Diagnostics/Counters menu. The counters show the current usage of different components. Their previous usage (up to 12 hours in the past) can also be viewed.
    • CPU - shows the CPU load over time. Use the scroll function and look for unexpectedly high CPU load.
    • MEM - shows memory usage. In a running system, memory usage should stay constant. A linear increase of memory usage over time indicates a memory leak.
    • TEL/PRI/BRI - shows the usage of B-channels on the interface.
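
As a rough illustration of the MEM criterion above, a steady upward trend can be detected by fitting a straight line to a series of memory readings. This is a minimal sketch: the readings are hypothetical values noted down by hand (for example, percent used at regular intervals), and the threshold min_slope is an arbitrary assumption, not an innovaphone recommendation.

```python
def slope_per_sample(values):
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(values, min_slope=0.05):
    """Flag a steady upward trend; min_slope is an arbitrary threshold."""
    return slope_per_sample(values) > min_slope


# Hypothetical memory usage readings in percent.
steady = [41.0, 42.0, 41.5, 42.0, 41.0, 41.5]
leaking = [41.0, 42.0, 43.0, 44.0, 45.0, 46.0]

print(looks_like_leak(steady), looks_like_leak(leaking))
```

A positive result is only a hint to investigate further, since legitimate load growth during the day can also raise memory usage for a while.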


(Screenshot: Debugging check-list 1.png)

Using the Total CPU time, you can calculate the CPU usage of each module and look for possible problems. For example, if the HTTP module has the highest usage, you might want to look for problems related to SOAP (HTTP), Webmedia (HTTP) etc.
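
The calculation is a simple division of each module's CPU time by the total. The sketch below assumes you have copied the per-module times and the total from the device by hand; all figures are hypothetical, and the module names other than HTTP are placeholders.

```python
def cpu_shares(module_times, total_time):
    """Return (module, share in percent) pairs, highest share first."""
    shares = {name: 100.0 * t / total_time for name, t in module_times.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


# Hypothetical figures, not taken from a real device.
total = 2000
modules = {"HTTP": 900, "PBX": 500, "LDAP": 100}

for name, share in cpu_shares(modules, total):
    print(f"{name}: {share:.1f}%")
```

The module at the top of the sorted list is the first candidate to examine.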


  • Use the mem command, e.g. http://172.16.3.63/!mem. The output shows the memory usage of each module, as well as the memory used by LDAP (PBX user DB) and VARS (other settings). Make sure the max value for the current platform is not reached.

bottom 0xb0060000 base 0xb0060000 top 0xb0800000 segsize 0x20000 segments 61
LDAP - used 12k avail 114k owned 128k (max 3200k)
VARS - used 30k avail 67k owned 128k (max 128k)

0  0xb0060000 free(0x00) owner FIRM(0x08) magic 0x666d order 0x00140001 usage 0x00000006
1  0xb0080000 free(0x00) owner FIRM(0x08) magic 0x666d order 0x00140002 usage 0x00000006
2  0xb00a0000 free(0x00) owner FIRM(0x08) magic 0x666d order 0x00140003 usage 0x00000006
...
18  0xb02a0000 used(0x80) owner MINI(0x09) magic 0x666d order 0x00030001 usage 0x00000001
19  0xb02c0000 used(0x80) owner MINI(0x09) magic 0x666d order 0x00030002 usage 0x00000001
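
The LDAP and VARS lines of such an output can be evaluated with a short script. This is a sketch only: it assumes the line format is exactly as in the sample above, and it reports both the used and the owned figure relative to max, because the source does not state which of the two must stay below the maximum.

```python
import re

# Matches lines such as "LDAP - used 12k avail 114k owned 128k (max 3200k)".
POOL_LINE = re.compile(
    r"^(?P<name>\w+) - used (?P<used>\d+)k avail (?P<avail>\d+)k "
    r"owned (?P<owned>\d+)k \(max (?P<max>\d+)k\)$"
)


def parse_pools(text):
    """Extract the per-pool figures (in kilobytes) from mem command output."""
    pools = {}
    for line in text.splitlines():
        match = POOL_LINE.match(line.strip())
        if match:
            fields = match.groupdict()
            name = fields.pop("name")
            pools[name] = {key: int(value) for key, value in fields.items()}
    return pools


def usage_report(pools):
    """Return (used, owned) as percent of max for each pool."""
    return {
        name: (100.0 * p["used"] / p["max"], 100.0 * p["owned"] / p["max"])
        for name, p in pools.items()
    }


sample = """\
LDAP - used 12k avail 114k owned 128k (max 3200k)
VARS - used 30k avail 67k owned 128k (max 128k)
"""

pools = parse_pools(sample)
report = usage_report(pools)
for name, (used_pct, owned_pct) in report.items():
    print(f"{name}: used {used_pct:.1f}% of max, owned {owned_pct:.1f}% of max")
```

Lines that do not match the pattern, such as the segment list, are ignored.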