
IP PBX Alerts: How to Investigate?

ROSMAP SNMP IP-PBX ALARMS

Troubleshooting Process


Today we will discuss the CUCM alarms generated in the monitoring tool and the basic troubleshooting around these SNMP traps. For each alarm, I provide the error code, its explanation, and the troubleshooting steps to follow.
 
1.  Gateway Status:

Description:  The Call Managers have reported that a gateway has deregistered: 334266: May 26 09:48:20.570 UTC : %CCM_CALLMANAGER-CALLMANAGER-2-MGCPGatewayLostComm: MGCP communication to gateway lost. Device Name:mtbozwm03var01 App ID:Cisco CallManager Cluster ID:IPTRNW20 Node ID:IPTRNW20RCHVEC1. Check!
Investigation procedure:

1.      The above alarm indicates an MGCP connection issue.

2.      Verify port registration in CUCM. (If the ports are registered in CUCM, skip to step 8.)

3.      Verify that the gateway is reachable through ping. (If it is reachable, skip to step 8.)

4.      If the gateway is not reachable, check with Voice Tower for any power issue.

5.      If there is no power issue, engage onsite support to check for a network issue at the site.

6.      If onsite support rules out the network, ask them to open a tracker ticket for an FE visit to check for a physical connectivity issue.

7.      If the FE confirms that the physical connectivity and cabling are fine, engage Cisco TAC for a router RMA based on the symptoms.

8.      Log in to the gateway and issue three show commands: 1) “show mgcp” to verify that MGCP is active on the router, 2) “show ccm-manager” to verify MGCP registration, and 3) “show mgcp endpoint” to verify the registered MGCP endpoints.

9.      Log in to CUCM at the same time to check that the ports are configured properly.

10.   If a T1/E1 is bouncing, engage the carrier; if it is an FXS/FXO port, send an FE to check the cabling.

11.   Follow up with Telco till resolution
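
When many of these alarms arrive, it can help to pull the gateway name, cluster and node out of the alarm text before starting the steps above. The following is a minimal sketch in Python, assuming the alarm text always uses the "Label:value" layout seen in the sample description; the function name parse_gateway_alarm is hypothetical and not part of any Cisco tool.

```python
import re

# Field labels exactly as they appear in the sample alarm text (assumed fixed).
FIELD_LABELS = ("Device Name", "App ID", "Cluster ID", "Node ID")
_LABELS = "|".join(re.escape(label) for label in FIELD_LABELS)

# A value runs until the next label, a sentence-ending ". ", or end of text.
_FIELD_RE = re.compile(rf"({_LABELS}):(.*?)(?=\s+(?:{_LABELS}):|\.\s|$)")


def parse_gateway_alarm(text):
    """Return the alarm name and the labelled fields found in the alarm text."""
    fields = {label: value.strip() for label, value in _FIELD_RE.findall(text)}
    # The alarm name is the last part of the %FACILITY-SUBFACILITY-SEVERITY-NAME tag.
    name = re.search(r"%\S+-\d-(\w+):", text)
    fields["Alarm"] = name.group(1) if name else None
    return fields


if __name__ == "__main__":
    sample = (
        "%CCM_CALLMANAGER-CALLMANAGER-2-MGCPGatewayLostComm: MGCP communication "
        "to gateway lost. Device Name:mtbozwm03var01 App ID:Cisco CallManager "
        "Cluster ID:IPTRNW20 Node ID:IPTRNW20RCHVEC1. Check!"
    )
    for key, value in parse_gateway_alarm(sample).items():
        print(f"{key}: {value}")
```

The device name is the gateway to ping in step 3, and the cluster ID tells us which CUCM cluster to log in to for step 2.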

2.   Alarm PhoneDereg

Description: This alarm is generated when phones at a site become unregistered or lose connectivity to the Call Manager.

Alarm severity: 2

SNMP Alert: Priority: [1 - Critical] - Status: [New] - Assigned: [] - Category: [AutoCase, Cisco_ROS, Servers, AMER, AMER-GWM, AMER-US] Updated by: (Mgmt Application Platform Cisco) - Fri, 27 Jun 2014 15:29:06 EDT Workflow: []

New Auto Case Opened
PhoneDereg events have been received in the previous 60 seconds!

Device IP [Reason Code]: DeviceName - Description

10.186.53.115 [KeepAliveTimeout (13)] : SEPxxxxxxxxxxxx – End Users Name , XXX-XXX-XXXX***End User ID
10.186.53.63 [KeepAliveTimeout (13)] : SEPxxxxxxxxxxxx - End Users Name XXX-XXX-XXXX *** End User ID.
10.186.53.199 [KeepAliveTimeout (13)] : SEPxxxxxxxxxxxx - End Users Name  XXX-XXX-XXXX *** End User ID

Investigation procedure:

1.      Log in to CUCM and find the site ID where the phone is located.

2.      Search for the phone by its MAC address.

3.      Check the description of the phone and search again using the description as the criterion.

4.      Step 3 lists the phones with the same description, which belong to the same site.

5.      Ping or browse to some of the phone IPs to check whether they respond.

6.      If the phones do not respond, engage HP NOC to troubleshoot a data network issue.
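
The device lines in the alarm body follow the pattern "IP [Reason (code)] : DeviceName", so they can be split into fields before searching CUCM. This is only a sketch under that assumption; the function names are hypothetical and the sample device names and descriptions are made up.

```python
import re
from collections import Counter

# Matches lines such as: 10.186.53.115 [KeepAliveTimeout (13)] : SEP... - description
_LINE_RE = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+\[(\w+)\s*\((\d+)\)\]\s*:\s*(\S+)"
)


def parse_phone_dereg(body):
    """Return one dict per device line found in the alarm body."""
    events = []
    for line in body.splitlines():
        match = _LINE_RE.match(line)
        if match:
            ip, reason, code, device = match.groups()
            events.append(
                {"ip": ip, "reason": reason, "code": int(code), "device": device}
            )
    return events


def count_by_reason(events):
    """Count events per reason name, e.g. KeepAliveTimeout."""
    return Counter(event["reason"] for event in events)


if __name__ == "__main__":
    body = """Device IP [Reason Code]: DeviceName - Description
10.186.53.115 [KeepAliveTimeout (13)] : SEP001122334455 - Example User
10.186.53.63 [KeepAliveTimeout (13)] : SEP00112233AABB - Another User
"""
    events = parse_phone_dereg(body)
    for event in events:
        print(event["device"], event["ip"], event["reason"])
    print(count_by_reason(events))
```

The device names (SEP plus MAC) are what we search for in step 2, and the IP addresses are the ones to ping in step 5.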



When this alarm appears in ROSMAP, it is named IPTRXXXXXXXXXXXX, where the highlighted alphanumeric value represents the cluster name. (A reference sheet lists the IP addresses, from which we can find the IP address of the CUCM server.)

The MAC addresses of the deregistered phones are listed in the body of the alarm. Log in to the cluster found above and search for the MAC address from the alarm. From the phone configuration page, you can identify the device pool of the phone as shown in the figure below.

Next, open that device pool and search for all the phones with unregistered status (press CTRL + F and type Unregistered).





If the number of unregistered phones is more than five, we would perform the Inbound Outbound test for the site.
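
The "more than five" rule can be written down as a small check. This is a sketch only: it assumes we have copied the device pool listing into a list of (device name, status) pairs by hand, and the names needs_call_test and CALL_TEST_THRESHOLD are hypothetical.

```python
# Threshold taken from the rule above: run the call test when more than
# five phones in the device pool are unregistered.
CALL_TEST_THRESHOLD = 5


def needs_call_test(phones):
    """phones: iterable of (device_name, status) pairs from the device pool.

    Returns (run_test, unregistered_names).
    """
    unregistered = [
        name for name, status in phones
        if status.strip().lower() == "unregistered"
    ]
    return len(unregistered) > CALL_TEST_THRESHOLD, unregistered


if __name__ == "__main__":
    # Made-up device pool listing: six unregistered phones and one registered.
    pool = [(f"SEP00000000000{i}", "Unregistered") for i in range(6)]
    pool.append(("SEP0000000000FF", "Registered"))
    run_test, names = needs_call_test(pool)
    print("Run inbound/outbound test:", run_test, "for", len(names), "phones")
```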

Inbound Outbound Call Test Procedure:

·    Open the Called Synthetic Phone (labeled <Device Pool>-CROS-IPCOM-TEST-Called) for that particular site and enter the number (HP 918662873191) in the Forward All field. The synthetic phone is just a virtual phone added with a dummy MAC address and the profile of a 7960 phone. The status of the synthetic phone is irrelevant: it may be unregistered or unknown, and in both states the phone can be used for Call Forward All.

     Make a call to this Called synthetic number. If you end up at the HP IVR, the test stands TRUE; otherwise we need to check gateway reachability. There can be several scenarios in which the phones are not able to make inbound and outbound calls.

3.  PRI alarms:

Description: The controllers are down or there is no Layer 2 connectivity between the gateway and the service provider. In this case, use the command below on the gateway and check whether Layer 2 connectivity is active.

Investigation procedure:

Log in to the router.

R1# show isdn status

It will show output like the following:

Global ISDN Switchtype = primary-ni
ISDN Serial0/0/0:23 interface
        dsl 0, interface ISDN Switchtype = primary-ni
    Layer 1 Status:
        ACTIVE
    Layer 2 Status:
        TEI = 0, Ces = 1, SAPI = 0, State = MULTIPLE_FRAME_ESTABLISHED
    Layer 3 Status:
        0 Active Layer 3 Call(s)
    Active dsl 0 CCBs = 0
    The Free Channel Mask:  0x807FFFFF
   Number of L2 Discards = 0, L2 Session ID = 8

The following line shows that there is Layer 2 connectivity between the gateway and the service provider:

        “TEI = 0, Ces = 1, SAPI = 0, State = MULTIPLE_FRAME_ESTABLISHED”

If the state shows TEI_ASSIGNED instead, there is no Layer 2 connectivity; we need to escalate the case to the service provider and follow up with them.

       “TEI = 0, Ces = 1, SAPI = 0, State = TEI_ASSIGNED”
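
If the output of show isdn status is saved to a text file or pasted into a script, the Layer 1 and Layer 2 states can be read out automatically. A minimal sketch, assuming the output has the same layout as the sample above and contains a single ISDN interface; the function names are hypothetical.

```python
import re


def isdn_layer_states(output):
    """Return the Layer 1 status and Layer 2 state of the first interface."""
    layer1 = re.search(r"Layer 1 Status:\s*(\S+)", output)
    layer2 = re.search(r"State\s*=\s*(.+)", output)
    return {
        "layer1": layer1.group(1) if layer1 else None,
        "layer2": layer2.group(1).strip() if layer2 else None,
    }


def layer2_is_up(output):
    """Layer 2 is considered up only in MULTIPLE_FRAME_ESTABLISHED."""
    return isdn_layer_states(output)["layer2"] == "MULTIPLE_FRAME_ESTABLISHED"


if __name__ == "__main__":
    sample = """ISDN Serial0/0/0:23 interface
    Layer 1 Status:
        ACTIVE
    Layer 2 Status:
        TEI = 0, Ces = 1, SAPI = 0, State = MULTIPLE_FRAME_ESTABLISHED
"""
    print(isdn_layer_states(sample))
    print("Escalate to provider:", not layer2_is_up(sample))
```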

b) T1 controllers are down: We need to check the controllers on the gateway using the following command:

 R1# show controllers t1

 T1 0/0/0 is up.
  Applique type is Channelized T1
  Cablelength is long 0db
  Description: DID Range(s): 972-233-2198, 972-233-2816, 972-233-2896, 972-383-6900 ~ 6999, 972-385-4220 ~ 4229, 972-386-1300 ~ 1399, 972-980-8600 ~ 8699
  No alarms detected.
  alarm-trigger is not set
  Soaking time: 3, Clearance time: 10
  AIS State:Clear  LOS State:Clear  LOF State:Clear
  Version info Firmware: 20071011, FPGA: 13, spm_count = 0
  Framing is ESF, Line Code is B8ZS, Clock Source is Line.
  CRC Threshold is 320. Reported from firmware  is 320.
  Data in current interval (262 seconds elapsed):
     0 Line Code Violations, 0 Path Code Violations
     0 Slip Secs, 0 Fr Loss Secs, 0 Line Err Secs, 0 Degraded Mins
     0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 0 Unavail Secs
  Total Data (last 3 15 minute intervals):
     0 Line Code Violations, 0 Path Code Violations,
     0 Slip Secs, 0 Fr Loss Secs, 0 Line Err Secs, 0 Degraded Mins,
     0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 0 Unavail Secs

 The output “T1 0/0/0 is up” shows that the controller is up.

 The output “T1 0/0/0 is down” shows that the controller is down.

 1.      Ensure that the controller or voice port is not admin down. If it is admin down, check the logs for any record of the circuit being shut down.

2.      If no record is found in show log, open an SSR to remove the circuit from monitoring.

3.      If T1/E1 is bouncing or down then engage carrier for fault isolation.
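
The first line of each controller block in the show controllers t1 output is enough to choose between the steps above. The sketch below assumes that line has the form "T1 0/0/0 is up." as in the sample, and that an admin-down controller is reported with the words "administratively down"; that wording and the function names are assumptions to verify against your own gateway output.

```python
import re

# First line of each controller block, e.g. "T1 0/0/0 is up."
_STATE_RE = re.compile(
    r"^\s*(T1|E1)\s+(\S+)\s+is\s+(administratively down|up|down)",
    re.MULTILINE,
)


def controller_states(output):
    """Return a list of (type, slot, state) tuples, one per controller."""
    return _STATE_RE.findall(output)


def next_action(state):
    """Map a controller state to the matching step of the procedure."""
    if state == "up":
        return "no action: controller is up"
    if state == "administratively down":
        return "check show log for a record of the shutdown, else open an SSR"
    return "engage carrier for fault isolation"


if __name__ == "__main__":
    sample = " T1 0/0/0 is up.\n  Applique type is Channelized T1\n T1 0/0/1 is down.\n"
    for kind, slot, state in controller_states(sample):
        print(f"{kind} {slot}: {state} -> {next_action(state)}")
```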


4.    ALARM_TranConnError


Description: This alarm is generated when a device is not able to register to CUCM.
It is often generated at the same time as a phone deregistration alarm, so after troubleshooting it can be updated with the same logs as the phone deregistration alarm.


 Investigation procedure

Sometimes the MAC of the phone does not show in the alarm details. In that case, follow the steps below:

·     Go to the RTMT Application Syslogs and find the Transient Connection Failure alarm. Double-click the alarm to see the details of the device, i.e. the MAC of the phone.


·     Log in to the T server and enter the IP address of the phone in the browser. This opens the phone's web page (only if the phone is registered with the Call Manager), where we can find the MAC address.


·     This MAC can now be searched for in the respective cluster.


If the phone is not found in any of the clusters, it needs to be configured in CUCM.

·     In this scenario we do not know which device pool to configure the phone in. Follow the steps below to find it:


·     Go to the CUCM page > Device > Gateway and click Find. This lists all the gateways along with their IP addresses, and we can match the first three octets of the IP in the alarm to the IP of a gateway.



·     Knowing the gateway, we get the device pool for that phone and can configure the phone with the description “To Stop Transient Connection Failure Error”.
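
Matching the first three octets by eye is error-prone when the gateway list is long. Here is a small sketch using Python's standard ipaddress module, assuming the gateway names and IPs have been copied from the CUCM gateway list into a dictionary; the function name and sample gateway names are hypothetical, and the /24 prefix simply encodes the "first three octets" rule.

```python
import ipaddress


def gateways_on_same_subnet(phone_ip, gateways, prefix_len=24):
    """Return gateway names whose IP shares the phone's first three octets.

    gateways: dict mapping gateway name -> IP address string.
    prefix_len=24 corresponds to matching the first three octets.
    """
    network = ipaddress.ip_network(f"{phone_ip}/{prefix_len}", strict=False)
    return [
        name for name, ip in gateways.items()
        if ipaddress.ip_address(ip) in network
    ]


if __name__ == "__main__":
    # Made-up gateway list for illustration.
    gateway_list = {
        "SITE-A-GW": "10.186.53.1",
        "SITE-B-GW": "10.186.54.1",
    }
    print(gateways_on_same_subnet("10.186.53.115", gateway_list))
```

If no gateway matches, the site may use a different subnet layout, and the device pool has to be confirmed another way.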


5.  ALARM_ConfResourceBridgeDereg

Description: This alarm is generated when the conference bridge is unregistered with the call manager.

Investigation procedure:

1.      Log in to the Call manager >Media Resources>Conference Bridge as shown below (fig 1.1).

(fig 1.1)

2.      Status Registered: If the status of the conference bridge is Registered, close the alarm stating that the conference bridge is registered with the Call Manager (fig 1.2).
    Status Unregistered: If the status is Unregistered, there are two scenarios: A) there is no network connectivity between the gateway and CUCM; B) there is network connectivity but the gateway is not able to register.
A)     If there is no network connectivity, we need to restore it by contacting the network team and having them engage the service provider.
B)     If there is network connectivity between the gateway and CUCM, we need to get the proper bank approvals to reset the conference bridge via Call Manager. We may also need to log in to the gateway and bounce SCCP; before doing so, we must verify that the gateway is a Cisco-managed gateway and get the proper approvals from the bank to perform the SCCP reset.

Router(config)# no sccp (to stop SCCP on the gateway)

Router(config)# sccp (to re-enable SCCP on the gateway)


6.  Alarm GRP_NODE:

There are in general 2 categories in which the above mentioned alarm is generated.
1.     Synthetic call failure

2.     An MGCP endpoint fails to register with the Call Manager.

Alarm GRP_NODE (Synthetic call failure)


Description: This alarm is generated when a site fails the Automatic IPCOM test.

Investigation Procedure

 In this case we need to perform the manual test for that site and check if inbound and outbound calls are working fine for that site. If the manual test succeeds then we need to close the alarm stating that the site has passed manual IPCOM testing.

If the manual test fails, we need to identify the cause (i.e. why inbound and outbound calls are not working) and update the alarm with the reason. We also need to change the status of the alarm from “Pending Resolution” to “Vendor case open” if a ticket has been logged with a third party (Telco etc.).

Alarm GRP_NODE (MGCP endpoint fails to register)


Description: This alarm is generated when an MGCP end point fails to register with the call manager. In some cases there is an endpoint configured with the call manager to suppress the transient connection errors.

Process:  Scenario 1: The port is configured only to suppress the transient connection errors. In this case we need to check the port on the gateway to confirm its status. For these endpoints the port status is generally down, as they exist only to stop the transient connection errors.

Scenario 2: The port is configured with a device connected to it, such as a postage machine, fax machine or any other analog device. In this case we need to check the registration status of the MGCP port with the Call Manager and also check the port status on the gateway.

Checking the port status on the Call Manager: We first need to understand how an MGCP port is labeled in CUCM, for example AALN/S2/SU0/0@WILCXFS01VAR01:

AALN: Analog access line

S2: Slot 2

SU0: Sub unit 0

0: Port 0

@WILCXFS01VAR01: @ MGCP gateway name;
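
The breakdown above can be turned into a small parser that also gives the slot/sub-unit/port form used by the voice port command on the gateway. This is a sketch that handles only the analog endpoint form shown here (TYPE/S<slot>/SU<subunit>/<port>@<gateway>); the function names are hypothetical.

```python
import re

# Analog MGCP endpoint name as labeled in CUCM: TYPE/S<slot>/SU<subunit>/<port>@<gateway>
_ENDPOINT_RE = re.compile(
    r"^(?P<type>[A-Za-z0-9]+)/S(?P<slot>\d+)/SU(?P<subunit>\d+)/(?P<port>\d+)"
    r"@(?P<gateway>\S+)$"
)


def parse_mgcp_endpoint(name):
    """Split an endpoint name into type, slot, subunit, port and gateway."""
    match = _ENDPOINT_RE.match(name.strip())
    if not match:
        raise ValueError(f"not an analog MGCP endpoint name: {name!r}")
    parts = match.groupdict()
    for key in ("slot", "subunit", "port"):
        parts[key] = int(parts[key])
    return parts


def voice_port(parts):
    """Return the slot/subunit/port string used with 'show voice port'."""
    return f"{parts['slot']}/{parts['subunit']}/{parts['port']}"


if __name__ == "__main__":
    endpoint = parse_mgcp_endpoint("AALN/S2/SU0/0@WILCXFS01VAR01")
    print(endpoint)
    print("Gateway:", endpoint["gateway"], "voice port:", voice_port(endpoint))
```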

·        Port status in CUCM Registered: If the port status is registered, close the alarm stating that the port status is registered with the call manager.

·        Port status in CUCM Rejected: If the port status is Rejected in CUCM, verify whether the port has a DID configured. If not, open a change request to have the configuration removed. If a DID is configured, check the status of the port on the MGCP gateway using the following command.


Log in to the gateway and use:

WILCXFS01VAR01#sh voice port 2/0/0

The above command will show the following result (Command output has been customized for brevity)

Foreign Exchange Station 2/0/0 Slot is 2, Sub-unit is 0, Port is 0

 Type of Voice Port is FXS
 Operation State is DOWN
 Administrative State is DOWN

Port status is Administrative DOWN: If the port is administratively down, we need to close the alarm stating that the port status is administratively down.

Port status is Administrative UP: If the port is administratively UP, the configuration of the MGCP port on the gateway is fine. We need to reset the port in CUCM and also check with the user whether the device is still in use. If resetting the port in CUCM does not resolve the issue and the device is in use, we need to get bank approval to bounce the port on the gateway. If this does not resolve the issue, escalate the case to L2.
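
The two cases above depend only on the "Administrative State" line of the voice port output. A minimal sketch of that decision, assuming the output contains the "Operation State is ..." and "Administrative State is ..." lines exactly as in the sample; the function names are hypothetical.

```python
import re


def voice_port_states(output):
    """Return (operation_state, administrative_state) from 'sh voice port' output."""
    oper = re.search(r"Operation State is (\S+)", output)
    admin = re.search(r"Administrative State is (\S+)", output)
    return (
        oper.group(1) if oper else None,
        admin.group(1) if admin else None,
    )


def next_step(output):
    """Pick the next step of the procedure from the administrative state."""
    _, admin = voice_port_states(output)
    if admin == "DOWN":
        return "close alarm: port is administratively down"
    if admin == "UP":
        return "reset port in CUCM and confirm the device is still in use"
    return "state not found: check the output manually"


if __name__ == "__main__":
    sample = """Foreign Exchange Station 2/0/0 Slot is 2, Sub-unit is 0, Port is 0
 Type of Voice Port is FXS
 Operation State is DOWN
 Administrative State is DOWN
"""
    print(voice_port_states(sample))
    print(next_step(sample))
```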

7.  ALARM_CodeYellow 
Description: Code Yellow is a state that Call Manager enters when it is not able to process its internal signals at a fast enough rate to provide good quality of service for voice and video calls.

Before we start troubleshooting the Code Yellow alarm, we need to identify whether it was triggered by a scheduled activity (e.g. a CTI server reboot) or by some other issue (e.g. a sudden spike in CPU utilization).

Code yellow due to scheduled activity:   If the Code Yellow alarm was generated due to a scheduled activity, an alarm is generated in ROSMAP and RTMT sends an email notification stating that CUCM has entered the Code Yellow state. Scheduled activities such as a CTI server reboot are generally performed over the weekend. We need to wait for the exit notification; once it is received, we close the Code Yellow alarm stating that the site has exited the Code Yellow state.

Code yellow due to other reasons: If the Code Yellow was generated for a reason other than a scheduled activity, i.e. a sudden spike in the CPU utilization of the server, we need to log in to RTMT and check the application log for the reason. Once the reason for the Code Yellow is identified through the syslog, escalate the case to L2 for further investigation. Please follow the document below to troubleshoot Code Yellow and call throttling on CUCM.


SNMP alarms:

SNMP:cpqAccelBattery.1

ciscoEnvMonFanState.1

SNMP_cpqLogDrvCondition


SNMP alarms are generated for various reasons. We need to open the case and look into the log entries to find out why the SNMP alarm was generated.

Hard drive failure: SNMP_cpqLogDrvCondition
Description: This alarm is generated when a physical or logical drive on the server is in a critical state. We first need to identify the exact status of the drive using the on-demand polling function of ROSMAP.
Process: Click on the Hammer alongside the alarm as shown in the below figure.
Scroll through the items in the list on the subscriber page and find the entry “cpqLogDrvCondition.0 ..... Condition of Logical Drive 2”, then click the P symbol next to it, as shown below, for the real-time status of the drive.

If the polling status shows “<IP Address of the server> cpqLogDrvCondition.X is dead”, the drive is in a critical state (the exact state can be found from the CLI of the CCM server). If the on-demand polling status is “<IP Address of the server> cpqLogDrvCondition.X is alive”, the drive is OK, but we still need to confirm the status from the CLI of the CCM server.

a)  Run the following command on the CCM CLI:
admin:show hardware
This will show the following output:

Smart Array 6i in Slot 0
Logical Drive: 2
Size: 67.8 GB
Fault Tolerance: 1+0
Heads: 255
Sectors Per Track: 32
Cylinders: 17433
Stripe Size: 128 KB
Status: Ok
Array Accelerator: Enabled
Has Data On Drive: True
Unique Identifier: 600508B10018433953525030344F0008
Preferred Controller Chassis Slot: 1

If the status is Ok, close the alarm and paste the CLI output into the alarm.

b)  If the output of the above command shows the status Interim Recovery:

Smart Array 6i in Slot 0
Logical Drive: 2
Size: 67.8 GB
Fault Tolerance: 1+0
Heads: 255
Sectors Per Track: 32
Cylinders: 17433
Stripe Size: 128 KB
Status: Interim Recovery

Array Accelerator: Enabled
Has Data On Drive: True
Unique Identifier: 600508B10018433953525030344F0008
Preferred Controller Chassis Slot: 1

In this case we need to open a case for drive replacement. Open a ticket with the hardware vendor along with the ADU report.
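
The check in a) and b) comes down to reading the "Status:" line that follows each "Logical Drive:" line. Here is a minimal sketch of that check, assuming the show hardware output has the layout of the two samples above; the function names are hypothetical.

```python
import re


def logical_drive_status(output):
    """Return {drive_number: status} from 'show hardware' output."""
    statuses = {}
    current = None
    for line in output.splitlines():
        drive = re.match(r"\s*Logical Drive:\s*(\d+)", line)
        if drive:
            current = int(drive.group(1))
            continue
        status = re.match(r"\s*Status:\s*(.+)", line)
        if status and current is not None:
            statuses[current] = status.group(1).strip()
            current = None  # only the first Status line after a drive counts
    return statuses


def drives_needing_ticket(output):
    """Drives whose status is anything other than Ok need a vendor ticket."""
    return [
        drive for drive, status in logical_drive_status(output).items()
        if status.lower() != "ok"
    ]


if __name__ == "__main__":
    sample = """Smart Array 6i in Slot 0
Logical Drive: 2
Size: 67.8 GB
Status: Interim Recovery
Array Accelerator: Enabled
"""
    print(logical_drive_status(sample))
    print("Open vendor ticket for drives:", drives_needing_ticket(sample))
```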
ADU stands for Array Diagnostic Utility. The ADU report gives extensive detail about the status of the RAID. It is produced in two stages: first the SFTP server is prepared to receive the file, then the report is generated and transferred from the CLI.

SFTP server: First we need to enable the SFTP service on the SFTP server (21.1.202.202).

Log in to the server using RDP.

Click on All Programs and select FreeFTPd as shown below.
If the SFTP server is offline, we need to start it.


Click on SFTP and click the Start button as shown below.

Once the SFTP service starts, the status will show as below.


Once SFTP is up and running, we need to go to the CLI using the SecureCRT tool to generate the report and save it on the SFTP server.

As shown below use the command on the CLI

admin:utils create report hardware

This will show the below mentioned output.

*** WARNING ***

This process can take several minutes as the disk array, remote console,
system diagnostics and enviromental systems are probed for their currentvalues.
Continue (y/n)?y
HP Detected - Collecting Disk Array Data...SmartArray Equipped server detected...Done
Collecting Environmental Data...Done
Collecting Remote Console System Log Data...Done
Creating single compressed system report...Done
System report written to SystemReport-20130819043713.tgz
To retrieve diagnostics use CLI command:
file get activelog platform/log/SystemReport-20130819043713.tgz

Use the command below to transfer the generated report to the SFTP server.

admin:file get activelog platform/log/SystemReport-20130819043713.tgz
Please wait while the system is gathering files info ...done.
Sub-directories were not traversed.
Number of files affected: 1
Total size in Bytes: 25251
Total size in Kbytes: 24.65918
Would you like to proceed [y/n]? y
SFTP server IP: 21.1.202.202
SFTP server port [22]:
User ID: deekuma4
Password: *****
Download directory: /

Transfer completed.

This will save the output of the ADU report on the SFTP server as shown below

Attach the ADU report in the ticket that we are escalating to the vendor for the hard drive replacement.
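
The file name changes with every run, so the file get command has to be rebuilt from the output of utils create report hardware each time. A small sketch, assuming the output contains the "System report written to ..." line and that the report lands under platform/log as in the transcript above; the function name is hypothetical.

```python
import re


def report_fetch_command(create_output):
    """Build the 'file get activelog' command from the report creation output.

    Returns None when no 'System report written to ...' line is found.
    """
    match = re.search(r"System report written to (\S+\.tgz)", create_output)
    if not match:
        return None
    # platform/log is the directory shown in the CLI transcript above.
    return f"file get activelog platform/log/{match.group(1)}"


if __name__ == "__main__":
    sample = (
        "Creating single compressed system report...Done\n"
        "System report written to SystemReport-20130819043713.tgz\n"
    )
    print(report_fetch_command(sample))
```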

Note: The ADU report helps identify whether there are any other issues with the system. When we open a ticket with the vendor, we should provide this report so they can find the root cause and any additional issues.

Update the Alarm status with the ticket and change the status to “Vendor case open”.

Alarm IPXXXXXXXXXX UPTIME


This alarm is the notification of a server reboot. The server can be of any type: CCM, UCx, CTI, etc.

We need to identify whether the server really rebooted or whether there was a connection flap between ROSMAP and the server that made ROSMAP assume the server had rebooted.
The best way to identify a server reboot is to check RTMT for any log of a server reboot. If there is such a log, we need to log in to the CLI and check the uptime of the server using the following command.
admin:show status

This will show the server uptime.

Uptime: 09:33:34 up 92 days, 14:09, 1 user, load average: 0.55, 0.60, 0.63
Once the server is rebooted we need to perform the QA checks to ensure the server is running normally after the reboot. Along with this we need to raise a TAC case for the RCA for unexpected server reboot.
Note: if the server has been rebooted due to scheduled activity we need to perform the QA and close the Alarm stating that the QA has been performed and the re-boot was due to scheduled activity.
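
To compare the uptime with the time of the alarm, the Uptime line can be converted into a duration. This is a minimal sketch that assumes the line follows the usual Linux uptime layout shown above ("up N days, HH:MM", "up HH:MM" or "up N min"); the function names and the 24-hour window are assumptions, not part of the procedure.

```python
import re
from datetime import timedelta

# Matches "up 92 days, 14:09", "up 14:09" and "up 12 min".
_UPTIME_RE = re.compile(
    r"up\s+(?:(\d+)\s+days?,\s*)?(?:(\d+):(\d+)|(\d+)\s+min)"
)


def parse_uptime(line):
    """Return the uptime as a timedelta, or None if the line is not understood."""
    match = _UPTIME_RE.search(line)
    if not match:
        return None
    days, hours, minutes, only_minutes = match.groups()
    return timedelta(
        days=int(days or 0),
        hours=int(hours or 0),
        minutes=int(minutes or only_minutes or 0),
    )


def rebooted_recently(line, window=timedelta(hours=24)):
    """True when the server has been up for less than the given window."""
    uptime = parse_uptime(line)
    return uptime is not None and uptime < window


if __name__ == "__main__":
    sample = "Uptime: 09:33:34 up 92 days, 14:09, 1 user, load average: 0.55, 0.60, 0.63"
    print(parse_uptime(sample))
    print("Rebooted in the last 24 hours:", rebooted_recently(sample))
```

An uptime of 92 days, as in the sample, suggests a connection flap rather than a real reboot.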

UPTIME:

Router/server UPTIME alarms indicate that a device has rebooted. It is good that the device has come back up, but we need to investigate why it rebooted in the first place.

A device may reboot due to a power failure or a software or service crash.

1.  Power failure: Contact the site or Voice Tower about any power issues. If this is part of a larger power issue, we may not be able to do much. But if only our router rebooted due to a power failure, we have to engage HP NOC to arrange a dispatch to check for possible cable or UPS failures.

2.   Software/service crash: Check the logs of the router or server to identify the reason for the restart. For UC devices, use RTMT to isolate the issue further. Possible causes of a reboot include high CPU/memory utilization or a software bug.

      carschlr
Description: The Cisco CAR Scheduler service allows you to schedule CAR-related tasks; for example, you can schedule report generation or CDR file loading into the CAR database. This service starts automatically.
Investigation procedure:

Go to a tool server and login to RTMT. Server details will be there on the alarm.

Go to Services, verify the status, and check whether the alarm has returned to normal. Close the case if it has.

We can also check whether the service is up from the Call Manager Serviceability page: Serviceability/Tools > Control Center - Network Services.