ROSMAP SNMP IP-PBX ALARMS
Troubleshooting Process
Today we will discuss the CUCM alarms generated in the monitoring tool and the basic troubleshooting for these SNMP traps. For each alarm, the error code, its explanation, and the troubleshooting steps to follow are given below.
1. Gateway Status:
Description: The Call Managers have reported that a gateway has deregistered:
334266: May 26 09:48:20.570 UTC : %CCM_CALLMANAGER-CALLMANAGER-2-MGCPGatewayLostComm: MGCP communication to gateway lost. Device Name:mtbozwm03var01 App ID:Cisco CallManager Cluster ID:IPTRNW20 Node ID:IPTRNW20RCHVEC1. Check!
Investigation procedure:
1. The above alarm indicates an MGCP connection issue.
2. Verify port registration in CUCM. (If the ports are registered in CUCM, skip to step 8.)
3. Verify that the gateway is reachable through ping. (If reachable, skip to step 8.)
4. If the gateway is not reachable, verify with Voice Tower whether there is a power issue.
5. If there is no power issue, engage onsite to check for a network issue at the site.
6. If onsite rules out the network, ask them to open a tracker ticket for an FE visit to check for a physical connectivity issue.
7. If the FE verifies the physical connectivity and cabling, engage Cisco TAC for a router RMA based on the symptoms.
8. Log in to the gateway and issue three show commands: 1) “show mgcp” to verify that MGCP is active on the router, 2) “show ccm-manager” to verify MGCP registration, and 3) “show mgcp endpoint” to verify the registered MGCP endpoints.
9. Log in to CUCM at the same time to check whether the ports are configured properly.
10. If the T1/E1 is bouncing, engage the carrier; if it is an FXS/FXO port, send an FE to check the cabling.
11. Follow up with the Telco until resolution.
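When triaging this alarm, the first details needed are the gateway name, cluster ID and node ID from the alarm text. The following is a minimal Python sketch of pulling those fields out of the sample alarm shown in the description. The function name and the returned dictionary keys are illustrative assumptions, not part of any monitoring tool API.

```python
import re

# Field labels exactly as they appear in the sample alarm text above.
# The dictionary keys are hypothetical names chosen for this sketch.
FIELD_PATTERNS = {
    "device_name": r"Device Name:\s*(\S+)",
    "cluster_id": r"Cluster\s+ID:\s*(\S+)",
    "node_id": r"Node\s+ID:\s*([^\s.]+)",
}


def parse_mgcp_lost_comm(alarm_text):
    """Return the gateway, cluster and node named in an MGCPGatewayLostComm alarm."""
    # The alarm body is often wrapped across lines, so collapse whitespace first.
    flat = " ".join(alarm_text.split())
    if "MGCPGatewayLostComm" not in flat:
        return None
    fields = {}
    for key, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, flat)
        fields[key] = match.group(1) if match else None
    return fields


sample = """%CCM_CALLMANAGER-CALLMANAGER-2-MGCPGatewayLostComm: MGCP communication to
gateway lost. Device Name:mtbozwm03var01 App ID:Cisco CallManager Cluster
ID:IPTRNW20 Node ID:IPTRNW20RCHVEC1. Check!"""

print(parse_mgcp_lost_comm(sample))
```

The device name is the gateway to ping and log in to in steps 3 and 8, and the cluster ID identifies which CUCM cluster to open in step 2.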
2. Alarm PhoneDereg
Description: This alarm is generated when phones at a site become unregistered or lose connectivity to the call manager.
Alarm severity: 2
SNMP Alert: Priority: [1 - Critical] - Status: [New] - Assigned: [] - Category: [AutoCase, Cisco_ROS, Servers, AMER, AMER-GWM, AMER-US]
Updated by: (Mgmt Application Platform Cisco) - Fri, 27 Jun 2014 15:29:06 EDT
Workflow: []
New Auto Case Opened
PhoneDereg events have been received in the previous 60 seconds!
Device IP [Reason Code]: DeviceName - Description
10.186.53.115 [KeepAliveTimeout (13)] : SEPxxxxxxxxxxxx – End Users Name , XXX-XXX-XXXX***End User ID
10.186.53.63 [KeepAliveTimeout (13)] : SEPxxxxxxxxxxxx - End Users Name XXX-XXX-XXXX *** End User ID.
10.186.53.199 [KeepAliveTimeout (13)] : SEPxxxxxxxxxxxx - End Users Name XXX-XXX-XXXX *** End User ID
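The alert body above has one line per deregistered phone, in the form "Device IP [Reason Code]: DeviceName - Description". As a rough sketch, assuming the body always follows that layout, the lines can be parsed and grouped by their first three octets to see whether the affected phones sit on one subnet. The function names and the returned keys are hypothetical.

```python
import re
from collections import Counter

# One event line: "<ip> [<ReasonName> (<code>)] : <DeviceName> ..."
EVENT_RE = re.compile(
    r"^(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s*"
    r"\[(?P<reason>\w+)\s*\((?P<code>\d+)\)\]\s*:\s*"
    r"(?P<device>\S+)"
)


def parse_phone_dereg(body):
    """Return one dict per event line found in a PhoneDereg alert body."""
    events = []
    for line in body.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            events.append({
                "ip": match["ip"],
                "reason": match["reason"],
                "code": int(match["code"]),
                "device": match["device"],
            })
    return events


def count_by_subnet(events):
    """Count events per first-three-octets prefix of the phone IP."""
    return Counter(event["ip"].rsplit(".", 1)[0] for event in events)


sample = """PhoneDereg events have been received in the previous 60 seconds!
Device IP [Reason Code]: DeviceName - Description
10.186.53.115 [KeepAliveTimeout (13)] : SEPxxxxxxxxxxxx - End Users Name
10.186.53.63 [KeepAliveTimeout (13)] : SEPxxxxxxxxxxxx - End Users Name
10.186.53.199 [KeepAliveTimeout (13)] : SEPxxxxxxxxxxxx - End Users Name"""

events = parse_phone_dereg(sample)
print(count_by_subnet(events))
```

Several phones on the same prefix with the same reason code point toward a site or network problem rather than a single faulty phone.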
Investigation procedure:
1. Log in to CUCM and find the site ID where the phone is located.
2. Search for the phone by MAC address.
3. Check the description of the phone and search again with the description as the criterion.
4. Step 3 lists the phones with the same description, which belong to the same site.
5. Ping or browse to some of the phone IPs to check whether they respond.
6. If the phones do not respond, engage HP NOC to troubleshoot a data network issue.
When this alarm appears in ROSMAP, it carries a name of the form IPTRXXXXXXXXXXXX, where the highlighted alphanumeric value represents the cluster name. (A reference sheet lists the IP addresses, from which we can find the IP address of the CUCM server.)
The MAC addresses of the deregistered phones are listed in the body of the alarm. Log in to the cluster identified above and search for each MAC found in the alarm. From the phone configuration page, you can identify the device pool of the phone, as shown in the figure below.
Next, open that device pool and search for all the phones with unregistered status. (Press CTRL + F and type Unregistered.)
If more than five phones are unregistered, perform the Inbound Outbound test for the site.
Inbound Outbound Call Test Procedure:
· Open the Called Synthetic Phone (labeled <Device Pool>-CROS-IPCOM-TEST-Called) for that particular site and enter the number (HP 918662873191) in the Forward All field. The synthetic phone is just a virtual phone added with a dummy MAC address and the profile of a 7960 phone. The status of the synthetic phone is irrelevant: it may be unregistered or unknown, and in both states the phone can be used for Call Forward All.
· Make a call to this Called synthetic number. If you reach the HP IVR, the test stands TRUE; otherwise we need to check gateway reachability. There can be several scenarios in which the phones cannot make inbound and outbound calls.
3. PRI alarms:
Description: The controllers are down, or there is no Layer 2 connectivity between the gateway and the service provider. In this case, use the command below on the gateway to check whether Layer 2 connectivity is active.
Investigation procedure:
Log in to the router.
R1# show isdn status
It will show the result below:
Global ISDN Switchtype = primary-ni
ISDN Serial0/0/0:23 interface
dsl 0, interface ISDN Switchtype = primary-ni
Layer 1 Status:
ACTIVE
Layer 2 Status:
TEI = 0, Ces = 1, SAPI = 0, State = MULTIPLE_FRAME_ESTABLISHED
Layer 3 Status:
0 Active Layer 3 Call(s)
Active dsl 0 CCBs = 0
The Free Channel Mask: 0x807FFFFF
Number of L2 Discards = 0, L2 Session ID = 8
The line below shows that Layer 2 connectivity is established between the gateway and the service provider.
“TEI = 0, Ces = 1, SAPI = 0, State = MULTIPLE_FRAME_ESTABLISHED”
If the state shows TEI_ASSIGNED, there is no Layer 2 connectivity; escalate the case to the service provider and follow up with them.
“TEI = 0, Ces = 1, SAPI = 0, State = TEI_ASSIGNED”
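The Layer 2 check described above can be sketched in Python as follows. This assumes the output layout shown in this section, where the Layer 1 value appears on the line after its label and the Layer 2 state appears after "State =". The function names are hypothetical and the parser has only been reasoned through against the sample here.

```python
import re


def isdn_layer_status(output):
    """Map each ISDN interface in `show isdn status` output to its L1/L2 state."""
    status = {}
    current = None
    expect_layer1 = False
    for raw in output.splitlines():
        line = raw.strip()
        header = re.match(r"ISDN (\S+) interface", line)
        if header:
            current = header.group(1)
            status[current] = {"layer1": None, "layer2": None}
            continue
        if current is None:
            continue
        if line.startswith("Layer 1 Status:"):
            # The value may share the label's line or sit on the next line.
            rest = line.split(":", 1)[1].strip()
            if rest:
                status[current]["layer1"] = rest
            else:
                expect_layer1 = True
            continue
        if expect_layer1 and line:
            status[current]["layer1"] = line
            expect_layer1 = False
            continue
        state = re.search(r"State = (\S+)", line)
        if state:
            status[current]["layer2"] = state.group(1)
    return status


def layer2_established(status):
    """True only if every interface found reports MULTIPLE_FRAME_ESTABLISHED."""
    return bool(status) and all(
        entry["layer2"] == "MULTIPLE_FRAME_ESTABLISHED" for entry in status.values()
    )


sample = """Global ISDN Switchtype = primary-ni
ISDN Serial0/0/0:23 interface
dsl 0, interface ISDN Switchtype = primary-ni
Layer 1 Status:
ACTIVE
Layer 2 Status:
TEI = 0, Ces = 1, SAPI = 0, State = MULTIPLE_FRAME_ESTABLISHED
Layer 3 Status:
0 Active Layer 3 Call(s)"""

print(isdn_layer_status(sample))
```

Any other Layer 2 state makes `layer2_established` return False, which corresponds to the escalation path to the service provider.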
b) T1 controllers are down: Check the controllers on the gateway using the following command:
R1# show controllers t1
T1 0/0/0 is up.
Applique type is Channelized T1
Cablelength is long 0db
Description: DID Range(s): 972-233-2198, 972-233-2816, 972-233-2896, 972-383-6900 ~ 6999, 972-385-4220 ~ 4229, 972-386-1300 ~ 1399, 972-980-8600 ~ 8699
No alarms detected.
alarm-trigger is not set
Soaking time: 3, Clearance time: 10
AIS State:Clear LOS State:Clear LOF State:Clear
Version info Firmware: 20071011, FPGA: 13, spm_count = 0
Framing is ESF, Line Code is B8ZS, Clock Source is Line.
CRC Threshold is 320. Reported from firmware is 320.
Data in current interval (262 seconds elapsed):
0 Line Code Violations, 0 Path Code Violations
0 Slip Secs, 0 Fr Loss Secs, 0 Line Err Secs, 0 Degraded Mins
0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 0 Unavail Secs
Total Data (last 3 15 minute intervals):
0 Line Code Violations, 0 Path Code Violations,
0 Slip Secs, 0 Fr Loss Secs, 0 Line Err Secs, 0 Degraded Mins,
0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 0 Unavail Secs
The output “T1 0/0/0 is up” shows that the controller is up.
The output “T1 0/0/0 is down” shows that the controller is down.
1. Ensure that the controller or voice port is not admin down. If it is admin down, check the logs for any record of the circuit being shut down.
2. If no record is found in show log, open an SSR to remove the circuit from monitoring.
3. If the T1/E1 is bouncing or down, engage the carrier for fault isolation.
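The controller check and the numbered steps above can be sketched as a small lookup. This is only an illustration: it assumes each controller block in `show controllers t1` begins with a line such as "T1 0/0/0 is up." and the wording of the next steps is paraphrased from the list above. All names are hypothetical.

```python
import re

# Matches the first line of each controller block, e.g. "T1 0/0/0 is up."
CONTROLLER_RE = re.compile(
    r"^\s*T1\s+(\d+(?:/\d+)+)\s+is\s+(administratively down|up|down)",
    re.MULTILINE,
)

# Next steps paraphrased from the numbered list above.
NEXT_STEP = {
    "up": "no action: controller is up",
    "administratively down": "check logs for a record of the shutdown",
    "down": "engage carrier for fault isolation",
}


def t1_controller_states(output):
    """Return {controller: state} for every T1 controller in the output."""
    return {name: state for name, state in CONTROLLER_RE.findall(output)}


sample = """T1 0/0/0 is up.
Applique type is Channelized T1
No alarms detected.
T1 0/0/1 is down.
Applique type is Channelized T1"""

for name, state in t1_controller_states(sample).items():
    print(name, state, "->", NEXT_STEP[state])
```

The second controller in the sample is invented for illustration so that both branches are exercised.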
4. ALARM_TranConnError
Description: This alarm is generated when a device is not able to register to CUCM. It is also generated together with a phone deregistration alarm, so after troubleshooting it can be updated with the same logs as the phone deregistration alarm.
Investigation procedure:
Sometimes the MAC of the phone does not appear in the alarm details. In that case, follow the steps below:
· Go to the RTMT Application Syslogs and find the Transient Connection Failure alarm. Double-click the alarm to see the details of the device, i.e. the MAC of the phone.
· Log in to the T server and enter the IP address of the phone in the browser. This opens the phone's web page (only if the phone is registered with the call manager), where the MAC address is shown.
· This MAC can now be searched for in the respective cluster.
If the phone is not found in any of the clusters, it needs to be configured in CUCM.
· In this scenario we do not know which device pool to configure the phone in. Follow the steps below to find it:
· Go to the CUCM page > Device > Gateway and click Find. This lists all the gateways along with their IP addresses; match the first three octets of the IP in the alarm to the IP of a gateway.
· Knowing the gateway gives us the device pool for that phone, and we can configure the phone with the description “To Stop Transient Connection Failure Error”.
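The octet matching in the last steps can be illustrated with a short Python sketch. The gateway names and addresses below are made up for the example (documentation address ranges), and the function names are hypothetical; in practice the list would come from the CUCM Device > Gateway page.

```python
def first_three_octets(ip):
    """Return the first three octets of a dotted IPv4 address, e.g. '192.0.2'."""
    return ".".join(ip.split(".")[:3])


def gateways_for_phone(phone_ip, gateways):
    """Return the gateways whose IP shares the phone's first three octets."""
    prefix = first_three_octets(phone_ip)
    return [
        name
        for name, gateway_ip in gateways.items()
        if first_three_octets(gateway_ip) == prefix
    ]


# Hypothetical gateway list; names and addresses are placeholders.
gateways = {
    "SITEA-GW01": "192.0.2.1",
    "SITEB-GW01": "198.51.100.1",
}

print(gateways_for_phone("192.0.2.57", gateways))
```

This only works when the phones and the gateway at a site share the same first three octets, which is the assumption the manual procedure already makes.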
Description: This alarm is generated when the conference bridge is unregistered with the call manager.
Investigation procedure:
1. Log in to the Call Manager > Media Resources > Conference Bridge as shown below (fig 1.1).
(fig 1.1)
2. Status Registered: If the status of the conference bridge is registered, close the alarm stating that the conference bridge is registered with the call manager (fig 1.2).
Status Unregistered: If the status is unregistered, there are two possible scenarios: A) there is no network connectivity between the gateway and CUCM; B) there is network connectivity but the gateway is not able to register.
A) If there is no network connectivity, restore it by contacting the network team and having them engage the service provider.
B) If there is network connectivity between the gateway and CUCM, get the proper bank approvals to reset the Conference Bridge via Call Manager. We may also need to log in to the gateway and bounce SCCP. To bounce SCCP, first verify that the gateway is a Cisco managed gateway and get the proper approvals from the bank to perform the SCCP reset.
Router(config)# no sccp (to stop SCCP on the gateway)
Router(config)# sccp (to re-enable SCCP on the gateway)
6. Alarm GRP_NODE:
In general, the above alarm is generated in two categories:
1. Synthetic call failure
2. An MGCP endpoint fails to register with the Call Manager.
Alarm GRP_NODE (Synthetic call failure)
Description: This alarm is generated when a site fails the Automatic IPCOM test.
Investigation Procedure:
In this case we need to perform the manual test for that site and check whether inbound and outbound calls are working. If the manual test succeeds, close the alarm stating that the site has passed manual IPCOM testing.
If the manual test fails, identify the cause (i.e. why inbound and outbound calls are not working) and update the alarm with the reason. Also change the status of the alarm from “Pending Resolution” to “Vendor case open” if a ticket has been logged with a third party (Telco etc.).
Alarm GRP_NODE (MGCP endpoint fails to register)
Description: This alarm is generated when an MGCP endpoint fails to register with the call manager. In some cases an endpoint is configured on the call manager only to suppress transient connection errors.
Process:
Scenario 1: The port is configured to suppress transient connection errors. In this case, check the port on the gateway to confirm its status. For these endpoints the port status is generally down, because they exist only to stop the transient connection errors.
Scenario 2: The port is configured with a device connected to it, such as a postage machine, fax machine, or other analog device. In this case, check the registration status of the MGCP port with the call manager and also check the port status on the gateway.
Checking the port status on the Call Manager: We need to understand how an MGCP port is labeled in CUCM, for example AALN/S2/SU0/0@WILCXFS01VAR01:
AALN: Analog access line
S2: Slot 2
SU0: Sub unit 0
0: Port 0
@WILCXFS01VAR01: MGCP gateway name
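The label breakdown above maps directly onto the slot/sub-unit/port numbering used by the gateway's voice port commands. A minimal Python sketch of that mapping follows, assuming every label has the TYPE/S<slot>/SU<subunit>/<port>@<gateway> shape of the example; the function name and dictionary keys are hypothetical.

```python
import re

# Shape of the example label: AALN/S2/SU0/0@WILCXFS01VAR01
ENDPOINT_RE = re.compile(
    r"^(?P<type>[A-Za-z]+)/S(?P<slot>\d+)/SU(?P<subunit>\d+)/(?P<port>\d+)"
    r"@(?P<gateway>\S+)$"
)


def parse_mgcp_endpoint(label):
    """Split a CUCM MGCP endpoint label into its parts, or return None."""
    match = ENDPOINT_RE.match(label.strip())
    if not match:
        return None
    slot, subunit, port = match["slot"], match["subunit"], match["port"]
    return {
        "type": match["type"],
        "slot": int(slot),
        "subunit": int(subunit),
        "port": int(port),
        "gateway": match["gateway"],
        # slot/subunit/port form, as used with the gateway's voice port commands
        "voice_port": f"{slot}/{subunit}/{port}",
    }


print(parse_mgcp_endpoint("AALN/S2/SU0/0@WILCXFS01VAR01"))
```

For the example label the derived voice port is 2/0/0 on gateway WILCXFS01VAR01.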
· Port status in CUCM Registered: If the port status is registered, close the alarm stating that the port is registered with the call manager.
· Port status in CUCM Rejected: If the port status is rejected in CUCM, verify whether the port has a DID configured. If not, open a change request to have the configuration removed. If a DID is configured, check the status of the port on the MGCP gateway using the following command.
Log in to the gateway and use:
WILCXFS01VAR01#sh voice port 2/0/0
The above command will show the following result (command output has been trimmed for brevity):
Foreign Exchange Station 2/0/0 Slot is 2, Sub-unit is 0, Port is 0
Type of Voice Port is FXS
Operation State is DOWN
Administrative State is DOWN
Port status is Administrative down: If the port is admin down, close the alarm stating that the port is administratively down.
Port status is Administrative UP: If the port is administratively up, the configuration of the MGCP port is fine on the gateway. Reset the port in CUCM, and check with the user whether the device is still in use. If resetting the port in CUCM does not resolve the issue and the device is in use, get bank approval to bounce the port on the gateway. If this does not resolve the issue, escalate the case to L2.
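The decision between the two administrative states can be sketched as below. This assumes the voice port output contains the "Operation State is" and "Administrative State is" lines shown above; the function names and the action wording are illustrative only.

```python
import re


def voice_port_states(output):
    """Return (operation_state, administrative_state) from voice port output."""
    operation = re.search(r"Operation State is (\S+)", output)
    admin = re.search(r"Administrative State is (\S+)", output)
    return (
        operation.group(1).upper() if operation else None,
        admin.group(1).upper() if admin else None,
    )


def next_action(output):
    """Suggest the next step from the procedure above, based on the admin state."""
    _, admin = voice_port_states(output)
    if admin == "DOWN":
        return "close alarm: port is administratively down"
    if admin == "UP":
        return "reset port in CUCM and confirm the device is still in use"
    return "state not found: review the output manually"


sample = """Foreign Exchange Station 2/0/0 Slot is 2, Sub-unit is 0, Port is 0
Type of Voice Port is FXS
Operation State is DOWN
Administrative State is DOWN"""

print(voice_port_states(sample), next_action(sample))
```

The later steps (bank approval to bounce the port, escalation to L2) depend on human follow-up and are deliberately left out of the sketch.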
7. ALARM_CodeYellow
Description: Code Yellow is a state that Call Manager enters when it is not able to process its internal signals fast enough to provide good quality of service for voice and video calls.
Before starting Code Yellow troubleshooting, identify whether the Code Yellow was triggered by a scheduled activity (e.g. a CTI server reboot) or by some other issue (e.g. a sudden spike in CPU utilization).
Code yellow due to scheduled activity:
When the Code Yellow alarm is caused by a scheduled activity, an alarm is generated in ROSMAP and RTMT sends an email notification stating that CUCM has entered the Code Yellow state. Scheduled activities such as a CTI server reboot are generally performed over the weekend. Wait for the CTI exit notification; once it is received, close the Code Yellow alarm stating that the site has exited the Code Yellow state.
Code yellow due to other reasons: When the Code Yellow has a cause other than a scheduled activity (e.g. a sudden spike in the CPU utilization of the server), log in to RTMT and check the application log for the reason. Once the reason is identified through the syslog, escalate the case to L2 for further investigation. Please follow the document below to troubleshoot Code Yellow and call throttling on CUCM.
SNMP alarms:
SNMP:cpqAccelBattery.1
ciscoEnvMonFanState.1
SNMP_cpqLogDrvCondition
SNMP alarms are generated for various reasons. Open the case and look into the log entries to find out why the SNMP alarm was generated.
Hard drive failure: SNMP_cpqLogDrvCondition
Process: Click the Hammer next to the alarm, as shown in the figure below.
Scroll through the items in the list on the subscriber page, find cpqLogDrvCondition.0.....Condition of Logical Drive 2, and click the P symbol next to it, as shown below, for the real-time status of the drive.
If the polling status shows “<IP address of the server> cpqLogDrvCondition.X is dead”, the drive is in a critical state (the state can be found from the CLI of the CCM server). If the on-demand polling status is “<IP address of the server> cpqLogDrvCondition.X is alive”, the drive is OK; confirm the status from the CLI of the CCM server to double check.
a) Run the following command on the CCM CLI.
admin:show hardware
This will show the output below:
Smart Array 6i in Slot 0
Logical Drive: 2
Size: 67.8 GB
Fault Tolerance: 1+0
Heads: 255
Sectors Per Track: 32
Cylinders: 17433
Stripe Size: 128 KB Status: Ok
Array Accelerator: Enabled
Has Data On Drive: True
Unique Identifier: 600508B10018433953525030344F0008
Preferred Controller Chassis Slot: 1
If the status is Ok, close the alarm and paste the CLI result into the alarm.
b) If the status in the output of the above command is Interim Recovery:
Smart Array 6i in Slot 0
Logical Drive: 2
Size: 67.8 GB
Fault Tolerance: 1+0
Heads: 255
Sectors Per Track: 32
Cylinders: 17433
Stripe Size: 128 KB Status: Interim Recovery
Array Accelerator: Enabled
Has Data On Drive: True
Unique Identifier: 600508B10018433953525030344F0008
Preferred Controller Chassis Slot: 1
In this case we need to open a case for drive replacement.
Open a ticket with the hardware vendor along with the ADU report.
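The two cases above (Status: Ok versus Status: Interim Recovery) can be told apart with a short parser. The sketch below assumes the "Logical Drive: N" and "Status: ..." lines appear as in the sample output; only the two statuses shown in this document are known here, so any status other than Ok is simply flagged for a closer look. Function names are hypothetical.

```python
import re


def logical_drive_status(output):
    """Return {logical_drive_number: status} from `show hardware` output."""
    statuses = {}
    drive = None
    for raw in output.splitlines():
        line = raw.strip()
        header = re.match(r"Logical Drive:\s*(\d+)", line)
        if header:
            drive = int(header.group(1))
            continue
        status = re.search(r"\bStatus:\s*(.+)$", line)
        if status and drive is not None:
            statuses[drive] = status.group(1).strip()
    return statuses


def drives_needing_attention(statuses):
    """Return the drives whose status is anything other than Ok."""
    return {drive: s for drive, s in statuses.items() if s.lower() != "ok"}


sample = """Smart Array 6i in Slot 0
Logical Drive: 2
Size: 67.8 GB
Fault Tolerance: 1+0
Stripe Size: 128 KB Status: Interim Recovery
Array Accelerator: Enabled"""

print(drives_needing_attention(logical_drive_status(sample)))
```

An empty result corresponds to case a), where the alarm is closed with the CLI output pasted in.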
ADU stands for Array Diagnostic Utility. The ADU report gives extensive detail about the status of the RAID. Generating the ADU report takes two stages: first, preparing the SFTP server where the file will be saved; second, generating the report from the CLI.
SFTP server: First, enable the SFTP service on the SFTP server (21.1.202.202).
Log in to the server using RDP.
Click All Programs and select FreeFTPd, as shown below.
If the SFTP server is offline, start it: click SFTP and then click the Start button, as shown below.
Once the SFTP service starts, the status will show as below.
Once SFTP is up and running, go to the CLI using the SecureCRT tool to generate the report and save it on the SFTP server.
Use the following command on the CLI, as shown below:
admin:utils create report hardware
This will show the below mentioned output.
*** WARNING ***
This process can take several minutes as the disk array, remote console, system diagnostics and enviromental systems are probed for their current values.
Continue (y/n)?y
HP Detected - Collecting Disk Array Data...SmartArray Equipped server detected...Done
Collecting Environmental Data...Done
Collecting Remote Console System Log Data...Done
Creating single compressed system report...Done
System report written to SystemReport-20130819043713.tgz
To retrieve diagnostics use CLI command: file get activelog platform/log/SystemReport-20130819043713.tgz
Use the command below to transfer the report to the SFTP server.
admin:file get activelog platform/log/SystemReport-20130819043713.tgz
Please wait while the system is gathering files info ...done.
Sub-directories were not traversed.
Number of files affected: 1
Total size in Bytes: 25251
Total size in Kbytes: 24.65918
Would you like to proceed [y/n]? y
SFTP server IP: 21.1.202.202
SFTP server port [22]:
User ID: deekuma4
Password: *****
Download directory: /
Transfer completed.
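The report file name changes on every run, so the retrieval command has to be built from the output of the report command. A minimal sketch, assuming the output contains the "System report written to ..." line shown above and that the file lands under platform/log as in this transcript; the function name is hypothetical.

```python
import re


def report_retrieval_command(create_report_output):
    """Build the `file get activelog` command for the report named in the output."""
    match = re.search(r"System report written to (\S+\.tgz)", create_report_output)
    if not match:
        return None
    # platform/log/ is the location shown in the transcript above.
    return f"file get activelog platform/log/{match.group(1)}"


sample = """Creating single compressed system report...Done
System report written to SystemReport-20130819043713.tgz
To retrieve diagnostics use CLI command: file get activelog platform/log/SystemReport-20130819043713.tgz"""

print(report_retrieval_command(sample))
```

The CLI output already prints the full retrieval command; the sketch is mainly useful when the session is being scripted and the file name must be captured.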
This saves the ADU report on the SFTP server, as shown below.
Attach the ADU report to the ticket that we are escalating to the vendor for the hard drive replacement.
Note: The ADU report helps identify whether there are any other issues with the system. When opening a ticket with the vendor, provide this report so they can find the root cause and any additional issues.
Update the alarm with the ticket and change the status to “Vendor case open”.
Alarm IPXXXXXXXXXX UPTIME
This alarm is the notification of a server reboot. The server can be anything: CCM, UCx, CTI, etc.
We need to identify whether the server actually rebooted, or whether a connection flap between ROSMAP and the server led ROSMAP to assume that the server had rebooted.
The best way to identify a server reboot is to check RTMT for any server reboot log. If there is such a log, log in to the CLI and check the uptime of the server using the following command.
admin:show status
This will show the server uptime.
Uptime: 09:33:34 up 92 days, 14:09, 1 user, load average: 0.55, 0.60, 0.63
If the server has rebooted, perform the QA checks to ensure the server is running normally after the reboot. In addition, raise a TAC case for the RCA of the unexpected server reboot.
Note: If the server was rebooted due to a scheduled activity, perform the QA and close the alarm stating that the QA has been performed and the reboot was due to scheduled activity.
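To separate a real reboot from a connection flap, the uptime line can be compared with how long ago the alarm fired. The sketch below parses the uptime line in the format shown above ("up 92 days, 14:09, 1 user, ..."); it also accepts the "up 5 min" and "up 14:09" variants, which is an assumption based on the usual Linux uptime format rather than on output captured in this document. Function names are hypothetical.

```python
import re
from datetime import timedelta


def parse_uptime(status_line):
    """Return the uptime in a `show status` Uptime line as a timedelta, or None."""
    match = re.search(r"\bup\s+(.*?),\s+\d+\s+users?\b", status_line)
    if not match:
        return None
    days = hours = minutes = 0
    for part in match.group(1).split(","):
        part = part.strip()
        day_part = re.fullmatch(r"(\d+)\s+days?", part)
        clock_part = re.fullmatch(r"(\d+):(\d+)", part)
        minute_part = re.fullmatch(r"(\d+)\s+min", part)
        if day_part:
            days = int(day_part.group(1))
        elif clock_part:
            hours, minutes = int(clock_part.group(1)), int(clock_part.group(2))
        elif minute_part:
            minutes = int(minute_part.group(1))
    return timedelta(days=days, hours=hours, minutes=minutes)


def rebooted_within(status_line, window):
    """True if the reported uptime is shorter than the given window."""
    uptime = parse_uptime(status_line)
    return uptime is not None and uptime < window


sample = "Uptime: 09:33:34 up 92 days, 14:09, 1 user, load average: 0.55, 0.60, 0.63"
print(parse_uptime(sample), rebooted_within(sample, timedelta(hours=1)))
```

An uptime of 92 days against an alarm raised minutes ago suggests a flap between ROSMAP and the server rather than a reboot.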
UPTIME:
Router / server UPTIME alarms are raised for rebooting devices. It is good that the device has come back up, but we need to investigate why it rebooted in the first place.
A device may reboot due to a power failure or a software or service crash.
1. Power failure: Contact the site or Voice Tower about any power issues. If this is part of a larger power issue, we may not be able to do much. But if only our router rebooted due to a power failure, engage HP NOC to arrange a dispatch to check for possible cable or UPS failures.
2. Software / service crash: Check the logs of the router or server to identify the reason for the restart. Use RTMT for UC devices to isolate the issue further. Possible causes of the reboot include high CPU / memory utilization or a bug.
carschlr
Description: The Cisco CAR Scheduler service allows you to schedule CAR-related tasks; for example, you can schedule report generation or CDR file loading into the CAR database. This service starts automatically.
Investigation procedure:
Go to a tool server and log in to RTMT. The server details are in the alarm.
Go to services, verify the status, and check whether the alarm has returned to normal. Close the case if it has.
We can also check whether the service is up from the Call Manager serviceability page: Serviceability/Tools > Control Center - Network Services.