Wednesday, January 30, 2008

Monitoring CPU usage with OpenSTA via SNMP

the system’s web server, application server and database were deployed on Solaris machines. The information required was the CPU usage of the system during the test. The objective was to be able to quantify the CPU usage of the system for high-level performance modelling purposes. OpenSTA provides for the definition of SNMP collectors, and Solaris provides an SNMP service.

An initial examination of SNMP Collectors

Once the Solaris SNMP Service has been started, it is possible to define an SNMP Collector that can read a value from it.

This can be done as follows:

1. In the Commander, right-click on the ‘Collectors’ folder and select ‘New Collector -> SNMP’.

2. Name the Collector appropriately, and then open it.

3. In the Collector’s ‘Edit Query…’ dialog, put the machine’s name into the ‘Address’ section, and then click ‘Browse Queries’.

This provides a drop-down list of the collectable data that is available, along with the current value of each item. It is worth noting that OpenSTA can only monitor values that are returned via SNMP as an integer. This is not too limiting, but it is worth being aware of when designing the tests to be performed.

As an example, using the ‘interfaces’ section of the system it is possible to obtain details of the amount of network data coming into the system with a request of the following form:

public interfaces.ifTable.ifEntry.ifInOctets.2

Unfortunately, the standard interfaces provided do not include CPU usage figures. It is, therefore, necessary to examine the issue in more detail.

Examining the Solaris SNMP Service

What was required?

OpenSTA can send ad-hoc SNMP requests to a system. The correct request to send to a system is defined using a MIB file, which is published by the equipment’s manufacturer. The MIB definitions are available in various public libraries as well as from the manufacturer, and using the SNMP resource links at the OpenSTA portal it was relatively easy to obtain the Solaris MIB files.

To help in the examination of the system, an SNMP browsing tool was also obtained. In this case I used the Getif tool, but there are a number of tools available. The tool allows the SNMP tree from the target Solaris system to be examined.

The contents of the Solaris MIB

Examining the Solaris MIB file, the following entry was located:

rsUserProcessTime OBJECT-TYPE
SYNTAX Counter
ACCESS read-only
DESCRIPTION
"total number of timeticks used by user processes
since the system was last booted."
::= { sunHostPerf 1 }

Similar entries were available for the ‘Idle’, ‘Nice’, and ‘System’ time. Searching upwards in the MIB file, the following section was located:

sun OBJECT IDENTIFIER ::= { enterprises 42 }
products OBJECT IDENTIFIER ::= { sun 2 }
sunMib OBJECT IDENTIFIER ::= { sun 3 }

sunHostPerf OBJECT IDENTIFIER ::= { sunMib 13 }

SNMP requests are defined as a series of ‘.’-separated numbers, with non-table entries terminated by ‘.0’. Thus, the correct SNMP request will end: *.42.3.13.1.0. The ‘Getif’ tool allows the SNMP tree to be examined, and it provides the full SNMP request as “.1.3.6.1.4.1.42.3.13.1.0”, or symbolically “.iso.org.dod.internet.private.enterprises.sun.sunmib.sunhostperf.userticks”.

Building the OpenSTA Request

Comparing the interfaces request shown earlier with the requests above, it is clear they are in different formats. Using the full SNMP request in OpenSTA doesn’t provide correct results, so the question arises of how to use the located SNMP request with OpenSTA. The key comes from examining the Microsoft SNMP requests built in to OpenSTA. There the requests start “public enterprises.”, with the rest being Microsoft specific. In the Sun MIB, ‘42’ (for sun) follows the ‘enterprises’ section. Thus the request within OpenSTA is:

public enterprises.42.3.13.1.0
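The derivation above can be followed mechanically. The sketch below is only an illustration: the table is copied by hand from the MIB excerpts quoted earlier, and the function names (`sub_ids`, `numeric_oid`, `opensta_query`) are hypothetical helpers, not part of OpenSTA or any SNMP library. It assumes the object is a scalar, so the instance suffix is 0.

```python
# Parent/child assignments copied from the MIB excerpts quoted in this post.
MIB = {
    "sun": ("enterprises", 42),
    "products": ("sun", 2),
    "sunMib": ("sun", 3),
    "sunHostPerf": ("sunMib", 13),
    "rsUserProcessTime": ("sunHostPerf", 1),
}

# iso.org.dod.internet.private.enterprises
ENTERPRISES_OID = (1, 3, 6, 1, 4, 1)


def sub_ids(name):
    """Return the sub-identifiers of `name` below 'enterprises'."""
    ids = []
    while name != "enterprises":
        parent, number = MIB[name]
        ids.append(number)
        name = parent
    return tuple(reversed(ids))


def numeric_oid(name, instance=0):
    """Full dotted OID; scalar (non-table) objects end in instance 0."""
    parts = ENTERPRISES_OID + sub_ids(name) + (instance,)
    return "." + ".".join(str(p) for p in parts)


def opensta_query(name, community="public", instance=0):
    """Query string in the 'public enterprises.' form described above."""
    parts = sub_ids(name) + (instance,)
    return f"{community} enterprises." + ".".join(str(p) for p in parts)


print(numeric_oid("rsUserProcessTime"))    # .1.3.6.1.4.1.42.3.13.1.0
print(opensta_query("rsUserProcessTime"))  # public enterprises.42.3.13.1.0
```

The queries for the ‘Idle’, ‘Nice’ and ‘System’ counters would be built the same way once their numbers under sunHostPerf have been read from the MIB file.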

Using this request within the OpenSTA Query field and then monitoring the results provides correct values. Note that the numbers are counters, with values accumulated since the machine was last rebooted, so if the results are to be displayed as graphs it is worth selecting the ‘Delta Value’ option in the Query dialog. Similar Queries can then be constructed to obtain the other CPU allocation categories.

Using the results

To examine the usage of the data collected, the requests for the different CPU types were placed in a single collector definition. This collector can then be placed within a Monitoring script, which may then be manually started and stopped. When using the collector in a test, however, it will start and stop with the test by default.
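To make the ‘Delta Value’ idea mentioned above concrete, the following sketch turns successive raw counter readings into per-interval increases. It is an illustration of what a delta means, not a description of how OpenSTA computes it. The readings are hypothetical, and the wrap handling relies on the fact that an SNMPv1 Counter is a 32-bit value that wraps back to zero.

```python
COUNTER_MODULUS = 2 ** 32  # an SNMPv1 Counter is 32 bits wide and wraps to 0


def counter_deltas(readings):
    """Turn successive raw counter readings into per-interval increases."""
    deltas = []
    for previous, current in zip(readings, readings[1:]):
        # Modular subtraction stays correct across a single counter wrap.
        deltas.append((current - previous) % COUNTER_MODULUS)
    return deltas


# Hypothetical raw readings of one tick counter, taken 5 s apart.
raw = [152940, 152941, 152945, 152945]
print(counter_deltas(raw))  # [1, 4, 0]
```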

The first stage of investigating the data being collected was to compare the results measured over time across all the CPU fields with the output of ‘top’ running on the target machine. The ‘top’ program displays, among other data, the percentage CPU usage over a variable time period. Whilst the monitoring was running, the values of the collectors were graphed over time. When using a delta value the first figure tends to be a large peak, after which the values settle down. It is, therefore, possible to obtain a reasonable view of the data during the test by using a ‘Rolling Graph’ in the monitor window and waiting for the initial peak to scroll out of the window. Empirically, this shows that the counters are incremented by a total of 100 ticks per second on a single-CPU system, with those ticks spread across the CPU categories appropriately.
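For comparison with the percentages that ‘top’ reports, the tick deltas for one sampling interval can be converted into percentages by dividing each category by the total. The sketch below assumes the four deltas belong to the same interval. The sample figures are made up for illustration (500 ticks, matching a 5 s sample at 100 ticks per second on a single CPU), and `cpu_percentages` is a hypothetical helper name.

```python
def cpu_percentages(deltas):
    """Convert per-category tick deltas for one interval into percentages."""
    total = sum(deltas.values())
    if total == 0:
        # No ticks recorded in this interval, so nothing can be apportioned.
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * ticks / total for name, ticks in deltas.items()}


# Hypothetical deltas for one 5 s sample on a single-CPU machine.
sample = {"user": 50, "nice": 0, "system": 25, "idle": 425}
print(cpu_percentages(sample))
# {'user': 10.0, 'nice': 0.0, 'system': 5.0, 'idle': 85.0}
```

Dividing by the measured total rather than by a fixed 500 keeps the result sensible when the sampling interval is not exactly 5 s.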

Once the data has been collected it may be exported to Excel for detailed analysis. In attempting this analysis, one issue that needs to be dealt with is the fact that the data arrives back at the collectors at different times in the export.

The following data was collected using a 5s sampling rate on a system that is mostly idle:
User Ticks Nice Ticks System Ticks Idle Ticks
00:06 0 152940 1141326 3.72E+08
00:11 0 0 1 2 498


When the system is busy the data can also get further out of step. In interpreting the information it is important to remember that the underlying data is a total for each category, and not a snapshot value. Where CPU usage measurement metrics are required for an operation, rather than a visual indication or general validation, it may be worthwhile not using delta values. Alternatively, the delta values may be added back together between two points in time.
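As a rough sketch of adding the delta values back together, the code below sums the deltas per category between two points in time and tolerates rows where a category is missing because its value arrived out of step. The row layout, the figures and the name `ticks_between` are all assumptions made for illustration and do not reflect the actual OpenSTA export format.

```python
def ticks_between(samples, start, end):
    """Sum delta ticks per category for samples with start <= time <= end.

    `samples` is a list of (seconds, {category: delta}) pairs; a category
    may be absent from a row when its value arrived in a different row.
    """
    totals = {}
    for seconds, deltas in samples:
        if start <= seconds <= end:
            for name, ticks in deltas.items():
                totals[name] = totals.get(name, 0) + ticks
    return totals


# Hypothetical export rows, with one idle figure arriving a second late.
rows = [
    (10, {"user": 40, "system": 10}),
    (11, {"idle": 450}),
    (15, {"user": 60, "system": 15, "idle": 425}),
]
totals = ticks_between(rows, 10, 15)
busy = totals["user"] + totals["system"]
print(totals)  # {'user': 100, 'system': 25, 'idle': 875}
print(round(100.0 * busy / sum(totals.values()), 1))  # 12.5
```

Because each category is summed independently, it does not matter that the values for one sample are spread over neighbouring rows, provided the chosen time window contains all of them.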
