This document describes the monitoring system designed for IP providers and used in RELCOM and a few other Russian networks. The system is oriented toward 24x7 operator staff; the operators' regulations are described in a separate document.
The system consists of the following parts:
The monitoring process works as follows:
The pre-defined information pages (stored in the 'OUT' subdirectory) are generated in advance for speed, and can also be requested through the CGI script.
The system control:
Every object has a type, defined by one letter (R, I, B, M, see above), and a unique name. The following is collected for every object:
Every object is polled every 10 to 30 seconds, as defined in the Poll.conf configuration file. The results are written into the 'IFSUM' file, and the system redraws the screen view every 30 seconds (or on every CGI request). The polled data are summarised over an _average_ period (usually 2 to 3 minutes) and then recorded into the accounting ('stat') file, where both the _average_ and the _maximum_ values are written. These files are used for the _graphs_ and _reports_, or can be viewed as _raw_ data on the operator's (WWW) screen.
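The averaging step can be illustrated with a minimal Python sketch. It is not the daemon's actual code: the function name `summarise_period` and the sample readings are assumptions made for this example only.

```python
from statistics import mean


def summarise_period(samples):
    """Collapse the polls of one averaging period into (average, maximum)."""
    if not samples:
        return None  # nothing was polled during this period
    return mean(samples), max(samples)


# Hypothetical load readings (percent) from six 20-second polls, i.e. 2 minutes.
polls = [10, 12, 55, 11, 9, 11]
average, maximum = summarise_period(polls)
print(f"avg={average:.1f} max={maximum}")
```

Recording the maximum next to the average keeps short peaks visible that the average alone would smooth away.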
The snmpstatd daemon (which polls the routers in the background) determines whether the object state is normal and sets the status _OK_, _WARNING_ or _ERROR_. In addition, the status _UNDEFINED_ is set if the daemon cannot collect data about an object. The WARNING state is equivalent to _OVERLOADED_. The WWW system converts these states (O, W, E, U) by adding a _priority_ digit according to the elapsed time (the E status is converted first to the E0 state, then to the E1 or E4 state depending on the object priority, and so on); this keeps the operator from watching spurious events (short failures and the like). The state defines the color used to draw an object on the screen and, in some cases, the sound clip the system plays for important events.
In the WWW views, the status is shown by color, and some other parameters are shown by numbers in the table and (for a channel) by colored bars. An operator can choose the screen view: the total view, alarms only, or the full view for a single router. In addition, there is a _status_ view, which shows the total number of objects of each kind in each state and (importantly) is the view responsible for the music alarms.
An object state is generated by the monitor and can be changed by _operator tickets_. An _operator ticket_ is a record in the journal that defines a NEW state as derived from the OLD one, with a comment, an expiration time and, optionally, a condition under which the ticket is removed (the flag _remove this ticket when the normal state is restored_). Only a few states are generated by the monitoring; more states can be set by the operator. Moreover, there are 2 types of tickets: the _permanent comment_ and the _comment to the current state_. The first is used by the senior operator or by the sysadmin to change the object status (and priority) permanently, and should be replaced by the priority in future revisions.
The table below describes all states, their origins and the corresponding colors and sounds for the default configuration (this is configurable and can be changed during installation):
Table 1. Standard states.
Name | Origin | State | Color | Weight | Sound | ||
---|---|---|---|---|---|---|---|
BGP | Channel: | Router: | |||||
E0 | Monitor | Just failed | MAROON | 220 | |||
E1 | Monitor | Failure | RED | 270 | sound, music | ||
E2 | Operator | Failure - fixing in progress | AQUA | 250 | |||
E3 | Operator | Failure - can't be fixed | PURPLE | 210 | |||
E4 | Monitor or operator | Important failure | FUCHSIA | 280 | sound, music | sound, music | |
O0 | Monitor | Just restored | LIME | 10 | |||
O1 | Monitor | Normal | GREEN | 5 | |||
O2 | Monitor | Normal | GREEN | 5 | |||
U0 | Monitor or operator | No data | BLUE | 200 | |||
U1 | Operator | Not considered | GRAY | 200 | |||
U2 | Operator | In debug | NAVY | 200 | |||
U3 | Operator | Out of our competence | BLACK | 200 | |||
W0 | Monitor | Overload appeared | OLIVE | 120 | |||
W1 | Monitor | Overload | YELLOW | 180 | |||
W2 | Operator | Overload can't be fixed | TEAL | 150 |
In the table above, the origin column describes where a state can come from. The monitoring system itself can create the states O0 and O1 (_just restored_ and _normal_); E0 and E1 (errors: E0 means _the error appeared recently_ and E1 means _the error has lasted more than 2 minutes_); E4 (as E1, but for IMPORTANT objects; this revision decides whether an object is IMPORTANT by its name, treating all objects named in CAPITAL letters as important, which will be changed in future releases); W0 and W1 (warnings that have just appeared or have lasted more than 2 minutes); and U0 (the object, or the monitor data for it, cannot be found).
The E4 state makes it possible to single out important errors that affect the whole network rather than one object only. In this release it can be set by the sysadmin through a _permanent comment_, or the system treats any E1 on an object named in CAPITAL letters as E4.
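The way a raw daemon status turns into one of the monitor-generated states can be sketched roughly as follows. This is an illustration only, not the monitor's code: the function names are hypothetical, and applying the same 2-minute threshold to the O and W states is an assumption (the text states it explicitly only for errors and warnings).

```python
from datetime import timedelta

SETTLE = timedelta(minutes=2)  # "more than 2 minutes", as stated in the text


def is_important(name):
    """This revision treats objects named in CAPITAL letters as important."""
    return name.isupper()


def displayed_state(raw, age, name):
    """Map a raw status letter (O, W, E, U) to a monitor-generated state.

    raw  -- one-letter status reported by the daemon
    age  -- how long the object has been in this status (timedelta)
    name -- object name, used only for the CAPITAL-letters rule
    """
    if raw == "U":
        return "U0"                      # no data about the object
    if age < SETTLE:
        return raw + "0"                 # E0, W0 or O0: the status is recent
    if raw == "E":
        return "E4" if is_important(name) else "E1"
    return raw + "1"                     # O1 or W1 (assumed same threshold)


print(displayed_state("E", timedelta(seconds=30), "rich"))   # recent failure
print(displayed_state("E", timedelta(minutes=5), "RICH"))    # important object
```

Operator tickets are deliberately left out of this sketch; they are applied on top of the state produced here.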
All other states are set by the operators; their purpose is to describe the real object state, as investigated by the operator, more precisely.
The rules used by the operators for setting states should be defined in the _OPERATION GUIDE_ and depend on the company. The common rule is to set some state other than E1 / E4 for every failure that does not affect the network as a whole, so that operators can see new events when they appear. If the operators follow this policy, the STATE window always shows all new and uninvestigated errors (failures), and such events are always colored RED in the ALARM window. In future releases we will reduce the number of operator-defined states to just 2 (Failure is fixed, and Failure can't be fixed for now), but with an additional _PRIORITY_ that allows marking any object as, for example, a NOTHING object (priority 0) or a VERY IMPORTANT object (priority 5).
A very important feature of the monitor is playing music clips in case of certain errors; in this revision, these are any ERROR on a router and errors on the important INTERFACES. The clips can be listened to from the table above, and their names and the statuses that trigger them can be changed in the configuration. There are 2 ways to play clips: the MIDI plugin (recommended) and the _WAV_ plugin (not recommended). The first choice is named _MUSIC_ and the second _SOUND_ everywhere in the tables and selection menus. You should install a MIDI plugin to use this feature; the monitor tries to determine whether your browser supports MIDI or WAV files, but this depends on JavaScript features and cannot be guaranteed.
Any state generated by the MONITORING, unless it is the NORMAL state, is accompanied by a REASON; the REASON and the TIME OF EXISTENCE are shown in the different views.
The system collects the following data about the monitored objects:
This monitor revision does not use information about BGP connections, except on the _FULL_ screen, which shows the full information derived from the router.
Now let us look at an example: the screen describing the full router status (including the channels (interfaces) and BGP connections).
Table 2. FULL router screen.
The first line of this table shows the state of the router itself:
The following lines show the channels (interfaces) and the BGP sessions. For example, consider the line describing rich(1):
First, open the start window of the monitor. Usually, it is the 'http://your_server:8100/M' URL for the operator's interface, and 'http://your_server:8100/U' for the link owner.
The system then lets you choose and open one of several views. To make this selection, you should understand which windows exist and what they mean.
The system uses 5 different windows:
This window shows the summary picture (the number of objects of each type in each state) and (IMPORTANT) turns on the alarm sounds in case of an important error. In addition, the most important failure is shown in this window as well (absent in the example above).
[no sound] current sound mode.
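The summary picture can be thought of as a count of objects per type and state, plus the single "heaviest" state. The sketch below is a hedged illustration: the weights are the defaults from Table 1, but reading a higher weight as "more important" is an assumption, as are the function name and the sample object list (only the channel names rich and svyaz come from this guide).

```python
from collections import Counter

# Default weights from Table 1.
WEIGHT = {
    "E0": 220, "E1": 270, "E2": 250, "E3": 210, "E4": 280,
    "O0": 10, "O1": 5, "O2": 5,
    "U0": 200, "U1": 200, "U2": 200, "U3": 200,
    "W0": 120, "W1": 180, "W2": 150,
}


def summarise_states(objects):
    """objects: iterable of (type_letter, name, state) tuples.

    Returns the per-(type, state) counts and the object with the
    highest weight (assumed here to be the most important failure).
    """
    objects = list(objects)
    counts = Counter((kind, state) for kind, _name, state in objects)
    worst = max(objects, key=lambda obj: WEIGHT[obj[2]], default=None)
    return counts, worst


# Hypothetical snapshot: one router (R) and three interfaces (I).
snapshot = [
    ("R", "core1", "O1"),
    ("I", "rich", "W1"),
    ("I", "svyaz", "E1"),
    ("I", "lan0", "O1"),
]
counts, worst = summarise_states(snapshot)
print(counts[("I", "O1")], worst)
```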
The system offers several pre-defined window layouts. First you see the starting screen, which asks you to choose one of them. This screen looks like this:
MONITOR: [KOI-8] [WIN] | [HOME] | [INFO] | [ADMIN] | [PUBLIC] | [LINKS] | [LINKS INT] |
To open a monitor screen (remember that we are talking about the screens only; the monitor daemon 'snmpstatd' must always be running in the background), you should:
operation | screen |
---|---|
PAGE.frame_all | All in one screen |
PAGE.frame_small | Small menu |
TOTAL | Total network view |
ALARM | Alarms only |
This monitor uses standard HTTP technology. Almost all object names and interface names, as well as the menu bars on the screens, are HTML links and open new screens (in the same or another window) when you click on them. As usual, you can always open any link in a new window with the middle mouse button (on a 3-button mouse) or by pressing the right button and selecting from the menu.
Remember that if the link you choose normally opens in a new external (named) window (such as the LINK window), and this window (1) already exists and (2) is minimized, then, depending on your OS, you may not see the new document at once; you should first find and open the minimized window.
When you work with the monitor, almost all useful information can be shown in the MONITOR screen. This can be the TOTAL or ALARM network view, as well as the network snapshot or the system journal.
There are 3 types of screens (frames) in the monitor. The first type are the screens that are refreshed periodically: these are the SHORT, TOTAL and ALARMS screens called by the main menu buttons. These screens are refreshed every 30 or 60 seconds and are prepared in advance by mon_daemon (which updates the HTML files every 30 seconds). These are the main network views, but they show the picture with a delay of some 30 to 40 seconds, because for performance reasons they are not calculated _on the fly_. The 'snapshot' view, however, is built on the fly at the moment it is called.
The second type are the router pictures shown in the ROUTER window (or frame). These views are calculated on the fly and refresh every 30 seconds (unless you use the T=time parameter). Do not run too many such views at a time, or you may overload the HTTP server.
The third type are static views and menus; they are calculated on the fly but do not refresh automatically.
Every refreshed screen carries the time and the status number for which it was calculated; they are shown at the top of the TOTAL, ROUTER, ALARM and similar screens. In case of trouble (for example, snmpstatd is dead or mon_daemon freezes) you will see a noticeable difference between the current time and the time of this status.
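The same check an operator makes by eye can be expressed as a small sketch. The function name and the tolerance of three missed refreshes are assumptions chosen for illustration; the monitor itself only displays the status time.

```python
from datetime import datetime, timedelta


def is_stale(status_time, now, refresh=timedelta(seconds=30), tolerance=3):
    """Return True if the status is older than `tolerance` refresh periods.

    A delay of 30 to 40 seconds is normal for the pre-built views, so only
    a clearly larger gap suggests that snmpstatd or mon_daemon has stopped.
    """
    return now - status_time > refresh * tolerance


now = datetime(2000, 1, 1, 12, 0, 0)
print(is_stale(now - timedelta(seconds=40), now))  # ordinary delay
print(is_stale(now - timedelta(minutes=5), now))   # daemon probably stuck
```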
So, if you press the total or alarms button, you will see this screen (below is a very simple example of it):
The first line contains the monitoring time for which this status was calculated. Note that this is not the time when the screen was built, but the time for which the data are valid.
Below there are one or more columns with the object descriptions. The format of this description was already defined above, with some abbreviations:
The operator can open the detailed description (and an additional menu) for every object:
An example of this output was shown in Table 2 above. In the new (1.2) revision this view differs slightly by an extra menu bar at the top.
For any (ROUTER, CHANNEL) object you can open a menu bar with buttons for graphs, reports, journal records, the accounting archive, and so on. To call this menu for a channel, click on the channel name in any window. To call it for a router, click on the router's name or choose the router in the SUMM window.
These menus look similar; below is an example (for a channel):
The top line of this menu contains the object type (channel) and the object name (svyaz), followed by the date of the requested accounting. The + and - buttons (not shown in the sample) around this field move the date forward and backward, or the date can be typed into the field directly. You can request monthly accounting instead of daily by entering the year and month into this field in the form YYYY.MM.
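How such a field could be interpreted is sketched below. Only the monthly form YYYY.MM is given in this guide; the daily form YYYY.MM.DD and the function name `parse_period` are assumptions made for the example.

```python
import re
from datetime import date


def parse_period(field):
    """Interpret the accounting date field.

    'YYYY.MM'    -> ("monthly", year, month)
    'YYYY.MM.DD' -> ("daily", year, month, day)   (assumed daily form)
    """
    match = re.fullmatch(r"(\d{4})\.(\d{2})(?:\.(\d{2}))?", field.strip())
    if match is None:
        raise ValueError(f"unrecognised date field: {field!r}")
    year, month = int(match.group(1)), int(match.group(2))
    day = match.group(3)
    # date() raises ValueError for impossible values such as month 13.
    date(year, month, int(day) if day else 1)
    if day is None:
        return ("monthly", year, month)
    return ("daily", year, month, int(day))


print(parse_period("1999.11"))
print(parse_period("1999.11.05"))
```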
The second row contains the menu bar, with these buttons:
For the router object, additional buttons appear:
The graphs for a router need some extra description:
The first graph shows the CPU utilisation (%); extra-high (> 70%) utilisation is shown in yellow.
The second graph shows the memory usage. This graph is somewhat upside-down: it shows the free memory rather than the used memory, because only the free memory is of interest.
If your router allows it, 2 extra graphs are shown: the processor memory and the IO memory (absent in our sample).
Blue marks indicate router (and channel) failures.
The system journal is of great importance in this monitoring system. It is a set of journals where all messages and notes written by the operators are stored (one journal for every object, and one daily journal for every day). In addition, the comments (tickets) system is combined with the journal system; it stores the tickets used to change the object status temporarily or permanently. To call the journal, click on the [journal] button:
Comment to the current status()
This is one of the operator's main tools. It allows the following:
The first table describes the current object state after the Permanent comment, if one exists, has been applied to it. For example, if there is a Permanent comment replacing the state Failure with the state Important failure, it is this last state (Important failure) that is shown here.
Then the list of existing Permanent comments is shown (absent in the example above).
Next comes the list of existing Comments to the event and, if there is no comment corresponding to the current state, an empty form for creating one.
The last part consists of the object journal (its form depends on the monitor revision and can differ slightly from what is documented here).
The first rows (on the white background) contain the ticket header and usually should not be changed. The fields here define the object type, object name and object status to which this ticket applies, as well as the ticket type. If you want to create a ticket concerning the current object and the current object state, do not change these fields. The system admin, however, must change the ticket type to 'Permanent comment' to create a 'Permanent comment'. Of course, you can create a ticket concerning any (not current) state and even any (not current) object, though this is not recommended. Together, these 'white background' fields describe the 'starting state' to which the ticket applies.
The next part of the ticket describes which state should be set instead of the starting one. The rules for setting object states in place of the initial ones should be defined in the OPERATION GUIDE and depend on the company profile and other non-technical issues. The most common idea is to have operators comment on every RED event (E1, E4 and similar failures) after the initial investigation, so that RED-colored alarms are removed from the alarm screen. You can change the state descriptions by editing the configuration file, and use your own states and state names.
The next row determines the expiration rules for this ticket. First, you can limit the lifetime of the ticket to a number of days, hours or minutes, and we strongly advise doing so whenever you do not want the ticket to stay forever. When the time set in these fields has elapsed, the ticket is removed automatically.
The next button determines whether this ticket must be removed when the object returns to its normal state. Answer 'yes' whenever you do not expect numerous successive failures of the object.
If you create a ticket without an expiration time and without 'yes' in the 'Remove when state return to normal' button, the ticket is kept in the system forever, until you remove it manually. This mode is not recommended for frequent use.
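The three removal cases (time limit, return to normal, manual only) can be summarised in a short sketch. The `Ticket` class and its field names are hypothetical and do not reflect the monitor's real ticket format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Ticket:
    new_state: str                          # state set instead of the starting one
    expires_at: Optional[datetime] = None   # None: no time limit
    remove_on_normal: bool = False          # 'Remove when state return to normal'


def should_remove(ticket, now, object_is_normal):
    """Decide whether the ticket is removed automatically."""
    if ticket.expires_at is not None and now >= ticket.expires_at:
        return True                         # time limit exceeded
    if ticket.remove_on_normal and object_is_normal:
        return True                         # object is back to normal
    return False                            # stays until removed manually


now = datetime(2000, 1, 1, 12, 0)
timed = Ticket("E2", expires_at=now + timedelta(hours=4))
forever = Ticket("E3")
print(should_remove(timed, now + timedelta(hours=5), object_is_normal=False))
print(should_remove(forever, now + timedelta(days=365), object_is_normal=True))
```

A ticket with neither condition, like `forever` above, is never removed by this logic, which is why the guide discourages that mode.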
Next is the 'Comment' field, for the operator's comments and other information. We recommend ALWAYS filling in this field.
In the 1.2 revision there is an additional row of buttons defining whether this ticket should be sent to the NOC staff, the LINKS staff and/or the link owner (if the owner's e-mail address is available from the LINKS database). It is not shown in our sample.
The last row of buttons defines the operation on the ticket: create, remove or change it. The button 'Journal record only' lets you make a journal record without creating a ticket in the ticket base and, consequently, without changing the object state.
To edit or remove a 'Permanent comment', set this type, as well as the starting state, in the ticket form, and then fill in the other ticket fields.
There are many journals in the monitoring system. Every monitored object (except BGP in version 1.2) has its own journal; in addition, there is a common system journal split on a per-day basis into small daily files. When you write something about an object (create or edit a ticket, or create a journal record), the system adds records to both the _object_ and the _system_ journal. This makes it possible to get all journal records for any object over its whole lifetime, or to read all journal records for any operational day.
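The double bookkeeping described above can be pictured with an in-memory sketch. The real monitor keeps its journals in files; the class `Journals`, its method names and the sample records are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import datetime


class Journals:
    """Keeps every record twice: per object and per calendar day."""

    def __init__(self):
        self.by_object = defaultdict(list)   # object name -> records
        self.by_day = defaultdict(list)      # date -> records

    def write(self, obj, text, when):
        record = (when, obj, text)
        self.by_object[obj].append(record)       # object journal
        self.by_day[when.date()].append(record)  # daily system journal
        return record


journals = Journals()
journals.write("svyaz", "line down, provider notified", datetime(2000, 1, 1, 9, 30))
journals.write("rich", "overload during backup", datetime(2000, 1, 1, 23, 10))
journals.write("svyaz", "line restored", datetime(2000, 1, 2, 8, 5))

print(len(journals.by_object["svyaz"]))                  # whole history of one object
print(len(journals.by_day[datetime(2000, 1, 1).date()])) # everything for one day
```

Writing each record to both places costs a little storage but makes both kinds of lookup a direct read, with no searching across files.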
To open the system journal, click on the [journal] button in the main menu. Below is an example of such a journal:
These records duplicate the object journal records. In addition, it is possible to add an independent record here, without opening any object, with the write button. The object names in the journal are also clickable and search for the object in the monitoring.
The LINKS database is slightly out of scope for this guide, because it is meant to be a large information database about all customers, providers, points of presence and devices of a particular company. The interface provided by the distribution pack allows you to create, search and edit records describing channels and routers (monitoring objects), as well as providers, regulations and so on. To call this database, use the Search button of the main menu, or the card button of the object menu. Below is an example of the Search screen:
This guide has described revision 1.2 of the monitoring system. Many rules should be defined by the OPERATION GUIDE, which depends on your marketing rules and company profile. Still, some recommendations are common to any company: