Server monitoring is a service included with our DEFCON Management Levels. Depending on the level of support and management you receive, 24×7 monitoring of your server may be included. In this article we take an in-depth look at the server monitoring service and how it can help you maintain the highest level of server uptime.
The above table displays “how much” monitoring is included with each level of DEFCON Management services.
In preparing this article, we brought in the engineers who designed our monitoring system and asked them some specific questions. The article is presented as a question-and-answer session with the system administrators of FastServers.Net.
What software does FastServers.Net use for monitoring and why was it selected?
FastServers.Net recently migrated away from the IPMonitor network utility to the open-source Nagios (formerly NetSaint) monitoring tool. Nagios was selected for its extreme flexibility and the ability to customize the monitors to a level unachievable with the closed-source IPMonitor.
At what intervals does our monitoring system check availability?
Services are checked every five minutes (300 seconds) for non-SNMP-based monitors (Web, Mail, Databases, etc.), and every minute (60 seconds) for SNMP-based monitors.
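To put those intervals in perspective, the short Python sketch below works out how many checks each kind of monitor runs in a day. The dictionary keys and the function name are illustrative only and are not taken from our Nagios configuration.

```python
# Check intervals quoted above, in seconds.
# The key names are illustrative, not real configuration names.
CHECK_INTERVALS = {"non_snmp": 300, "snmp": 60}

SECONDS_PER_DAY = 24 * 60 * 60


def checks_per_day(interval_seconds: int) -> int:
    """Number of checks a single monitor runs in 24 hours."""
    return SECONDS_PER_DAY // interval_seconds


for kind, interval in CHECK_INTERVALS.items():
    print(f"{kind}: {checks_per_day(interval)} checks per day")
```

A non-SNMP monitor therefore runs 288 checks a day and an SNMP monitor 1440. Assuming a failure is noticed on the next scheduled check, the interval is also roughly the longest a failure can go undetected.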
When a failure occurs, what is the default course of action taken by our technical support engineers?
Our first course of action when we detect a service failure is to attempt to log into the server. If we are able to log in, we follow the QRR (Quick Response and Recovery) on file for the server, which outlines the steps the server administrator/owner would like followed in the event of a server or service failure. If no QRR is on record and we are able to log in, our staff will attempt to recover the failed service by starting it. If the machine is found to be unresponsive, our staff will have the server rebooted.
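The default decision flow described above can be sketched in a few lines of Python. This is only an illustration: the function name, the parameters, and the action strings are hypothetical, and the sketch assumes that a failed login means the machine is unresponsive (the case where our credentials are rejected is covered later in this article).

```python
from typing import List, Optional


def first_response(can_log_in: bool, qrr_steps: Optional[List[str]]) -> List[str]:
    """Return the recovery steps staff would take for a failed service.

    can_log_in: whether staff were able to log into the server.
    qrr_steps:  the customer's QRR steps on file, or None if there is none.
    """
    if can_log_in:
        if qrr_steps:
            # A QRR is on file: follow the customer's own steps.
            return list(qrr_steps)
        # No QRR on record: try to bring the failed service back up.
        return ["start the failed service"]
    # Assumption: no login means the machine is unresponsive.
    return ["have the server rebooted"]


print(first_response(True, ["restart the mail service", "phone the owner"]))
print(first_response(True, None))
print(first_response(False, None))
```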
What is a QRR and is it important to file one with FastServers? How would I file a QRR?
Quick Response and Recovery (QRR) is the procedure FastServers.Net follows in the event of a system or port failure. It can be customized to meet the customer’s specific needs, and notes on the recovery process can be added to the customer’s account. By default, every customer already has a standard recovery procedure that includes restarting the service and rebooting the server. In the event of any major failure or long-term downtime, we will contact you by email or phone. As each one of our dedicated servers is slightly different, FastServers.Net offers the ability to customize the recovery procedure. To get started, email sales@fastservers.net for the full details.
Please describe SNMP (in general) and how this is integrated into our monitoring system?
SNMP stands for Simple Network Management Protocol, and is a standard way to pass information about a server (such as system load, disk and memory utilization, RAID health, etc.) to our monitoring system.
Please describe common failures that SNMP finds and how we are able to take an active stance on servers?
SNMP allows us to remotely monitor a server’s overall condition. The load monitor is the one that trips most often, and it allows our staff to catch servers that may be acting abnormally.
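As a rough illustration of how a load monitor turns a reading into an alert, the Python sketch below compares a load average against warning and critical thresholds. The thresholds of 5.0 and 10.0 are made-up example values, not the ones we use, and the function name is hypothetical. The return values follow the usual Nagios plugin convention of 0 for OK, 1 for WARNING, and 2 for CRITICAL.

```python
# Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_load(load_avg: float, warn: float = 5.0, crit: float = 10.0) -> int:
    """Map a load average to a Nagios-style state.

    The default thresholds are example values only.
    """
    if load_avg >= crit:
        return CRITICAL
    if load_avg >= warn:
        return WARNING
    return OK


for reading in (0.8, 6.2, 14.0):
    print(reading, classify_load(reading))
```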
Please discuss some of the issues we have with all monitoring on a day to day basis and how customers can help us overcome these?
The biggest problem our staff encounters is not being informed of maintenance work. Before stopping services, it is imperative to notify our staff so we can disable the monitors and not interfere with whatever testing or maintenance is being done. A close second is customers modifying services without telling us which service is answering on a specific port and how to (re)start it.
Please discuss some of the more advanced features of monitoring, such as SQL/MySQL search patterns, and how these can help maintain better uptime for customers?
One of the best and most reliable monitors is what we call a SQL QA monitor. This monitor checks a web page that makes a call to the SQL service running on the server. This SQL call returns a specific string of text that is stored only in the database and prints it to the web page. We then check this webpage for this string, which then informs us not only that the SQL service is running (the standard port check), but also that the calls are returning as expected.
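The idea behind the SQL QA monitor can be sketched in Python as follows. This is a minimal illustration, not the check we actually run: the function names, the URL, and the expected string are all hypothetical, and the real monitor is a Nagios check rather than a standalone script.

```python
import http.client
import urllib.request


def page_contains(body: str, expected: str) -> bool:
    """True if the marker string pulled from the database appears in the page."""
    return expected in body


def sql_qa_check(url: str, expected: str, timeout: float = 10.0) -> bool:
    """Fetch the QA page and confirm it shows the string stored in the database.

    Returns False if the page cannot be fetched or the string is missing.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a broken reply.
        return False
    return page_contains(body, expected)


# Hypothetical usage:
# sql_qa_check("http://www.example.com/sqlqa.php", "qa-marker-string")
```

Because the marker string lives only in the database, a page that renders without it tells us the web server is up but the SQL call behind it is not returning as expected.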
Please describe the location of our monitoring systems and how this is a benefit to our customers?
Our monitoring software is on a dedicated network connection, separate from both our Fremont and Cedar Falls facilities. This allows us to detect minor network disturbances and gives our staff an outside view of services.
Roughly how many failures do we handle in a 24-hour period?
Of the 3,500 or so alerts raised in a normal 24 hours, we usually have to respond to about 1,200. Our monitoring software frequently catches customer reboots (which clear without us needing to interact). Sometimes correcting one item clears several monitors, each of which is counted in the 3,500.
Please discuss what happens if we do not have root/admin access to a server when the alarm goes off?
If we cannot log into a server, we cannot correct the downed service. At that point, we have little choice but to disable that profile until we can gain access. We disable monitoring, change the server’s DEFCON level to a temporary placeholder, and notify the email address on file that we were unable to log into the server and have ceased monitoring it. Since we can’t log into the system, we cannot perform service spot checks, making it impossible for us to stay ahead of the game and keep the server and its running services up to date. During this time DEFCON support time is unavailable, so work is billable at our normal hourly rate.
Please discuss the human interaction of the monitoring system and how our technical support team keeps a watchful eye on systems?
All technical support staff on duty keep the web page that shows the status of all monitored servers and services open on a secondary monitor at all times. There is always at least one individual responsible for the bulk of the monitors (the Nagios point man); however, all staff are able at any given time to detect and correct a potential problem.
Please describe the limitations of port monitoring.
Port monitoring is sufficient for most services. If the service stops responding, the port will close. SQL servers (MySQL, PostgreSQL, and MSSQL), however, can remain in a “vegetative” state. In other words, the service is still bound to the port and completing the TCP handshake. The port is still open, but the server or service is not responding correctly. The QA-based monitor for SQL services detects this problem, making this particular SQL monitor much more accurate.
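To see why a port check alone can be fooled, here is a minimal Python sketch of one. The function name and default timeout are illustrative; the point is that the check succeeds as soon as the TCP handshake completes and never asks the service to do any real work.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """True if a TCP connection to host:port can be established.

    This only proves that something is bound to the port and accepting
    connections, not that the service behind it answers correctly.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# Hypothetical usage against a MySQL server on its default port:
# port_is_open("db.example.com", 3306)
```

A process that is bound to the port but hung will still pass this check, which is exactly the “vegetative” state described above.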
Please describe when/how custom recovery procedures might be needed?
Custom recovery procedures are usually required when one particular service outranks the rest of the services on the server, or when your staff must be told of any and all outages.
Please describe what type of monitoring we set up by default for DEFCON 4, 3, 2, and 1 clients?
DEFCON 4 clients get a single HTTP port monitor (if HTTP was installed by our staff); otherwise they get an RDP or SSH check, depending on the OS (Windows vs. Linux).
DEFCON 3 profile servers get HTTP, SMTP, DNS, and SQL unless otherwise specified, along with SNMP monitors for disk utilization and load.
DEFCON 2, by default, gets the same services monitored as DEFCON 3, plus POP3 and FTP.
The DEFCON 1 profile expands on DEFCON 2 server monitoring by adding HTTPS and Control Panel monitors.
Of course, if one of these services is not installed on the server, we will substitute another running service for it.
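Because each level builds on the one before it, the default profiles above can be written down compactly. The Python sketch below is only a way of picturing them: the names and the function are hypothetical and do not reflect how the profiles are stored in our monitoring system, and the substitution of missing services is not modelled.

```python
from typing import List

# Default monitors per level, as listed above. Each level extends the previous one.
DEFCON_3 = ["HTTP", "SMTP", "DNS", "SQL", "SNMP disk utilization", "SNMP load"]
DEFCON_2 = DEFCON_3 + ["POP3", "FTP"]
DEFCON_1 = DEFCON_2 + ["HTTPS", "Control Panel"]


def default_monitors(level: int, http_installed_by_staff: bool = True,
                     os_name: str = "linux") -> List[str]:
    """Return the default monitor list for a DEFCON level (1 to 4)."""
    if level == 4:
        if http_installed_by_staff:
            return ["HTTP"]
        # Fall back to the remote-login service for the operating system.
        return ["RDP"] if os_name.lower() == "windows" else ["SSH"]
    profiles = {3: DEFCON_3, 2: DEFCON_2, 1: DEFCON_1}
    if level not in profiles:
        raise ValueError(f"unknown DEFCON level: {level}")
    return list(profiles[level])


for level in (4, 3, 2, 1):
    print(level, default_monitors(level))
```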
In conclusion, as a managed service provider, our team of system engineers is watching your server 24×7. Getting involved with the monitoring process, asking questions, and customizing monitoring and recovery procedures to meet your exact needs will enhance your overall dedicated hosting experience. Here are three tips to maintain maximum uptime.
Tip 1: Always inform our staff of password changes.
Tip 2: If you are rebooting your server, open a ticket letting our staff know.
Tip 3: Get involved and ask questions – information on procedures is freely available to you.