Chapter 7. Administration

Once you have configured your cluster, you will need to maintain it. This chapter introduces the tools that you can use to monitor the function and performance of your cluster. These tools can also be used to modify the function of the cluster and troubleshoot any problems that may arise.

This chapter will focus on the following topics:

7.1. Administrative Tools

There are several tools that you can use to maintain your cluster. We've already covered some of the basics in Chapter 4, where we introduced the turboclusteradmin and tlclbconfig programs. These are the primary tools used to configure your cluster and make changes to it.

Most of the changes you will make to the configuration of the cluster will be adding new servers or services. This works much the same way as the initial configuration. The main difference is that it is easier to add new servers and services after you have gotten the cluster running. That's one reason we recommend starting out with a simple configuration and adding more servers and services later. It is easier to get a simple system working than a complex one.

Another reason you might want to change your cluster settings is to tune the cluster to optimize performance.

7.1.1. Tuning the Cluster

Once you have successfully gotten your cluster running, it is a good idea to tune it. There are several parameters that you can modify to optimize the performance of the cluster. The main settings that you can tune are the kernel table sizes and the time settings.

If you want to really optimize the performance of the cluster, you can use advanced network monitoring tools, such as network analyzers. You can determine how much overhead network traffic the ATM generates. This overhead includes heartbeat broadcasts, server pings, and ASA service checks. You can then fine-tune the frequency of the system checks. Just remember that when you increase the time between checks you also increase the time that services may be unavailable if one of the servers goes down.

7.1.1.1. Kernel Table Sizes

The first settings that you should modify are the sizes of the tables that the kernel module uses. The SpeedLink module has tables for servers, services, and connections. You should set these large enough to cover the maximum usage that your cluster will receive, but not too much larger. It should be fairly simple to figure out the optimal number of servers and services.

The number of services is pretty self-explanatory. Each named service will need one entry in the table. These services are defined in the `Service Settings' menu within tlclbconfig, with each service having its own `Service' line in the configuration file.

The number of servers is actually the number of server/service pairs. So if you have a cluster handling FTP and HTTP with 10 nodes, you would set the size of the servers table to 20. You may want to specify a table size slightly larger than what you currently need, to make it easier to add nodes later. You can always modify the settings when you add nodes, but you may forget.
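The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name and the `headroom` parameter are assumptions made for the example and are not part of the cluster tools, and the calculation assumes that every node runs every service.

```python
def servers_table_size(num_services: int, num_nodes: int, headroom: int = 0) -> int:
    """Return a size for the servers table: one entry per server/service pair.

    Assumes every node runs every service, as in the FTP/HTTP example.
    `headroom` is a hypothetical number of spare entries for nodes added later.
    """
    if num_services < 0 or num_nodes < 0 or headroom < 0:
        raise ValueError("counts must be non-negative")
    return num_services * num_nodes + headroom


# Two services (FTP and HTTP) on 10 nodes need 20 entries.
print(servers_table_size(2, 10))
```

The resulting value is what you would enter for `NumServers'; passing a small `headroom` leaves room to add nodes without revisiting the setting.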

The maximum number of connections is quite a bit harder to figure out. You definitely don't want to set it too low, because if the connections table fills up, additional incoming connections will not be serviced -- they will simply be ignored. The best way to determine the optimum size of the connections table is to monitor the use of the cluster. Ideally, you want the connections table to be larger than the expected maximum number of connections that you will ever have at one time. A good rule of thumb is to take the largest observed number of connections and double it. This should cover just about any situation, unless your cluster suddenly receives a lot of extra traffic.

If your ATM is servicing NAT cluster nodes, you will need to double the number of connections. This is because the ATM creates a virtual connection from the client to the NAT-translated address, and a real connection from the NAT-translated address to the cluster node.
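The rule of thumb for the connections table, including the NAT adjustment, might be expressed as follows. This is a rough sketch; the function and parameter names are invented for the example, and the observed peak is something you would have to measure on your own cluster.

```python
def connections_table_size(peak_observed: int, uses_nat: bool = False) -> int:
    """Estimate a connections table size from the largest observed load.

    Rule of thumb from the text: double the largest observed number of
    simultaneous connections, then double again if the ATM services NAT
    cluster nodes (each client connection occupies two table entries).
    """
    if peak_observed < 0:
        raise ValueError("peak_observed must be non-negative")
    size = peak_observed * 2
    if uses_nat:
        size *= 2
    return size


# Hypothetical measurement: at most 5000 simultaneous connections were seen.
print(connections_table_size(5000))
print(connections_table_size(5000, uses_nat=True))
```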

These settings are all configured in the `Advanced Traffic Manager Settings' menu in the cluster configuration tool. They are defined in the AtmPool section of the configuration file, and are named `NumServices', `NumServers', and `NumConnections'.

7.1.1.2. Time Settings

There are several time settings that you can use to fine-tune your cluster. The ones that pertain to the cluster as a whole are defined in the same place that the table sizes are defined. These include the connection timeout values and the heartbeat frequency. The other time settings are used to define the frequency of system and service checks.

The connection timeout value specifies how long to maintain an entry in the connections table for connections that are idle, with no communication occurring. It pertains to all services running on the cluster. Some recommended settings are:

  • HTTP 30-60 seconds

  • FTP 15-30 seconds

  • Telnet 300 seconds

If you are running more than one of these services, choose the longest one to use for the cluster.
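As a small illustration of picking the cluster-wide value, the sketch below takes the longest timeout among the services in use. The dictionary uses the upper end of each recommended range, which is an assumption made for the example; the names are hypothetical and not part of the configuration tools.

```python
# Upper ends of the recommended ranges above, in seconds (an assumption:
# you may prefer a lower value within each range).
RECOMMENDED_TIMEOUTS = {"http": 60, "ftp": 30, "telnet": 300}


def cluster_timeout(services: list[str]) -> int:
    """Return the longest recommended idle timeout among the given services."""
    if not services:
        raise ValueError("at least one service is required")
    return max(RECOMMENDED_TIMEOUTS[name] for name in services)


# A cluster running HTTP and FTP would use the HTTP value.
print(cluster_timeout(["http", "ftp"]))
```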

Another set of parameters you can adjust is the frequency of service and server checks. Increasing these values reduces network traffic overhead, but increases the amount of time that a server may be down before it is removed from the cluster. You will have to decide where to make this trade-off. The network overhead is not that great on a 100 Mbps network, unless you have a lot of cluster nodes, so you should probably stick with fairly frequent checks.

To change the frequencies of the system checks, go into the `Server Groups Configuration' menu in the tlclbconfig program. There you will find the `Frequency' settings for server checks and service checks. There are also `Timeout' values, which indicate how long to wait for a response before assuming that the server or service is down. The frequency settings are called `CheckServerFrequency' and `CheckPortFrequency' in the `ServerPool' section of the configuration file.

The `Frequency' settings must always be longer than the corresponding `Timeout' values. Otherwise you would send out another ping before receiving an answer back from the first one. If you got an answer back, you wouldn't know if it was an answer to the first ping or the second.
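The constraint described above can be checked before you commit a change. The following is a minimal sketch, not a feature of tlclbconfig: the function name is hypothetical, and it assumes that both values are expressed in the same unit.

```python
def check_intervals(frequency: float, timeout: float) -> None:
    """Raise ValueError unless the check frequency is longer than its timeout.

    Both values must be in the same unit. If the frequency were not longer,
    a second check could be sent before the first had timed out, and a reply
    could not be matched to the check that triggered it.
    """
    if frequency <= 0 or timeout <= 0:
        raise ValueError("frequency and timeout must be positive")
    if frequency <= timeout:
        raise ValueError(
            f"frequency ({frequency}) must be longer than timeout ({timeout})"
        )


# Hypothetical values: check every 10 units, wait 3 units for a reply.
check_intervals(frequency=10, timeout=3)
print("settings are consistent")
```

You would run the same check once for the server check pair and once for the service check pair.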