7.5. Troubleshooting

Troubleshooting is an important part of maintaining your cluster. In this section we'll cover various tools that you can use to troubleshoot problems. These include the log files and the files in the /proc/net/cluster directory. We'll also take a look at the daemon startup sequence, which you can use as a reference to determine where things may be going wrong. Finally, we'll discuss some common problems and the steps that you can take to resolve them.

The first important point to remember when troubleshooting is that you must test the cluster from a system outside the cluster. Cluster nodes and ATMs cannot be used to test the cluster, because the aliases they create will cause each local system to respond to service requests locally. The client does not have to be on a separate subnet; it just needs to be a system that is not a member of the cluster.

Caution

You must test the cluster by accessing it from client systems that are not a part of the cluster. Testing your cluster from a system that is a part of the cluster may lead you to believe that the cluster is working when it is not, or that it is not working when it is.

Due to the way Turbolinux Cluster LoadBalancer 10 is implemented, systems within the cluster will usually process traffic destined for the cluster themselves, without the traffic having ever been looked at and processed by the ATM.

7.5.1. Log Files

Turbolinux Cluster LoadBalancer 10 writes information to several log files as it works. These log files are stored in /var/log, along with all the other system log files. Poring through log files can be rather tedious, but it can be a powerful tool for locating trouble areas. One thing that will help you to recognize problems is to observe the log files when the system is operating normally. This will give you a baseline reference, and you will be able to identify irregularities more easily.

The primary log file is clusterserverd.log. It contains most of the output from the clusterserverd daemon. We cover some of the output generated in this file in the Daemon Startup section below. The file also contains information about all the server pings and ASA service checks. If any servers or services go down, that information will be listed in this file.

The /var/log/messages file is a standard log file used by the syslog daemon to log kernel messages. The SpeedLink kernel module sends its output to this file, just like any other part of the kernel. If you turn on debugging, the kernel module will generate more output to be sent to this file. This extra information will list each packet that comes in from a client to the ATM and which cluster node it gets forwarded to. We will show you how to turn debugging on in Section 7.5.3.3.

The CMC daemon logs some information into the /var/log/cmc.log file. This file mainly gives information about connections that browsers make to the CMC daemon. This includes SSL password and key exchanges as well as action buttons that are pressed, such as starting and stopping the ATM.

7.5.2. Daemon Startup

The clusterserverd daemon has a well-defined startup procedure that you can monitor to see where things might be failing. You can observe the progress of the daemon, and determine where it has diverged from the normal startup process.

You can use the following command to observe the output as it is generated:

# tail -f /var/log/clusterserverd.log

If you view the /var/log/clusterserverd.log file as the cluster daemon starts up, you will see something similar to the following sequence:

  1. The daemon will start up and issue the message:

    Starting Advanced Traffic Manager daemon

  2. Version information will then be printed, including the build date.

  3. The daemon will display the name of the system it is running on and the IP address:

    Running on atm1.turbolinux.usa (192.168.0.1)

  4. The configuration file name will be listed. The file used will normally be /etc/clusterserver/clusterserver.conf.

  5. The configuration file will be read and parsed.

  6. Any invalid lines in the configuration file will be listed, along with the problem with the line.

  7. If parsing fails, the daemon will display the following message:

     Bad Turbolinux Cluster LoadBalancer 10 configuration file! Going to idle mode

    and not perform any further processing. If you edit the configuration file to correct the error, you can send a HUP signal to the daemon to have it re-read the configuration file and continue the startup process. Use the following command to signal it to re-read the file:

    # killall -HUP clusterserverd

  8. The cluster's broadcast address and network mask will be displayed.

  9. The ip_cs module will be loaded if it is not already running.

  10. Any stale network interface aliases created by Cluster Server (those with :cs0 as the alias part of their name) will be taken down.

  11. If the system is listed as an ATM in the configuration file, it will be configured to start out as a backup ATM.

  12. If the system was configured as a backup ATM in the previous step, it will attempt to locate a primary ATM.

  13. If the system is a backup ATM and no primary ATM is found, it will begin the election process. The election process selects the backup ATM that appears highest in the configuration file and is currently running, and promotes it to primary ATM.

  14. The new interface aliases will be configured.

    • If the system is the primary ATM, an alias of the Ethernet card (usually eth0:cs0) will be configured with the cluster's virtual IP address.

    • If the system is a direct forwarding node, an alias (lo:cs0) will be created on the loopback interface with the virtual IP address of the cluster. The daemon will also write a "1" to /proc/sys/net/ipv4/conf/all/hidden and /proc/sys/net/ipv4/conf/lo/hidden in order to squelch ARP replies.

    • If the system is a tunneled node, the tunl interface will be brought up and an alias (tunl0:cs0) with the cluster's virtual IP address will be created. Bringing up the tunnel interface will load the kernel IP-IP module. The daemon will also write a "1" to /proc/sys/net/ipv4/conf/all/hidden and /proc/sys/net/ipv4/conf/tunl/hidden to make the tunnel interface ignore ARP requests.

    • If the system is a node using NAT forwarding, no changes will be made to the network interfaces.

    • If the system is the primary ATM and has nodes using the NAT method, the NAT gateway address will be created as an alias on the Ethernet card. This could be eth0 or eth1, depending upon which real IP address is in the same subnet as the gateway address. This alias will be named something like eth0:natg.

  15. If the system is the primary ATM, the server and service checks will be started.

    • Each cluster node will be checked, unless configured with the noping option.

    • ASAs will be run for each service on each node.

    If a server or service is found to be inactive, it will be marked as down and temporarily removed from the kernel tables.

  16. The daemon will wait until it gets a signal to shut down. If it is the primary ATM, it will continue performing the service and server checks, until it receives the shutdown signal.

  17. If it gets a shutdown signal, the daemon will clean up and exit. If the system is configured to be a cluster node, the IP aliases will be left as-is. If the system was acting as an ATM only, the aliases will be removed.
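The election rule described in step 13 can be illustrated with a short Python sketch. This is not the daemon's actual implementation; the function name, its arguments, and the host names are assumptions made for illustration. It only shows the selection rule: walk the ATMs in configuration-file order and pick the first one that is running.

```python
def elect_primary(configured_atms, running):
    """Return the first ATM, in configuration-file order, that is running.

    configured_atms: ATM host names in the order they appear in the
                     configuration file (hypothetical input).
    running:         set of host names that are currently up.
    """
    for name in configured_atms:
        if name in running:
            return name
    return None  # no ATM is available to be promoted


# Hypothetical host names, for illustration only.
atms = ["atm1", "atm2", "atm3"]
print(elect_primary(atms, {"atm2", "atm3"}))  # prints "atm2"
```

In this example atm1 is down, so atm2 is chosen because it appears before atm3 in the list.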

7.5.3. Using /proc/net/cluster

In addition to forwarding cluster traffic and maintaining several internal tables, the SpeedLink kernel module creates a directory in /proc that it uses to provide information and allow dynamic configuration. This /proc/net/cluster directory can be helpful when troubleshooting problems with the cluster.

Values written to the files in the /proc/net/cluster directory can directly change the values of variables in the kernel module. These files are the means by which the Cluster Server daemon communicates with the kernel module. Under most circumstances, you should allow the daemon to handle modifying these parameters. However, it is important to know what the parameters mean, so you can read the current values. You can also use these files to help debug problems.

Caution

Writing incorrect values to the files in /proc/net/cluster can cause your system to crash. You should allow the Turbolinux Cluster LoadBalancer 10 and CMC daemons to modify these files. Only modify them by hand if absolutely necessary.

CMC allows you to look at most of these files and modify a few of the parameters. They are on the Status page in CMC, listed under the `Internal Module Status' heading. CMC does a good job of indicating what each piece of information in the files means.

We will cover the meaning and usage of each of the files in /proc/net/cluster:

7.5.3.1. /proc/net/cluster/config

The /proc/net/cluster/config file holds the sizes of the 3 main data structures: the number of services, servers, and client connections, respectively. You can dynamically change these settings by writing to the file. For example, to change the table sizes to 25 services, 10 servers, and 5000 connections, use the following command:

# echo 25 10 5000 > /proc/net/cluster/config

You can verify that the changes took effect by reading the file again:

# cat /proc/net/cluster/config

25 10 5000

Caution

If you write to this file, the SpeedLink module will be reset, causing all active connections to be dropped.
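If you script changes to this file, it may help to check the three values before writing them. The following Python sketch formats and parses the config line. The helper names are hypothetical, and the rule that each size must be a positive integer is our assumption rather than a documented limit.

```python
def format_config(services, servers, connections):
    """Build the line to write to /proc/net/cluster/config.

    Assumption: each table size must be a positive integer.
    """
    values = (services, servers, connections)
    for value in values:
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError("table sizes must be positive integers: %r" % (value,))
    return "%d %d %d" % values


def parse_config(text):
    """Parse the contents of the config file into a dictionary."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError("expected 3 values, got %d" % len(fields))
    services, servers, connections = (int(field) for field in fields)
    return {"services": services, "servers": servers, "connections": connections}


print(format_config(25, 10, 5000))  # prints "25 10 5000"
```

The sketch deliberately does not write to the file itself, since doing so resets the SpeedLink module.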

7.5.3.2. /proc/net/cluster/connections

The connections file contains a table of client/server pairings. This can be used to display current active connections, as well as persistent connections. Each connection is listed on a single line.

Each line in the file has the following format:

prot client:port cluster:port timeout node:port packets

prot

The protocol, either `tcp' or `udp'.

client:port

Source IP address and port number of the client system.

cluster:port

Virtual IP address of the cluster and the port number of the service.

timeout

Number of seconds until the connection times out.

node:port

Cluster node IP address and port number that the packet was forwarded to.

packets

Number of packets forwarded.

The following example shows an HTTP (port 80) connection from a client system at 1.2.3.4 connecting to a cluster with IP address 192.168.0.100. The packets are being forwarded to the cluster node at 192.168.0.4.

tcp 1.2.3.4:9645 192.168.0.100:80 98 192.168.0.4:80 113

Note that NAT connections will have two lines: one for the incoming connection and one for the connection between the ATM and the cluster node. This is a side-effect of the way that RFC 1631 specifies that NAT should be implemented. The connection between the ATM and the cluster node will show an address chosen from the NAT subnet as the source address.
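When you need to filter or count entries, the line format described above is easy to parse. The following Python sketch splits one line of the connections file into named fields. The names Connection, split_endpoint, and parse_connection are ours, and the sketch assumes the fields are separated by whitespace, as in the example line.

```python
from collections import namedtuple

Connection = namedtuple(
    "Connection",
    "prot client client_port cluster cluster_port timeout node node_port packets",
)


def split_endpoint(text):
    """Split 'address:port' into (address, port as an integer)."""
    address, _, port = text.rpartition(":")
    return address, int(port)


def parse_connection(line):
    """Parse one line of the connections file into a Connection tuple."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    prot, client, cluster, timeout, node, packets = fields
    client_addr, client_port = split_endpoint(client)
    cluster_addr, cluster_port = split_endpoint(cluster)
    node_addr, node_port = split_endpoint(node)
    return Connection(prot, client_addr, client_port, cluster_addr,
                      cluster_port, int(timeout), node_addr, node_port,
                      int(packets))


# The example line shown earlier in this section.
conn = parse_connection("tcp 1.2.3.4:9645 192.168.0.100:80 98 192.168.0.4:80 113")
print(conn.node, conn.packets)  # prints "192.168.0.4 113"
```

On an ATM you could apply parse_connection to each line read from /proc/net/cluster/connections, for example to count connections per cluster node.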

7.5.3.3. /proc/net/cluster/debug

The debug file controls whether additional debugging information is logged. Normally this will be set to 0, meaning that only the normal logging information will be output. If you set this to 1, additional information will be logged. To do this, issue the following command:

# echo 1 > /proc/net/cluster/debug

The additional logging information comes from the ip_cs SpeedLink kernel module, and is written to the /var/log/messages file. The additional information shows new connections to the virtual server and shows which node the traffic gets forwarded to. Activating these extra log messages can have a substantial impact on the performance of the ATM, so you should use it only when debugging problems with the cluster.

7.5.3.4. /proc/net/cluster/nat

The /proc/net/cluster/nat file contains the configuration settings associated with NAT forwarding. This is the same as the NAT Subnet setting in the configuration file and the turboclusteradmin tool, with one minor difference. While the configuration file uses an IP address and subnet mask, the nat file specifies the IP address and the number of bits in the subnet mask. So if your configuration file looks like this:

NAT

    Subnet 10.0.0.0 255.255.0.0

EndNAT

the nat file will look like this:

10.0.0.0 16

Note that the NAT Gateway setting does not appear in this file, because that setting is only used by the clusterserverd daemon. The kernel does not need to concern itself with the NAT Gateway.
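The relationship between the two notations can be checked with Python's standard ipaddress module. The sketch below converts an address and subnet mask, as written in the configuration file, into the address and bit count you should expect to see in the nat file. The function name nat_file_entry is ours.

```python
import ipaddress


def nat_file_entry(address, netmask):
    """Convert 'address netmask' into the 'address bits' form of the nat file."""
    network = ipaddress.ip_network("%s/%s" % (address, netmask))
    return "%s %d" % (network.network_address, network.prefixlen)


print(nat_file_entry("10.0.0.0", "255.255.0.0"))  # prints "10.0.0.0 16"
```

Note that ip_network raises a ValueError if the address has host bits set for the given mask, which can be a useful sanity check on the Subnet setting.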

7.5.3.5. /proc/net/cluster/servers

The servers file contains a line of information about each service running on each server node in the cluster. This is the same information that is contained in the `Servers' section of the configuration tool. Each line has the following format:

prot node:port cluster:port up weight method packets

prot

The protocol, either `tcp' or `udp'.

node:port

Cluster node IP address and port number for the service.

cluster:port

Virtual IP address of the cluster and the port number of the service.

up

Either `up' or `down' depending upon whether the server and service are running or not.

weight

A number indicating the weight of this server. A higher number means that the server will receive proportionally more traffic.

method

The forwarding method that is used on the server. Can also be `local' indicating that the node is also the primary ATM.

packets

Number of packets that have been forwarded to the service on this server.
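A common troubleshooting question is which node and service pairs are currently marked down. The following Python sketch parses the servers file format described above and lists them. The helper names and the sample lines are hypothetical. In particular, the method name `direct` is a placeholder, since the exact strings used for forwarding methods may differ, and the sketch assumes that the weight is an integer.

```python
def parse_server(line):
    """Parse one line of the servers file into a dictionary."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got %d" % len(fields))
    prot, node, cluster, state, weight, method, packets = fields
    return {
        "prot": prot,
        "node": node,            # node address and port, e.g. 192.168.0.4:80
        "cluster": cluster,      # virtual address and port
        "up": state == "up",
        "weight": int(weight),   # assumption: weight is an integer
        "method": method,
        "packets": int(packets),
    }


def down_servers(text):
    """Return the node:port entries that are marked down."""
    down = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        entry = parse_server(line)
        if not entry["up"]:
            down.append(entry["node"])
    return down


# Hypothetical file contents, for illustration only.
SAMPLE = """\
tcp 192.168.0.4:80 192.168.0.100:80 up 1 direct 113
tcp 192.168.0.5:80 192.168.0.100:80 down 1 direct 0
"""
print(down_servers(SAMPLE))  # prints "['192.168.0.5:80']"
```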

7.5.3.6. /proc/net/cluster/services

The services file contains the virtual IP addresses and port numbers for all of the services that the cluster handles.

Each line has the following format:

prot cluster:port up persistence packets

prot

The protocol, either `tcp' or `udp'.

cluster:port

Cluster virtual IP address and port number for the service.

up

Either `up' or `down'.

persistence

This will be 1 if the service is set to be persistent or "sticky". Otherwise it will be set to 0.

packets

Number of packets that have been forwarded to the service on the cluster.

7.5.3.7. /proc/net/cluster/stat

The stat file contains some statistics pertaining to the operation of the cluster. These numbers are updated in real time, allowing you to watch as the traffic manager directs packets to various nodes. Writing to this file has no effect.

The 6 values displayed are as follows:

  • Number of services configured

  • Number of server nodes currently in the cluster, times number of services each server handles. (Same as the number of lines in the servers file.)

  • Current number of active connections

  • Total number of packets received by the cluster

  • Number of dropped packets

  • Number of new connections
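Because the counters are cumulative, they are most useful when you compare two readings taken some time apart. The following Python sketch names the six values and computes the fraction of packets dropped between two snapshots. It assumes the six values are separated by whitespace; the field names, helper names, and sample numbers are ours.

```python
STAT_FIELDS = (
    "services",
    "server_entries",
    "active_connections",
    "packets_received",
    "packets_dropped",
    "new_connections",
)


def parse_stat(text):
    """Parse the contents of the stat file into a dictionary."""
    values = [int(value) for value in text.split()]
    if len(values) != len(STAT_FIELDS):
        raise ValueError("expected %d values, got %d" % (len(STAT_FIELDS), len(values)))
    return dict(zip(STAT_FIELDS, values))


def drop_fraction(before, after):
    """Dropped packets divided by received packets between two snapshots."""
    received = after["packets_received"] - before["packets_received"]
    dropped = after["packets_dropped"] - before["packets_dropped"]
    if received <= 0:
        return 0.0
    return dropped / received


# Hypothetical snapshots, for illustration only.
first = parse_stat("3 6 12 1000 5 40")
second = parse_stat("3 6 15 1500 10 60")
print(drop_fraction(first, second))
```

In this example, 500 packets arrived between the two readings and 5 were dropped, giving a fraction of about one percent.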

7.5.3.8. /proc/net/cluster/timeout

The timeout file allows the connection timeout to be changed. If a connection does not receive any traffic in the given amount of time, the connection will be assumed to be stale and will be closed. You can change the timeout value by writing a number to this file:

# echo 100 > /proc/net/cluster/timeout

However, any value written to this file will not be remembered if the cluster daemon is restarted. To permanently change the connection timeout value, change it in the `Advanced Traffic Manager Settings' menu in the cluster configuration tool.

7.5.4. Common Problems

In this section we will list several common problems and provide you with some hints that may help you resolve them. Be sure to also check the RELEASE.NOTES file. You can access the release notes and other documentation through the CMC home page or turboclusteradmin.

7.5.4.1. Synchronization Tools Fail

There are several requirements for the synchronization tools. The server receiving the content must be running sshd, the Secure Shell daemon. Any system that has had Turbolinux Cluster LoadBalancer 10 installed should have the sshd daemon installed and running. This will be more of an issue with tlclb_content_sync, because any system that is receiving configuration information will be running Turbolinux Cluster LoadBalancer 10 and should therefore have SSH installed.

One thing you may need to check is your /etc/hosts.allow files on all the cluster nodes. They will need to have incoming SSH traffic enabled. The following line will accomplish that:

sshd : ALL

You can also limit SSH connections to just the systems within the cluster or your LAN if you desire.

You can eliminate warning messages in the synchronization tools by removing any servers that do not have SSH from the list of servers to be synchronized. Just be sure to always synchronize their content by hand.

7.5.4.2. Verifying That the Cluster is Working

To verify that the cluster is working, you can simply monitor its activity by using CMC or looking at the /proc/net/cluster files directly. The connections and stat files will probably be the most helpful.

When generating traffic to test the cluster, always make client connections from systems that do not reside in the cluster. If you try to connect to a clustered service from a node within the cluster, it will not go through the ATM. This is because the traffic is being sent to the virtual IP address of the cluster, but we have convinced the cluster node to accept traffic being sent to that address. Since the traffic is not going through the ATM, it is not subject to the forwarding procedures that make the cluster work.

To verify that the cluster will properly handle an ATM or cluster node going down, simply take that system off-line. The easiest way to do this is to remove the network cable from the system. It is a good idea to test the reliability features of your cluster as soon as you get it configured the way you want it. You don't want to find out that the cluster is misconfigured when something really does go wrong with a system.

If you are testing by disabling a cluster node, you should see in the ATM's log file that the pings and ASAs have failed and that the system is being taken out of the cluster. Any open connections that were made with that system will be dropped. Service should otherwise continue as usual.

If you have disabled the primary ATM, the backup ATMs should notice this and elect one to be promoted to primary ATM. Within several seconds, normal service should have resumed. Any connections that were active when the ATM went down may be lost, but new connections should be initiated without any problems.

7.5.4.3. Determining Which ATM is the Primary

The first system that comes up and is listed in the list of ATMs for the cluster becomes the primary ATM. All the other systems listed as ATMs will become backup ATMs. So if you want a particular system to be the primary ATM, make sure it is the first ATM to have the clusterserverd daemon brought up.

Note that if the original primary ATM goes down and comes back up, it will not be promoted to primary ATM. The current primary ATM will always remain as primary unless it goes down.

The best method of determining which system is the primary ATM is probably to use CMC on the virtual IP address. This will always end up connecting to the primary ATM. The name of the ATM will be printed directly below the row of icons.

You can look at the log files on each system to determine what role the system has taken. Another way to determine if a given system is the primary ATM is to look at the output of ifconfig. If the network alias (:cs0) is created on a real network interface (such as eth0) then the system is the primary ATM. Other systems will have the alias on the loopback or tunnel interface.
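The alias rule above can be expressed as a short Python sketch. It takes a list of interface names (which you would gather yourself, for example from the output of ifconfig) and reports what the :cs0 alias implies. The function name and the wording of the results are ours, and the sketch assumes that tunnel interface names begin with tunl.

```python
def role_from_interfaces(interface_names):
    """Guess a system's role from the interface carrying the :cs0 alias."""
    for name in interface_names:
        base, _, alias = name.partition(":")
        if alias != "cs0":
            continue  # not a Cluster Server alias
        if base == "lo":
            return "direct forwarding node"
        if base.startswith("tunl"):
            return "tunneled node"
        return "primary ATM"  # alias is on a real interface such as eth0
    # No :cs0 alias, for example a NAT node or a backup ATM.
    return "no cluster alias found"


print(role_from_interfaces(["lo", "eth0", "eth0:cs0"]))  # prints "primary ATM"
```

A result of "no cluster alias found" is not conclusive on its own, so check the log files on that system as well.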

7.5.4.4. Cluster Generates a Lot of Extra Traffic

If your NAT settings have been misconfigured, you may notice a large amount of extra traffic on the network. Double-check to see that your NAT settings are correct. If all your NAT systems are working properly, there should not be any spurious traffic.