This page provides a light introduction to the data you should gather and the places to look when you experience a problem. When your application server stops responding, gather data as quickly as you can. Once you have gathered the necessary data, you may be able to work around the immediate problem by restarting your application server. Even if restarting helps, treat the fix as temporary and expect the problem to return.
You get a "hung" application server when the rate of processing requests is slower than the rate at which they are coming in. This is usually caused by a resource constraint - something that your code depends on is either responding too slowly to keep up or isn't responding at all.
When troubleshooting problems with an application server, I generally follow these four steps:
Thread dump from the Java virtual machine (kill -3 PID).
This is first because it is very important. Obtain at least 3 thread dumps with a pause of about 15 seconds between each one. Running kill -3 will not terminate your Java virtual machine. It just sends a signal to the JVM indicating that it should dump the thread state. (The exception is a few patch levels of 1.3.1 with a bug that might crash the JVM during a thread dump, but even that is rare.) For most JVMs, you will find the thread dump in the standard out log. The IBM JVM instead creates a javacoreTIMESTAMP.txt file in the directory from which Java was started.
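The dump sequence described above can be scripted so that nobody has to watch a clock during an outage. This is only a minimal sketch: the function name take_thread_dumps and its arguments are hypothetical, and the defaults (3 dumps, 15 seconds apart) simply mirror the advice in the text.

```shell
# take_thread_dumps PID [COUNT] [INTERVAL_SECONDS]
# Hypothetical helper: sends signal 3 (SIGQUIT) to a JVM several times.
# The JVM writes each dump to its own standard out log (or a javacore
# file on the IBM JVM); this function only sends the signals.
take_thread_dumps() {
    pid=$1
    count=${2:-3}        # default from the text: at least 3 dumps
    interval=${3:-15}    # default from the text: about 15 seconds apart
    i=1
    while [ "$i" -le "$count" ]; do
        if ! kill -3 "$pid"; then
            echo "could not signal process $pid" >&2
            return 1
        fi
        echo "thread dump $i of $count requested from PID $pid at $(date '+%H:%M:%S')"
        # Pause between dumps, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}
```

Calling `take_thread_dumps 12345` would request three dumps from process 12345 (an example PID) over roughly 30 seconds.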
Network connection data (netstat -an >/tmp/netstat.out)
This will print out the state of all network connections coming into or going out of your Java application server. It is useful when looking for a buildup in connections to services that you depend on. It can also be used to look for other issues, such as a buildup of connections in CLOSE_WAIT or TIME_WAIT. It helps to have the same output from a time when the application was performing well, for comparison. Make sure that baseline comes from a high-volume period.
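One quick way to spot a buildup is to tally the saved netstat output by connection state. The sketch below assumes the TCP state is the last field on each line, which holds for the common Linux and Solaris `netstat -an` layouts; the function name count_tcp_states is made up for this example.

```shell
# count_tcp_states: read `netstat -an` output on stdin and print
# "COUNT STATE" lines, most frequent state first.
# Assumption: the TCP state is the last whitespace-separated field.
count_tcp_states() {
    awk '
        $NF ~ /^(LISTEN|ESTABLISHED|TIME_WAIT|CLOSE_WAIT|CLOSING|LAST_ACK|SYN_[A-Z]+|FIN_WAIT_?[12])$/ {
            n[$NF]++
        }
        END { for (s in n) print n[s], s }
    ' | sort -k1,1nr -k2,2
}
```

Run it as `count_tcp_states </tmp/netstat.out`, then compare the result with the same tally taken while the application was healthy.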
View CPU utilization on all hosts in question
$ vmstat 3
The first line of vmstat output is often an average of those figures since the server was booted, so don't let it throw you off. Take a quick glance at the lines after the first to see if you are running low on CPU resources. Look at both the run queue (usually the 'r' column on the far left) and the idle CPU column (usually near the far right). If your run queue is much higher than the total number of CPUs in the box, or if idle CPU is hitting 0 frequently, you might be running into a bottleneck on CPU resources.
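If you saved the vmstat output to a file, the same check can be done mechanically. The following is a rough sketch, not a tuned monitor: flag_cpu_pressure is a hypothetical name, the CPU count is passed in by hand, and the threshold is deliberately crude (any sample with a run queue above the CPU count, or with idle at 0). It finds the idle column by its "id" header because the column position differs between platforms.

```shell
# flag_cpu_pressure NCPU: read vmstat output on stdin and print the
# samples that suggest CPU saturation.
flag_cpu_pressure() {
    awk -v ncpu="$1" '
        # Column header line: remember where the idle ("id") column is.
        $1 == "r" {
            for (i = 1; i <= NF; i++) if ($i == "id") idcol = i
            next
        }
        # Skip anything that is not a data line, or that precedes the header.
        $1 !~ /^[0-9]+$/ || !idcol { next }
        # The first sample is the since-boot average, so ignore it.
        ++sample == 1 { next }
        $1 > ncpu || $idcol == 0 {
            printf "sample %d: run queue %d, idle %d%%\n", sample, $1, $idcol
        }
    '
}
```

For example, `vmstat 3 20 | flag_cpu_pressure 4` would check twenty samples on a four-CPU host.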
Because BEA Systems Inc. has provided a great tutorial on how to gather data related to a high CPU condition, I'm not going to duplicate it here. They wrote this with WebLogic in mind, but the process really is the same for all application servers.
The important thing to keep in mind is that we are only gathering the data at this point. Don't do the hex conversions and thread dump analysis just yet; you may want to move on and try to work around the problem before spending time analyzing the results.
File descriptor and connection information
$ pfiles PID >/tmp/pfiles.out
Because every network connection uses a file descriptor (Windows refers to them as file handles), this information may prove very useful when trying to determine what your threads are waiting on if they are backed up in RMI or on a socketRead. An unusually high number of connections to a remote service may indicate that the service in question is running slowly or not responding at all. Be careful not to get distracted by a pooled resource (like JDBC) that keeps connections open even when not in use. You can, however, check whether the number of connections for a pooled resource is appropriate for its configuration.
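To see which remote services hold the most connections, the saved pfiles output can be tallied by peer. This sketch assumes the Solaris pfiles layout, in which each connected socket has a line of the form `peername: AF_INET 10.0.0.5  port: 1521`; other platforms or versions may print something different, and count_peers is a hypothetical name.

```shell
# count_peers: read pfiles output on stdin and print "COUNT ADDRESS:PORT"
# for every remote peer, busiest peer first.
# Assumption: peer lines look like "peername: AF_INET <address>  port: <port>".
count_peers() {
    awk '
        /peername:/ { n[$3 ":" $5]++ }
        END { for (p in n) print n[p], p }
    ' | sort -k1,1nr -k2,2
}
```

Run it as `count_peers </tmp/pfiles.out`. A pooled resource such as JDBC will naturally show up with many connections, so compare the count against the pool's configured size before drawing conclusions.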
Back up all logs to a directory where the data can be preserved for analysis.
I suggest creating a directory named with a date and time stamp and then copying all logs and data into it for later analysis. This will prevent the data from being lost if your application server has automatic log purging configured. It will also make troubleshooting easier because the logs end at the time of the event, so you won't have to filter out much data from after the event.
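A small helper makes this step quick to repeat. The sketch below is one possible shape, with a hypothetical function name and an "outage-" directory prefix chosen only for illustration; it copies whatever files or directories you name into a freshly created, timestamped directory.

```shell
# preserve_logs DEST_ROOT FILE_OR_DIR...
# Creates DEST_ROOT/outage-YYYYMMDD-HHMMSS, copies the given items into it,
# and prints the path of the new directory.
preserve_logs() {
    dest_root=$1
    shift
    dest="$dest_root/outage-$(date '+%Y%m%d-%H%M%S')"
    mkdir -p "$dest" || return 1
    # -R copies directories too; -p keeps the original timestamps, which
    # helps when lining up log entries with the time of the event.
    cp -Rp "$@" "$dest"/ || return 1
    echo "$dest"
}
```

For example, `preserve_logs /var/tmp /tmp/netstat.out /tmp/pfiles.out` would save the two data files gathered earlier; add your server's log directory to the argument list as appropriate.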
Obtain the usual data on your database (CPU, I/O, memory use, etc.) and look for (or ask your DBA for) data related to long-running queries for later analysis.
Start with my introduction to analyzing thread dumps. When analyzing a thread dump, the first thing to do is a quick glance through all of the threads to see if you can identify a pattern. If your application server is out of threads, you'll often see that many threads are stuck doing exactly the same thing, and this is the cause of the apparent hang of your application server.
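The "quick glance" for a pattern can be approximated by counting how many threads share the same top stack frame. This is a rough sketch that assumes the common HotSpot dump layout, where each thread begins with a line starting with a double quote and each stack frame line begins with "at"; top_frames is a hypothetical name, and other JVMs may format dumps differently.

```shell
# top_frames: read one thread dump on stdin and print "COUNT FRAME" for the
# topmost stack frame of every thread, most common frame first.
top_frames() {
    awk '
        # A thread header line starts with a double-quoted thread name.
        /^"/ { want = 1; next }
        # The first "at ..." line after a header is that thread top frame.
        want && /^[ \t]*at / {
            sub(/^[ \t]*at /, "")
            n[$0]++
            want = 0
        }
        END { for (f in n) print n[f], f }
    ' | sort -k1,1nr -k2
}
```

If most of your execute threads report the same frame, such as a socket read, that frame is a good place to start reading the full stacks.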
If you can't find a pattern, I suggest switching to the verbose GC logs and looking to see if there was a serious issue with memory. When looking at the verbose GC data, I usually look for these things:
Ok, so "frequent major collections" is a bit vague. In general, what you're looking for is a low memory condition. In the JVM, a low memory condition is terrible - it often causes the process to spend more time doing collections than doing real work.
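One way to make "low memory condition" concrete is to check how full the heap still is right after each full collection. The sketch below assumes the classic HotSpot -verbose:gc line format, for example `[Full GC 267628K->83769K(776768K), 1.8479984 secs]`; verbose GC output varies widely between JVMs and versions, so treat the parsing as an assumption. The function name and the default 90% limit are arbitrary choices for illustration.

```shell
# full_gc_pressure [LIMIT_PERCENT]: read verbose GC output on stdin and report
# full collections that left the heap at or above LIMIT_PERCENT (default 90).
# Assumption: lines look like "[Full GC <before>K-><after>K(<total>K), <t> secs]".
full_gc_pressure() {
    awk -v limit="${1:-90}" '
        /Full GC/ {
            line = $0
            sub(/.*->/, "", line)          # keep "<after>K(<total>K), ..."
            after = line + 0               # leading number: heap used after GC
            sub(/^[0-9]+K\(/, "", line)    # keep "<total>K), ..."
            total = line + 0               # leading number: heap capacity
            if (total > 0) {
                fulls++
                pct = int(after * 100 / total)
                if (pct >= limit) {
                    high++
                    printf "line %d: heap %d%% full after a full collection\n", NR, pct
                }
            }
        }
        END {
            printf "%d full collections, %d left the heap at or above %d%%\n", fulls, high, limit
        }
    '
}
```

Full collections that repeatedly leave the heap nearly full suggest the JVM is reclaiming very little each time, which matches the low memory condition described above.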
If you can't find a pattern in the thread dump and the verbose GC data seems fine, you'll want to start looking at the remaining data. Note that when analyzing the application server logs, it's useful to compare errors found during the outage to logs found while the site was fine. This might keep you from getting distracted by issues that are unrelated to your outage.