Disclaimer: I know that this is an old idiom; I’m just presenting my own real-life incident, taken straight from the bloody Java trenches.
Exceptions can be thread assassins
When running on top of the WebSphere thread pool, any RuntimeException that isn’t caught by the application code will bubble up the stack and end up killing that specific thread. WAS helps here by automatically creating a new thread to take the place of the murdered one, but still, killing a thread and immediately creating another is anything but the thread pool rationale.
Hiring a thread bodyguard
A simple way to avoid thread death is wrapping the first application layer (e.g., the run() method) with a try block that catches and swallows any Exception thrown from anywhere in the application code.
Our project’s code also used this concept, but instead of catch (Exception e), it had a catch (Throwable t). When I noticed that, I didn’t rush to fix it, just in case someone before me had done funky stuff with dynamic class loading that might throw NoClassDefFoundError (although this should be caught at a very localized resolution), or maybe it was there for some other historical reason that, not being one of the code’s forefathers, I’m just not aware of. In any case, I did promise myself that I’d revisit this piece of code in the future.
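As a minimal sketch of such a bodyguard (the class name GuardedTask and the use of java.util.logging are my own assumptions, not the project's actual code), the outermost layer can be a Runnable that wraps the real work:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Wraps a task so that no Exception can escape and kill the pooled thread. */
final class GuardedTask implements Runnable {
    private static final Logger LOG = Logger.getLogger(GuardedTask.class.getName());

    private final Runnable delegate; // the real application work

    GuardedTask(Runnable delegate) {
        this.delegate = delegate;
    }

    @Override
    public void run() {
        try {
            delegate.run();
        } catch (Exception e) {
            // Log and swallow: the worker thread survives and goes back to the pool.
            LOG.log(Level.WARNING, "Task failed, thread kept alive", e);
        }
    }
}
```

Logging before swallowing matters here: a guard that swallows silently hides the very bugs that were killing the threads.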
Getting some bulls to do correct things
Today I finally got the excuse I needed in order to change the catch Throwable into a catch Exception:
We were running stress tests when the server hit an OOME (OutOfMemoryError). Since the catch Throwable caught and swallowed the OOME (OutOfMemoryError is a subclass of Error, which is a subclass of Throwable), the thread that generated the OOME kept on living instead of dying right there, and so the JVM continued running, crippled and limping, instead of turning to an honorable solution like hara-kiri. Choosing the quick death route would have been rewarded with a quick resurrection, provided by the gracious NodeAgent and its watchdog mechanism, and the end result would have been a newly born, healthy server ready to get back in business. A retreat in order to attack, you might put it.
Instead, the server had to limp for long minutes, suffering from a series of consecutive strokes (OOMEs), until things got so bad that the JVM just had to exit.
The catch Throwable was causing downtime by preventing an immediate restart of the JVM after an OOME.
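To make the difference concrete, here is a small sketch that compares the two guards. The class and method names are hypothetical, and the OutOfMemoryError is simply thrown by hand, so no memory is actually exhausted:

```java
/** Shows which guard swallows an Error; the OutOfMemoryError here is simulated. */
final class GuardComparison {

    /** Mirrors the old code: always returns normally, because every Throwable is swallowed. */
    static boolean runCatchingThrowable(Runnable task) {
        try {
            task.run();
        } catch (Throwable t) {
            // An OutOfMemoryError lands here too, so the caller never hears about it.
        }
        return true;
    }

    /** Mirrors the changed code: Exceptions are swallowed, Errors propagate to the caller. */
    static boolean runCatchingException(Runnable task) {
        try {
            task.run();
        } catch (Exception e) {
            // Ordinary failures are still absorbed.
        }
        return true;
    }

    /** A stand-in for a task that runs out of heap. */
    static Runnable simulatedOome() {
        return new Runnable() {
            @Override
            public void run() {
                throw new OutOfMemoryError("simulated");
            }
        };
    }
}
```

Calling runCatchingThrowable(simulatedOome()) returns as if nothing happened, while runCatchingException(simulatedOome()) lets the OutOfMemoryError escape to whoever called it.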
- I know that an uncaught exception kills only the specific thread; does the JVM treat an error differently? Put in other words, if the OOME is not caught, will the entire JVM die or only the specific thread? I assume that the answer is the entire JVM. The Java language itself doesn’t do this (an uncaught Error ends only the thread that threw it, just like an uncaught exception), so it would have to be implemented somewhere in the WAS bedrock. If for some reason that’s not the case, one could catch Error and then execute System.exit(1); in order to hasten the process’s imminent death.
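A hedged sketch of that last idea follows. FailFastTask and its onFatal hook are names I made up for illustration; the default fatal action is the System.exit(1) mentioned above, and it is injectable so the behavior can be exercised without actually killing the process:

```java
/** Guard that absorbs Exceptions but treats any Error as fatal for the whole process. */
final class FailFastTask implements Runnable {
    private final Runnable delegate; // the real application work
    private final Runnable onFatal;  // what to do when an Error shows up

    /** Default behavior: exit with status 1 so an external watchdog can restart the server. */
    FailFastTask(Runnable delegate) {
        this(delegate, new Runnable() {
            @Override
            public void run() {
                System.exit(1);
            }
        });
    }

    /** The fatal action is injectable so it can be replaced, for example in tests. */
    FailFastTask(Runnable delegate, Runnable onFatal) {
        this.delegate = delegate;
        this.onFatal = onFatal;
    }

    @Override
    public void run() {
        try {
            delegate.run();
        } catch (Exception e) {
            // Ordinary failures: report and keep the thread alive.
            System.err.println("Task failed, thread kept alive: " + e);
        } catch (Error err) {
            onFatal.run();
            throw err; // only reached if onFatal did not end the process
        }
    }
}
```

One caveat: System.exit runs shutdown hooks, and on a JVM that is already out of memory those hooks may themselves struggle, so whether this exits promptly in practice is something to verify under load.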
Last night we had a scheduled power shutdown in the Dev lab. This morning, the lab manager got up early to get all the servers running before the armies of developers arrived at the office. My work includes interacting with a Lotus Sametime server (IBM’s Instant Messaging (IM) server), so I run my own private IM server. Starting the day’s work, I quickly noticed that my IM client failed to log in to the IM server. In fact, none of the developers could log in to their own private IM servers.
Since the IM server validates the client-supplied login credentials against a central LDAP server during the client login process, the LDAP server became an immediate suspect.
The LDAP server we’re using is an old IBM ITDS LDAP server running on Win2000. It’s made up of two processes: the ITDS process, which parses and executes LDAP queries, and a DB process (DB2), which takes care of data persistence. Both processes are registered as Windows services.
The investigation commenced! Maybe the LDAP server is down? I went ahead and checked the status of the ITDS and DB2 services; both were running. Hmm… I moved on to inspect the ITDS log, and saw that during its startup stage it had failed to create a connection to DB2, and therefore resorted to starting in a “crippled” configuration-only mode. That means that it just sits there, wasting CPU cycles, giving the illusion that it’s there to provide service, but not actually answering any queries.
To remedy the situation I simply restarted the ITDS service. It started up normally and began servicing incoming LDAP queries from the IM servers.
At this point, you’ll be tempted to announce worldwide: “I fixed it!”, but before you do that, stop and think about it for a minute; what is it exactly that you fixed? Indeed, the ITDS began servicing queries, and the clients can log in, meaning that the current manifestation of the problem was eliminated, but did you fix the problem itself? Part of being pro-active means that you solve future problems before they actually occur. What stops the problem from recurring the next time someone decides to restart the server? In order to solve it for good, you first need to understand what caused the problem.
So, what happens during a server startup? The startup type of the ITDS and DB2 services is set to Automatic, thus they start when the OS starts. The DB connection error message fits a scenario in which the ITDS service started and tried to connect to DB2 before the DB2 service had finished starting up.
We would like to instruct the ITDS service to be less hasty, and wait for the DB2 service to finish starting before stepping into its own startup process. Educating it can be achieved by defining a service dependency, stating that the ITDS service depends on the DB2 service.
Implementing it: dependencies can’t be created using the Windows MMC “Computer Management” snap-in GUI, so you’ll have to get your hands dirty with registry mud using the following procedure.
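As a rough sketch of the idea, not the exact registry steps: Windows records service dependencies in the DependOnService multi-string value under HKLM\SYSTEM\CurrentControlSet\Services\<service name>, and the sc config command can write that value instead of editing the registry by hand. The helper below only builds that command line. The class name and the service short names "ITDS" and "DB2" are assumptions; the real short names on a given machine have to be looked up in the service properties. Also note that sc.exe may need to be installed separately on older Windows versions such as Windows 2000.

```java
import java.util.Arrays;
import java.util.List;

/** Builds (but does not run) an sc config command that declares a service dependency. */
final class ServiceDependencyCommand {

    /**
     * @param service   short name of the service that must wait (hypothetical example: "ITDS")
     * @param dependsOn short name of the service it waits for (hypothetical example: "DB2")
     */
    static List<String> build(String service, String dependsOn) {
        // sc accepts "depend=" and its value as two separate tokens.
        return Arrays.asList("sc", "config", service, "depend=", dependsOn);
    }
}
```

The resulting list can be handed to new ProcessBuilder(...) on the Windows host from an administrator account, or the same words can simply be typed at a command prompt. As far as I know, depend= replaces the service's existing dependency list rather than appending to it, so check the current dependencies first.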
Problem uprooted! You’ve earned your pro-activity badge.