Preventing a SysAdmin’s Worst Nightmare
By Stephen Gracon, Zerto Pre-Sales Engineer
This one may be a hoax, but this did in fact happen to one of my former employers (a company no longer around) while I was working there.
Back in the day I was a Solaris Administrator responsible for managing a main data center in Portland and 41 call centers spread across the US. Each call center had a server room with a pair of Sun E3500s for the core operator application and support systems such as IVR, ACD, and various others. As with most remote sites, there were no technical staff on site other than what we nicknamed the VOKAD (Voice Operated Keyboard Assistant Device, i.e. me telling whoever picked up the phone what to type and click).
The Chief Software Architect decided one day to develop a simple script that would delete all log files older than 180 days. Because the script was so simple, he also decided to bypass standard development practices (QA and staging) as well as Change Control, and pushed the script out to each call center manually. The script worked when he ran it under his own login against a test directory, but he failed to account for the fact that in production it would run from the crontab of the root superuser rather than as himself. As those familiar with Solaris back in the day will recall, root’s home directory was not “/root” as on Linux but the top-level directory, “/”.
Did I mention the Architect decided to push out his script on a Friday?
I still remember getting the phone call at 4:23 a.m. from my co-worker saying “guess what, your turn,” as he and a team of others had already been up half the night trying to fix all 41 call centers that had mysteriously fallen offline with no way to log in to the core servers. After three days of a “company down” situation and hemorrhaging revenue, it was discovered that:
- The simple script didn’t check its working directory and happily ran from the home directory of the crontab’s user, root, which in this case was “/”. It also used the wildcard “*” instead of something more limiting such as “*.log”. This was thought irrelevant because the test directory contained only log files, so there seemed no point in wasting four characters.
- Solaris is a very resilient operating system, even after someone forcefully deletes (rm -rf) every file under “/” that is older than 180 days (and most of the servers had been built more than a year earlier).
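The original script is not shown here, so the following is not a reconstruction of it. It is only a minimal Python sketch of the guard rails whose absence is described above: require an explicit absolute directory instead of trusting the working directory, refuse to run against “/”, match only “*.log”, and default to a dry run. The function name delete_old_logs and its parameters are hypothetical.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def delete_old_logs(log_dir, max_age_days=180, pattern="*.log", dry_run=True):
    """Delete files matching `pattern` directly inside `log_dir` that are
    older than `max_age_days`. Returns the list of matched files.

    Hypothetical helper for illustration; with dry_run=True (the default)
    nothing is deleted.
    """
    directory = Path(log_dir)

    # Never rely on the current working directory (cron starts jobs in
    # the user's home directory, which for root may be "/").
    if not directory.is_absolute():
        raise ValueError("log_dir must be an absolute path")

    directory = directory.resolve()

    # Refuse to operate on the filesystem root itself.
    if directory == Path(directory.anchor):
        raise ValueError("refusing to clean the filesystem root")

    if not directory.is_dir():
        raise ValueError(f"{directory} is not a directory")

    cutoff = time.time() - max_age_days * SECONDS_PER_DAY

    # Non-recursive glob, limited to the pattern; skip symlinks and
    # anything that is not a regular file.
    matches = sorted(
        p for p in directory.glob(pattern)
        if not p.is_symlink() and p.is_file() and p.stat().st_mtime < cutoff
    )

    if not dry_run:
        for path in matches:
            path.unlink()

    return matches
```

Calling delete_old_logs("/var/log/myapp") (a made-up path) would only report what it would remove; passing dry_run=False performs the deletion. None of this replaces QA, staging, or change control, but any one of these checks would have stopped a run that started in “/”.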
We were finally able to create a bootstrap environment on a development server at each location and use the Solaris JumpStart capability to bring the systems back to life. Had these been virtual machines protected by Zerto, what took an entire team a full weekend to fix would have taken one admin a few clicks and a few minutes.
Not to dwell on the past, but being woken by a cell phone call at 4:00 in the morning on a Saturday and working the whole weekend as a salaried employee (so no overtime) was frankly not fun, and it makes me very thankful that Zerto exists.