First off, I want to say that I was relieved by the news that there were no injuries related to the data center fire that occurred in Houston on Saturday afternoon. The fire broke out in the power room of a popular web-hosting company called The Planet.
An official statement from CEO Doug Erwin was released on The Planet's forum approximately 6 hours later:
This evening at 4:55pm CDT in our H1 data center, electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding our electrical equipment room. Thankfully, no one was injured. In addition, no customer servers were damaged or lost.
We have just been allowed into the building to physically inspect the damage. Early indications are that the short was in a high-volume wire conduit. We were not allowed to activate our backup generator plan based on instructions from the fire department.
This is a significant outage, impacting approximately 9,000 servers and 7,500 customers. All members of our support team are in, and all vendors who supply us with data center equipment are on site. Our initial assessment, although early, points to being able to have some service restored by mid-afternoon on Sunday. Rest assured we are working around the clock.
We are in the process of communicating with all affected customers. We are planning to post updates every hour via our forum and in our customer portal. Our interactive voice response system is updating customers as well.
There is no impact in any of our other five data centers.
I am sorry that this accident has occurred and I apologize for the impact.
The knee-jerk reaction of customers to an event like this is disappointment, and I'm certain many were angered by the outage. Our company is a loyal customer of The Planet, and I couldn't help but wonder whether power would be restored to the data center fast enough.
I first heard about the outage when I received a call from one of my staff in the early afternoon yesterday. The first thing I did was visit The Planet's website. There was no announcement on the site or the blog, but a window invited me to a live help session, and I accepted the invitation. Within minutes, the live help agent responded with the statement quoted above. Although the agent did not give any specific answers to my ETA questions, the person handled my questions calmly and professionally, and pointed me to their support forum.
While I found the support forum to be an excellent update tool, it was extremely slow to load (at one point over 2,500 people were on the page). It got worse when Slashdot reported the incident and began routing its readers to the forum. At one point the forum went down, but it came back online quickly. My worry eased a little when one of The Planet's staff posted a comment explaining that the overload was caused by Slashdot readers and was unrelated to the outage affecting the Houston (H1) data center.
Later, I contacted support by telephone in hopes of getting an ETA, and was greeted by a recorded message describing the problem with the H1 data center. I was then routed to technical support and spoke to a gentleman who sounded slightly worn from taking numerous calls from irate customers. I asked some specific questions about contingency plans, and when it became clear that I wasn't going to get an ETA, I thanked him and ended the call.
This entire event left quite an impression on me. Although numerous blogs, message boards and news items were discussing the situation, the general consensus among many people (both customers and non-customers) was that The Planet was doing a superb job of updating its customers.
I agree with this assessment, and believe that much of the reputation management effort was held together by the online forum updates (rather than overloading their phone system), timely announcements, calm and courteous support people both online and over the phone, and phone messages in case you were out of the online loop. Keep in mind that this was a server farm, and the claim at the time of the incident was that the outage affected roughly 9,000 servers (some 7,500 customers). Multiply this by the fact that many of those servers were maintained by resellers who host 20 to 100 times as many sites from each server, and you start to get a sense of the magnitude of this outage.
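To put rough numbers on that, here is a back-of-the-envelope sketch. The 9,000-server figure comes from The Planet's statement; the share of servers run by resellers is not something I know, so the values below are pure assumptions, and I am reading "20 to 100" as sites per reseller server. The function name is just for illustration.

```python
def estimate_sites(servers, reseller_share, sites_per_server):
    """Rough count of sites hosted on reseller-run servers.

    reseller_share is an assumed fraction (0.0 to 1.0), not a reported figure.
    """
    reseller_servers = round(servers * reseller_share)
    return reseller_servers * sites_per_server


SERVERS = 9000  # from The Planet's statement

# Hypothetical reseller shares; the real share was never published.
for share in (0.25, 0.5, 1.0):
    low = estimate_sites(SERVERS, share, 20)
    high = estimate_sites(SERVERS, share, 100)
    print(f"reseller share {share:.0%}: {low:,} to {high:,} sites")
```

Even under the most conservative of these guesses, the number of affected sites runs well past the 7,500 customers quoted in the statement.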
And there certainly was enough online evidence that a fair share of people were not at all happy about the situation. While I agreed with many of these bloggers' points, I'm of the opinion that The Planet rallied hard, with constant effort, to earn significant reputation management points over the weekend. That was no easy feat, given the catastrophic loss of power to the H1 data center, the duty to work within the constraints set by the fire department, and the need to bring in outside vendors who were asked to work around the clock on their days off.
An uphill battle perhaps, but The Planet was back online in time for me to share my account of the experience with you this morning. In my mind, the company deserves top marks for its efforts to recover from the outage, and some recognition for paying attention to the reputation risk that less-than-sympathetic customers could have created.