Wednesday, April 21, 2010

Intermedia Outage

Intermedia had another outage last week. I finally received their RFO (Reason for Outage), written by Jonathan McCormick, Intermedia Chief Operating Officer:

I personally and on behalf of Intermedia apologize for the April 16 and 17, 2010 service outage you experienced. This letter is a follow-up to the information you received from Intermedia CEO Serguei Sofinski. As part of our commitment to transparency, it addresses the following items:
• Detailed Reason for Outage (RFO)
• Service Credit
• Corrective Action Plans
• Client Communication

Detailed Reason for Outage (RFO):
At approximately 6:15 a.m. PT on Thursday 4/16, a hardware failure occurred on one of the EMC storage area networks (SANs) located in Intermedia’s New Jersey data center. The service processor for one of the controller nodes had a failure. This failure caused the entire load for that SAN to be shifted to the service processor on the redundant controller node.
The spare capacity on the single service processor was not enough to handle the entire load of all systems connected to the SAN, which caused a degradation of performance for the reading and writing of data to the SAN. The degradation of performance on the SAN in turn impacted the overall system’s ability to process email messages, creating a queue of several hundred thousand messages within the system. The backlog was large enough that it took 32 hours to clear after the original event. At approximately 2 p.m. PT on Friday 4/17, all systems were functioning normally and mail delivery was considered to be “real-time.”
Service Credit:
In accordance with the terms of your SLA, a service credit for the above time period will be proactively applied to your account balance by the close of business on Friday 4/23.
Corrective Actions:
• Our SAN vendor analyzed the system logs for the event. The vendor determined that the service processor failure occurred due to a unique bug in the specific version of firmware on the system. This bug caused the service processor to “panic” and automatically take itself offline. As the first corrective action, on Friday 4/17 at 11 p.m. PT, our vendor performed an emergency upgrade to the version of firmware running on the SAN. This newer version of firmware has a fix for the bug that caused the failure we experienced.
• Since the outage, as the second corrective action, we have added processing capacity to the SMTP hub farm in this domain. We have also tuned the SMTP hubs to guarantee that they are able to more rapidly process a larger than normal queue of messages.
• Over the next several weeks, we will be taking additional corrective actions to make certain that there is enough spare capacity on the SAN to guarantee that it performs without performance degradation in the case of a single hardware failure. An additional SAN is being installed this week and starting as early as this weekend we will begin to migrate a portion of the existing systems to the new SAN. Additionally, we have engaged our SAN vendor to review the performance tuning of our SAN and implement adjustments to increase its overall performance capabilities. These events in tandem will guarantee that the SAN will be able to perform without an impact to the service in the event we experience another individual hardware error.
Client Communication:
We have received significant constructive feedback regarding our communication throughout the outage. We recognize the importance of proactive communication of timely, detailed information that clearly explains the current impact on your service.
Intermedia recognizes the fact that our current client notification tools and processes are more reactive than proactive and that they do not function well in an outage situation. For this reason, we have developed a new client notification tool that will be used by the Technical Support organization to proactively notify and communicate with clients during a service interruption. The new notification tool will be released at the end of April and will be put into operation during the month of May.
This new notification tool will equip the Technical Support organization with the ability to rapidly create a list of affected accounts and instantly generate an appropriate message to be sent to the account contacts of an affected account via both email and SMS (text messaging).
We will notify you when the notification tool has been implemented, as your account contacts will need to update their information with an SMS address to receive notifications.

I want to assure you that we recognize the importance of business communications and the negative effect it has on your business when the service is not available. Your feedback is always appreciated. We welcome your feedback regarding our service at Feedback@intermedia.net. This distribution list is monitored by the entire Intermedia management team.
Sincerely,
Jonathan McCormick
Intermedia Chief Operating Officer
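
A side note on the capacity point in the RFO. The letter describes two separate problems: the surviving service processor could not absorb the full load, and the resulting mail queue took 32 hours to clear. Both come down to simple arithmetic, and the rough sketch below shows the kind of check I have in mind. The function names and every number in it are my own illustrative assumptions, not figures from Intermedia or EMC.

```python
# Rough capacity sketch. All names and numbers are hypothetical
# illustrations, not data from the Intermedia RFO.

def survives_single_failure(loads, capacity_per_node):
    """True if the total load still fits after one node is lost.

    loads: current load on each node, in the same unit as capacity_per_node.
    """
    nodes = len(loads)
    if nodes < 2:
        return False  # no redundant node to fail over to
    surviving_capacity = (nodes - 1) * capacity_per_node
    return sum(loads) <= surviving_capacity


def drain_hours(backlog, arrival_per_hour, service_per_hour):
    """Hours to clear a backlog while new messages keep arriving."""
    net_rate = service_per_hour - arrival_per_hour
    if net_rate <= 0:
        return float("inf")  # the queue never shrinks
    return backlog / net_rate


if __name__ == "__main__":
    # Two controllers at 70% each cannot fail over onto one node.
    print(survives_single_failure([70, 70], 100))
    # At 45% each, the surviving node can carry both loads.
    print(survives_single_failure([45, 45], 100))
    # Assumed backlog of 300,000 messages with a small processing surplus.
    print(drain_hours(300_000, 40_000, 50_000))
```

With a two-node pair, each node has to run at or below half of its capacity for a failover to be invisible, which is presumably why the plan includes a second SAN. The drain calculation shows why extra SMTP hub capacity matters: the backlog only shrinks at the rate by which processing exceeds incoming mail, so a small surplus means a long recovery.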
