Friday, March 12, 2010

Intermedia Outage

Intermedia finally sent an RFO (Reason for Outage) for the major outage on March 5th. Sounds like the real problem was that they weren't keeping the firmware upgraded on their EMC boxes. I wonder if they knew that new firmware was available and that there was a problem with the old version. I always recommend monitoring firmware versions on all hardware, reading the release notes whenever a new version comes out, and applying the firmware if it will fix a bug or prevent a failure. I've seen this a lot with the Dell RAID firmware, which seems to get upgraded every 6 months or so and fixes problems they've discovered that could otherwise cause a RAID failure. Hopefully this will be the last outage for a while.
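To make that concrete, here is a minimal sketch of the kind of check I mean. The JSON baseline/inventory layout is just an assumption for illustration; in practice you would populate the inventory by querying the vendor's management tool (Dell OpenManage, EMC's tools, etc.) rather than maintaining it by hand.

```python
#!/usr/bin/env python3
"""Compare installed firmware versions against a known-good baseline.

Both files are plain JSON maps of device name -> firmware version, e.g.
{"raid-controller-0": "6.2.0-0013"}. This is only a sketch; a real check
would gather the installed versions from the vendor's management tooling.
"""
import json
import sys


def load_versions(path):
    with open(path) as f:
        return json.load(f)


def main(baseline_path, inventory_path):
    baseline = load_versions(baseline_path)
    inventory = load_versions(inventory_path)
    drift = []
    for device, expected in sorted(baseline.items()):
        installed = inventory.get(device)
        if installed is None:
            print(f"WARNING: {device} missing from inventory")
        elif installed != expected:
            drift.append((device, installed, expected))
    for device, installed, expected in drift:
        print(f"{device}: installed {installed}, baseline {expected}")
    # Non-zero exit lets cron or a monitoring agent flag the drift.
    sys.exit(1 if drift else 0)


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

Run it on a schedule against a baseline you update whenever you review new release notes, and it will flag any device that drifts from the versions you've approved. The full text of Intermedia's RFO letter is below.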

Dear David Moss
Regarding your account: Tek24hour
As a follow-up to the letter from Intermedia’s CEO, Serguei Sofinski, regarding the March 5, 2010 outage affecting your service, and in our continuing commitment to complete transparency, this letter addresses the following items:
• Detailed Reason for Outage (RFO)
• Corrective action plans
• Timeline of events surrounding the outage
RFO – Client Infrastructure
On March 5 at approximately 6 a.m. PST, Intermedia’s monitoring system began to display alerts for high RPC (remote procedure call) latency on several of our Exchange database servers in multiple domains. Distributed applications and services within the Exchange domain communicate via RPC. The high RPC latency in turn began to affect the front-end services within each domain that process mail flow and manage client connectivity. The RPC latency continued to increase and eventually hit a critical point that effectively prevented the processing of commands by the Exchange database servers, which caused front-end services to back up, resulting in the queuing of mail and the disconnecting of clients. Minutes later all cluster services on the Exchange databases began to fail. By 6:30 a.m. PST, all senior Intermedia engineers were engaged in resolving the issue.
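(For illustration, the threshold alerting described here boils down to a poll-and-compare loop. In the sketch below, sample_rpc_latency_ms() is a stand-in for whatever counter the monitoring system actually reads, typically the Exchange RPC Averaged Latency perfmon counter, and the 50 ms / 100 ms thresholds are assumed values, not Intermedia's.)

```python
import random
import time

# Stand-in sampler: a real monitor would read the Exchange RPC Averaged
# Latency performance counter (or the monitoring system's equivalent).
def sample_rpc_latency_ms():
    return random.uniform(5, 120)

WARN_MS = 50       # assumed warning threshold, for illustration only
CRITICAL_MS = 100  # assumed critical threshold


def monitor(poll_seconds=60, samples=5):
    """Alert when sampled RPC latency crosses the warning or critical level."""
    for _ in range(samples):
        latency = sample_rpc_latency_ms()
        if latency >= CRITICAL_MS:
            print(f"CRITICAL: RPC latency {latency:.0f} ms")
        elif latency >= WARN_MS:
            print(f"WARNING: RPC latency {latency:.0f} ms")
        time.sleep(poll_seconds)


if __name__ == "__main__":
    monitor(poll_seconds=1)
```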
The cause of the RPC latency on the Exchange database servers was poor I/O (input/output) performance on one of our EMC CX3-80 SANs. This resulted in long wait times for reads and writes to the databases. The root cause of the poor I/O performance on the SAN was determined to be faulty hardware; specifically, disk 14 in enclosure five (5) was in a partially failed state. By design this should not have affected performance of the EMC CX3-80 SAN or the Exchange database servers connected to the SAN.
The EMC CX3-80 is an enterprise SAN designed with redundant components. Each EMC CX3-80 SAN contains 32 enclosures that are serviced by redundant controllers, each with live service processors. Data stored on the SAN is striped across multiple disks within multiple enclosures. Each enclosure has 14 active disks plus a hot standby disk available within it to take over for failed disks. All Exchange database servers are clustered and each server within the cluster is multi-pathed via separate fiber connections and fiber switches to each service processor. The databases reside as single copies of data on the SAN.
Under normal operation, if the service processor on a SAN recognizes a faulty disk, it will automatically bypass it and replace it with the hot standby. The hot standby then becomes part of the RAID group and data is automatically redistributed to it as a background process. Because data is striped across multiple disks using bit parity, this action happens automatically without impacting performance of the SAN.
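(As a rough illustration of that hot-standby behavior, and not EMC's actual firmware logic, the decision amounts to: when a disk is flagged as faulty, promote the spare into the RAID group and rebuild onto it from parity in the background.)

```python
from dataclasses import dataclass, field


@dataclass
class RaidGroup:
    """Toy model of the hot-spare swap described above, not EMC's real logic."""
    active_disks: list
    hot_spare: str
    failed: list = field(default_factory=list)

    def handle_disk_failure(self, disk):
        # Bypass the faulty disk and promote the hot spare in its place.
        idx = self.active_disks.index(disk)
        self.failed.append(disk)
        self.active_disks[idx] = self.hot_spare
        self.hot_spare = None
        # The rebuild runs as a background task, reconstructing the failed
        # disk's data from parity on the surviving members, so host I/O
        # continues while data is redistributed to the new member.
        print(f"Bypassed {disk}; rebuilding onto {self.active_disks[idx]} from parity")


group = RaidGroup(active_disks=[f"disk{n}" for n in range(14)], hot_spare="spare0")
group.handle_disk_failure("disk13")
```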
The failure that occurred on March 5 was unique in that the SAN did not perform as expected. The faulty disk was generating large numbers of soft SCSI errors, but the service processor failed to remove it from the RAID group. Service Processor A continued to process the large volume of errors being created by the faulty disk, which in turn made it unable to deliver data to the servers at an acceptable rate.
At approximately 9:30 a.m. PST the faulty hard drive was identified as the root cause of the issue and was manually bypassed, and performance of the SAN returned to normal. Based on experience, the time between 6:30 a.m. PST and 9:30 a.m. PST was spent troubleshooting more common causes of performance degradation within the SAN and associated fiber network.
Due to their extreme sensitivity to high RPC latency, all Microsoft Clustering Services within the affected domains had failed. A cold restart of all nodes was required to return services to normal. A cold restart requires shutting down all servers and bringing them up one at a time until everything is back on-line. Although a dozen Intermedia system administrators were focused on this task as their sole priority, the restoration of Exchange services took several hours and was completed by approximately 11:30 a.m. PST.
During the event, incoming mail was queued within the hubs and/or mail filters and then subsequently delivered throughout the afternoon to the Exchange database servers.
RFO – Corporate Communications and Account Administration tools
During the event, our ability to communicate status effectively was hindered by an outage of our corporate communication tools until 9:50 a.m. PST. The databases for www.Intermedia.net, Intermedia’s client control panel and Intermedia’s trouble ticket system were located on the affected SAN and therefore were not available during the SAN event. These systems were restored as soon as the SAN performance issue was resolved. All available personnel were directed to answer incoming customer calls. Intermedia logged over 2,000 incoming calls to our PBX and effectively answered more than 1,000 of those calls.
Corrective Action Plans
At the time of the event, Intermedia escalated the performance issue with the EMC CX3-80 SAN to both Dell and EMC senior support engineers. Both Dell and EMC have continued to evaluate the root cause of the event since the outage and have recommended an upgrade to the version of flare code (firmware) running on the affected EMC CX3-80. The newer version of flare code has improvements in the way the system processes different types of disk failures. It is the belief of both Dell and EMC that this newer version of flare code will prevent a recurrence of a similar issue. We are planning the upgrade at this time and expect to have it completed within the next 30 days. You will receive a maintenance notification when the upgrade is scheduled.
As a high priority for completion, no later than Q2, Intermedia will also be isolating corporate communication infrastructure from the same infrastructure that provides our Exchange services, guaranteeing that we will be able to communicate effectively with clients at all times during a service interruption. Additionally, we will be rolling out a new, internally developed, client communication tool in late Q2 that enables more efficient and proactive communication with our clients via SMS as well as email.
Timeline of Event (PST)
• 6:00 a.m. – RPC latency threshold monitors begin to alert
• 6:30 a.m. – Client services begin to be impacted
• 6:30 a.m. – Intermedia VP of Operations and COO are notified and the event is classified as a Severity 1 outage; the critical response team is deployed
• 6:30 a.m. – 7:30 a.m. – SAN processing priorities are adjusted in an attempt to improve performance of the SAN without success
• 7:30 a.m. – 8:30 a.m. – Indications of fiber path errors lead the team to troubleshoot potential fiber network issues
• 8:00 a.m. – Dell engineers are engaged to help identify root cause of the SAN performance issues
• 9:30 a.m. – Faulty hard drive is determined to be the root cause of the issue and is manually bypassed returning SAN performance to normal
• 9:50 a.m. – Control panel, www.intermedia.net and the ticketing system are back on-line
• 9:30 a.m. – 11:30 a.m. – All Exchange services are brought back on-line
• 11:30 a.m. – 5:00 a.m. – BlackBerry services catch up and mail queues clear
We recognize the importance of business communications and understand the great responsibility we have accepted by being your chosen provider. I want to assure you that from the moment the outage was classified as a Severity 1 event, Intermedia’s most senior engineers were engaged and focused on resolving the issue as their sole priority. After any service impacting outage, we invest significant resources in analyzing the event in an attempt to continually improve the service levels we deliver. Your feedback is always appreciated and helps Intermedia better serve you. We welcome your feedback regarding our services at Feedback@intermedia.net. This distribution list is monitored by the entire Intermedia management team.
Sincerely,
Jonathan McCormick
Intermedia Chief Operating Officer
