Wednesday, December 2, 2020

Lessons Learned from the November AWS Outage

Context, Analysis, and Impact

  • Amazon Web Services (AWS), Amazon’s internet infrastructure service, experienced a multi-hour outage on Wednesday, November 25th, that affected a large portion of the internet.
  • The outage, in the AWS region covering the eastern U.S., impacted more than 50 companies, including Roku, Adobe, Flickr, Twilio, Tribune Publishing, and Amazon’s smart security division, Ring.
  • Business impacts, as reported by WSJ, included:
    • New account activation and the mobile app for streaming media service Roku were hampered.
    • Target-owned delivery service Shipt could still receive and process some orders, though it said it was taking steps to manage capacity because of the outage.
    • Photo storage service Flickr tweeted that customers couldn’t log in or create an account because of the AWS outage.

Tweets by companies experiencing outages.

  • Root Cause Analysis by AWS: The failure began in Amazon Kinesis and then spread to a long list of other services. You can read the RCA document by AWS, which is also summarized below:
    Flowchart of AWS impact spread.

Lessons Learned

#1 — Don't Put All Your Eggs in One Basket

  • Relying on a single cloud service provider leaves you with no fallback when that provider has an outage.
  • Consider a hybrid-cloud, private-cloud, or multi-cloud strategy, particularly for peak season.

#2 — Hope for the Best and Plan for the Worst

  • Don't rely solely on a cloud provider's availability and multi-region failover strategy; build your own resiliency and disaster recovery approach.
  • Practice disaster recovery in production or production-like systems, for example with an active-active setup across multi-cloud or hybrid-cloud environments.
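As a rough illustration of building your own failover rather than depending on the provider's, here is a minimal Python sketch. It tries each configured provider in order and returns the first successful result. The names `call_with_failover`, `primary_store`, and `secondary_store` are hypothetical and stand in for whatever calls your system makes to each cloud; a real setup would also need data replication and health checks, which are not shown.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed."""


def call_with_failover(
    providers: Sequence[Tuple[str, Callable[[], T]]]
) -> Tuple[str, T]:
    """Try each provider in order; return (name, result) from the first success."""
    errors: List[str] = []
    for name, operation in providers:
        try:
            return name, operation()
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def primary_store() -> str:
    # Hypothetical call to the primary cloud, simulated here as being down.
    raise ConnectionError("primary region unavailable")


def secondary_store() -> str:
    # Hypothetical call to a second cloud or an on-premises replica.
    return "order saved"


if __name__ == "__main__":
    used, result = call_with_failover(
        [("primary", primary_store), ("secondary", secondary_store)]
    )
    print(used, result)
```

Running a drill then amounts to forcing the primary call to fail, as `primary_store` does here, and checking that traffic really lands on the secondary.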

#3 — Monitoring and Observability Are Not Static

  • Be innovative in exploring monitoring and observability patterns. For example, if AWS reports an outage on its status page, your monitoring system should spring into action and alert the incident resolution team to start analyzing the impact.
  • Keep the service dependency graph ready. Tools can generate most of it, but you should keep it current so that when an outage happens you can assess the impact, map it to business functions, and report it accurately to your business team.
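To make the dependency-graph idea concrete, here is a small Python sketch that walks a graph from a failed service to everything downstream of it, then maps the result to business functions. The graph, the service names, and the business-function labels are all made up for illustration; they are not taken from the AWS RCA, and in practice the graph would come from your discovery or tracing tooling.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical graph: each key is a service, each value lists the services
# that depend on it directly. Replace with your own discovered topology.
DEPENDENTS: Dict[str, List[str]] = {
    "kinesis": ["metrics", "auth"],
    "metrics": ["autoscaling"],
    "auth": ["login-api"],
    "autoscaling": [],
    "login-api": [],
    "billing-db": ["invoice-api"],
    "invoice-api": [],
}

# Hypothetical mapping from services to the business functions they back.
BUSINESS_FUNCTIONS: Dict[str, str] = {
    "login-api": "Customer sign-in",
    "autoscaling": "Peak-traffic capacity",
    "invoice-api": "Invoicing",
}


def impacted_services(failed: str, dependents: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first walk collecting the failed service and everything downstream."""
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def impacted_business_functions(failed: str) -> List[str]:
    """Translate the technical blast radius into business terms, sorted for stable output."""
    services = impacted_services(failed, DEPENDENTS)
    return sorted(BUSINESS_FUNCTIONS[s] for s in services if s in BUSINESS_FUNCTIONS)
```

With this toy graph, `impacted_business_functions("kinesis")` reports the sign-in and peak-capacity functions but not invoicing, which is the kind of statement a business team can act on.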

#4 — Invest in Emerging Techniques, like Chaos Engineering

  • This failure suggests that even internet giants like AWS are still maturing in practices like chaos engineering. So, start putting chaos engineering practices on your roadmap.
  • For example, had a bulkhead pattern been applied in the AWS outage scenario, the outage could have been limited to Kinesis alone.
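For readers unfamiliar with the bulkhead pattern, the sketch below shows one common form of it in Python: each dependency gets its own fixed pool of concurrent slots, so a saturated dependency is rejected quickly instead of consuming capacity that other calls need. The `Bulkhead` class and its names are hypothetical, and this says nothing about how AWS structures its own services; it only illustrates the isolation idea.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots left."""


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot exhaust shared capacity."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation: Callable[[], T]) -> T:
        # Fail fast instead of queueing: a saturated dependency is rejected
        # immediately rather than tying up the caller's threads.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            return operation()
        finally:
            # Always free the slot, even if the operation raised.
            self._slots.release()
```

A caller would create one `Bulkhead` per downstream dependency, for example `Bulkhead("streaming", 10)` and `Bulkhead("auth", 10)`, so that exhausting the streaming compartment leaves the auth compartment untouched. A chaos experiment can then verify this by deliberately slowing one dependency and confirming the others keep serving.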

To conclude, being proactive when outages occur, having a response team equipped for unplanned outages, and continuously improving from lessons learned are essential to keeping the impact limited. A multi-cloud or hybrid-cloud strategy is also worth considering to keep the business running.



from DZone.com Feed https://ift.tt/33EHfd3
