AWS Deep Dive

AWS Well-Architected Framework

Questions and Best Practices

Operational Excellence

Prepare

Automate Testing and Rollback

This section is about testing updated services/applications after deployment (and automating the rollback if these tests fail). The meta-point is to automate rollout testing for every stage of the development lifecycle, from individual (and potentially rapid) test builds to more “stable” alpha/beta/gamma deployments (and eventually into production).
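A minimal sketch of that deploy, test, and roll back loop. The `deploy`, `rollback`, and check callables are hypothetical placeholders for whatever a real pipeline would plug in; nothing here is an AWS API.

```python
from typing import Callable, Iterable


def deploy_with_rollback(
    deploy: Callable[[], None],
    rollback: Callable[[], None],
    checks: Iterable[Callable[[], bool]],
) -> bool:
    """Deploy, run post-deployment checks, and roll back on the first failure.

    Returns True if the deployment was kept, False if it was rolled back.
    """
    deploy()
    for check in checks:
        try:
            ok = check()
        except Exception:
            # A check that crashes is treated the same as a check that fails.
            ok = False
        if not ok:
            rollback()
            return False
    return True


if __name__ == "__main__":
    log = []
    kept = deploy_with_rollback(
        deploy=lambda: log.append("deploy"),
        rollback=lambda: log.append("rollback"),
        checks=[lambda: True, lambda: False],  # the second check fails
    )
    print(kept, log)
```

The same function could in principle be reused at every stage (test builds, alpha/beta/gamma, production), with only the list of checks changing.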

Ensure a Consistent Review of Operational Readiness

Basically, have a standardized launch checklist and a standardized service/app review checklist. Amazon has a whole standardized process for these “Operational Readiness Reviews”.

Use Runbooks to Perform Procedures

More checklists! These are step-by-step lists breaking down how to do something. Automate (or partially automate) whenever possible.

(It feels like there’s a space here for “self-documenting runbook automation”, kind of along the lines of literate programming.)
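One way such a self-documenting runbook might look, purely as a sketch: each step is a function, its docstring is the checklist text, and the same object can either render the checklist or run it. The `Runbook` class and the example steps are made up for illustration.

```python
from typing import Callable, List, Optional


class Runbook:
    """An ordered checklist whose steps are either automated or manual."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.steps: List[Callable[[], Optional[str]]] = []

    def step(self, func):
        """Decorator: register a function as the next step of the runbook."""
        self.steps.append(func)
        return func

    def render(self) -> str:
        """Produce the human-readable checklist from the step docstrings."""
        lines = [self.title]
        for number, func in enumerate(self.steps, start=1):
            lines.append(f"{number}. {(func.__doc__ or func.__name__).strip()}")
        return "\n".join(lines)

    def run(self) -> List[str]:
        """Execute every step in order and collect a short result per step."""
        results = []
        for func in self.steps:
            outcome = func()
            results.append(f"{func.__name__}: {outcome or 'done'}")
        return results


# Hypothetical example runbook; the step contents are invented.
restart = Runbook("Restart the example service")


@restart.step
def drain_traffic():
    """Drain traffic from the instance."""
    return "manual: confirm in the load balancer console"


@restart.step
def restart_process():
    """Restart the service process."""
    return None  # Pretend this step was fully automated.


if __name__ == "__main__":
    print(restart.render())
    print(restart.run())
```

A step that cannot be automated yet just returns an instruction for the human, which is one way to get the "partially automate" part.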

Use Playbooks to Investigate Issues

Apparently the difference between a “runbook” and a “playbook” is that the former is about how to do something within or related to a particular service/application, while the latter is a slightly more open-ended document based around investigating issues or behaviors. (Not too open-ended, though, as part of the goal of playbooks is automation as well.)

Given that the processes of action and investigation seem isomorphic to each other (the difference is mostly in why you are doing the thing and the potential scope of what you’re looking at), I’m not sure why two different words are necessary…

It’s interesting that Amazon is specifically calling out the idea of using Python for automation and writing playbooks as Jupyter notebooks. So we really are borrowing a lot of ideas from literate programming here.

Operate

Define Workload Metrics

AWS divides metrics into two categories: Those that measure workload progress towards KPIs and those that measure workload health.

Collect and Analyze Workload Metrics

Basically, metrics by themselves are seldom useful; what’s useful is understanding how a given metric evolves over time, or how it behaves in conjunction with other metrics.
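A small illustration of both ideas, with invented numbers: the raw error count looks flat, but dividing it by request volume (a second metric) and looking at the change between intervals tells a different story. The function names are my own.

```python
from typing import List, Sequence


def error_rates(errors: Sequence[int], requests: Sequence[int]) -> List[float]:
    """Combine two metrics: errors as a fraction of requests, per interval."""
    return [e / r if r else 0.0 for e, r in zip(errors, requests)]


def deltas(values: Sequence[float]) -> List[float]:
    """Change between consecutive data points of one metric."""
    return [b - a for a, b in zip(values, values[1:])]


if __name__ == "__main__":
    errors = [5, 5, 5]            # constant error count per interval
    requests = [1000, 500, 100]   # but traffic is falling
    rates = error_rates(errors, requests)
    print(rates)          # the error rate climbs each interval
    print(deltas(rates))  # and the climb is accelerating
```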

Establish Workload Metrics Baselines

What is “normal”, anyway?
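One simple (and by no means the only) way to pin down "normal": summarize past values as a mean and standard deviation, then flag readings that land too far away. The threshold of 3 standard deviations is an arbitrary assumption for the sketch.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def baseline(history: Sequence[float]) -> Tuple[float, float]:
    """Summarize "normal" as the mean and sample standard deviation.

    Needs at least two data points, since stdev() is undefined for one.
    """
    return mean(history), stdev(history)


def is_unusual(value: float, history: Sequence[float], threshold: float = 3.0) -> bool:
    """Flag a value more than `threshold` standard deviations from the mean."""
    center, spread = baseline(history)
    if spread == 0:
        # A perfectly flat history: anything different is unusual.
        return value != center
    return abs(value - center) / spread > threshold


if __name__ == "__main__":
    history = [100, 102, 98, 101, 99]  # invented latency samples
    print(is_unusual(101, history))
    print(is_unusual(120, history))
```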

Learn Expected Patterns of Activity for Workload

What is “normal”, anyway? (Time series edition.)
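For the time series version, a single global baseline is not enough, because "normal" at 3 AM differs from "normal" at 3 PM. A rough sketch, assuming samples arrive as (hour of day, value) pairs, is to keep a separate average per hour:

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def hourly_profile(samples: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Average the metric separately for each hour of the day (0-23)."""
    buckets = defaultdict(list)
    for hour, value in samples:
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in buckets.items()}


def deviation_from_pattern(hour: int, value: float, profile: Dict[int, float]) -> float:
    """How far a new reading is from what is typical for that hour."""
    return value - profile[hour]


if __name__ == "__main__":
    # Invented request counts: quiet at 03:00, busy at 15:00.
    samples = [(3, 10.0), (3, 14.0), (15, 200.0), (15, 220.0)]
    profile = hourly_profile(samples)
    # 200 requests is ordinary at 15:00 but far above the 03:00 pattern.
    print(deviation_from_pattern(15, 200.0, profile))
    print(deviation_from_pattern(3, 200.0, profile))
```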

Alert When Workload Outcomes are at Risk

Interesting automation tool: Amazon CloudWatch Synthetics. Basically an AWS service that lets you automatically interact with a service/application that you’ve deployed in order to generate metric data points. The idea here is to catch emerging problems without relying solely on users running into them. This is also probably a useful tool for continuously probing areas of a service/application that are less frequently used.
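To make the idea concrete, here is a generic sketch of a synthetic probe. It is not the CloudWatch Synthetics API: `run_probe` and `DataPoint` are hypothetical names, and the `action` callable stands in for whatever scripted interaction (an HTTP request, a login flow) you would actually run on a schedule.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class DataPoint:
    name: str
    success: bool
    latency_seconds: float


def run_probe(name: str, action: Callable[[], None]) -> DataPoint:
    """Run one scripted interaction; record whether it worked and how long it took."""
    started = time.perf_counter()
    try:
        action()
        success = True
    except Exception:
        # Any exception from the scripted interaction counts as a failed probe.
        success = False
    return DataPoint(name, success, time.perf_counter() - started)


if __name__ == "__main__":
    def broken_checkout() -> None:
        raise RuntimeError("simulated failure in a rarely used code path")

    print(run_probe("homepage", lambda: None))
    print(run_probe("checkout", broken_checkout))
```

Each `DataPoint` would then be published as a metric, so a rarely exercised path that breaks shows up as failed probes rather than as a user complaint.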