Well-Architected: What it Means in the AWS Cloud

Discover how Akkodis helps you optimize your AWS cloud solutions with the Well-Architected Framework. Learn the importance of balancing performance, security, and cost for effective cloud management.

9 minutes

22nd of August, 2024

By James Bromberger, VP Cloud Computing, Akkodis Australia

There used to be an excellent phrase in the Perl software development community: TIMTOWTDI. Pronounced “Tim-toady,” it's an acronym for “There is more than one way to do it.”

So much of life is true to this, but some ways are better than others, depending on what you’re measuring: speed, cost, efficiency, durability, flexibility, or observability. Many more adjectives can pull your understanding of “good” in different directions.

Cost optimization is valuable, but overly aggressive cost optimization can be a false economy.

If your digital solution is only operational during business hours, turning the service off outside those hours may seem reasonable. However, if the users (clients) turn up early one day, and it takes you an hour to start the system back up again, then your clients may have a bad experience.
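As a minimal sketch of that trade-off, the schedule below starts the service early enough to absorb both the start-up time and some early arrivals. The opening hours, the one-hour start-up time, and the 30-minute buffer are illustrative assumptions, and weekends are ignored for brevity.

```python
from datetime import datetime, time, timedelta

# Hypothetical values for illustration only.
BUSINESS_OPEN = time(8, 0)
BUSINESS_CLOSE = time(18, 0)
STARTUP_DURATION = timedelta(hours=1)          # how long the system takes to start
EARLY_ARRIVAL_BUFFER = timedelta(minutes=30)   # allowance for users who turn up early


def should_be_running(now: datetime) -> bool:
    """Return True if the service should be up (or starting up) at `now`."""
    open_at = datetime.combine(now.date(), BUSINESS_OPEN)
    close_at = datetime.combine(now.date(), BUSINESS_CLOSE)
    # Begin start-up early enough to be ready before the first users arrive.
    start_at = open_at - STARTUP_DURATION - EARLY_ARRIVAL_BUFFER
    return start_at <= now < close_at


if __name__ == "__main__":
    for hour, minute in [(6, 0), (6, 30), (12, 0), (18, 0)]:
        moment = datetime(2024, 8, 22, hour, minute)
        print(moment.strftime("%H:%M"), should_be_running(moment))
```

The buffer costs a little extra running time each day, which is the price of not giving early users a bad experience.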

Choosing the Right Digital Solution for Your Organization

Digital solutions for your company also come in three major approaches these days:

Roll your own with bespoke/custom development;

Roll your own, using Commercial Off-The-Shelf Software and/or Open Source Software;

Subscribe to a SaaS platform and make the operational trade-offs Someone Else’s Problem (S.E.P.).

“S.E.P.” may seem like a great idea, but as a subscriber to a SaaS platform, you’ll be asking the same reliability, durability, and availability questions and validating the cost model.

“Rolling your own” solution gives your organization direct control of those operational decisions, many of which are chosen by the technical implementation team (probably known as the IT team).

However, your organization may be unaware of the choices your technical team makes regarding durability, availability, and the cost options available to it.

Indeed, the irony is that the team that costs you millions to billions of dollars/euros/yen in software selection and infrastructure purchases or consumption is often the same team that is not trusted for a $2,000 flight to a conference. This conference could help them learn how to save vast amounts of money for your organization.

Navigating the Proper Combination for Your Workload

With all these choices, how do you find the right combination for your workload?

Part of the answer is experience—long-term experience, with battle scars from managing digital workloads through incidents and issues while continuing to meet defined service level objectives (SLOs) for the workloads.

A more structured approach is the Well-Architected cloud engineering concept, which was born around 2014 within a small Solution Architecture team at AWS. This team started with five different perspectives to observe a workload. Some of these perspectives overlap or complement each other, and some are opposites. Finding the balance that fits your workload and organization requires a degree of trade-off.

These perspectives are deemed strong Pillars of the Well-Architected Framework. The concept is well documented, and it should be required reading for all solution architects and system administrators using the AWS Cloud.

The initial set of pillars was:

Operational Excellence

Security

Reliability

Performance Efficiency

Cost Optimization

They are all considered equally important until you ask a client. The naive initial response is that Cost Optimization is the only important pillar, in which case, turning off all cloud resources and deleting all data is the ultimate Cost Optimization. Sure, it sacrifices everything else, but it meets that initial requirement.

Balancing Performance, Cost, and Flexibility in Cloud Solutions

After some consideration, the view emerges that Performance Efficiency is what balances cost against user experience.

This is fine until you need to make some sort of change. If your deployed solution in the cloud doesn’t let you make changes easily, such as patching and security updates, then you may be stuck with a sub-optimal solution.

Case in point: AWS and its computing partners often make new generations of CPUs available in the Amazon EC2 service. Following Moore’s Law, these are often cheaper and faster, and that saving is often passed along to the customer. However, AWS cloud users have to take action to change virtual machine types. You’re welcome to keep using the older type, but you don’t get the possible 5% cost saving.

For many workloads, it’s a simple reboot to accomplish this, with a minor downtime on a single node (instance, VM). If you only have one node, your service will likely feel that downtime. If you had two virtual machines, you could do this one at a time and avoid a complete outage.

That choice right there, one big instance versus two (possibly smaller) instances, is an example of operational excellence impacting availability/reliability. Quite often, one virtual machine at 2xlarge size costs the same as two instances at xlarge size (in the same family). Sure, you may add a load balancer, but your flexibility to direct traffic is vital.
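The comparison can be sketched with some simple arithmetic. The hourly prices and vCPU counts below are hypothetical placeholders rather than real AWS rates; they only assume that the price doubles with each size step within a family, and they leave out the cost of any load balancer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fleet:
    name: str
    node_count: int
    vcpus_per_node: int
    hourly_price_per_node: float  # hypothetical price, not a real AWS rate

    def total_vcpus(self) -> int:
        return self.node_count * self.vcpus_per_node

    def hourly_cost(self) -> float:
        return self.node_count * self.hourly_price_per_node

    def capacity_during_single_node_reboot(self) -> float:
        """Fraction of capacity still serving while one node reboots."""
        return (self.node_count - 1) / self.node_count


one_big = Fleet("1 x 2xlarge", node_count=1, vcpus_per_node=8, hourly_price_per_node=0.40)
two_small = Fleet("2 x xlarge", node_count=2, vcpus_per_node=4, hourly_price_per_node=0.20)

if __name__ == "__main__":
    for fleet in (one_big, two_small):
        print(
            f"{fleet.name}: {fleet.total_vcpus()} vCPUs, "
            f"${fleet.hourly_cost():.2f}/hour, "
            f"{fleet.capacity_during_single_node_reboot():.0%} capacity during a reboot"
        )
```

Under these assumptions both fleets offer the same vCPU count for the same hourly cost, but only the two-node fleet keeps serving traffic while one node is rebooted.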

Once you can selectively have downtime on individual service nodes without a complete service outage, you’re more likely to actively apply security patches.

An Example: To AutoScale or Not

For those unfamiliar with EC2 AutoScale, it is a way of automating the bootstrapping of a virtual machine from an image. That initial image may be a basic install or a private image that has been customized from some basic install.

The objective of using AutoScale is to remove the need for human (sysadmin) intervention during the commissioning of additional virtual machine(s). This is implemented by taking all of the installation tasks and scripting them, a process that can take some time to get just right. It's an investment of time that you’ll see pay off, and in ways you may not initially expect.

I highly recommend AutoScale and Elastic Load Balancing, even if you have a single instance. Although this may seem counterintuitive from a direct cost perspective, the reasons are solid.

When Scaling up and down

For spiky workloads, the ability to add processing capacity on demand, with minimal delay, is a huge advantage. You’re not overspending on unused capacity, and you’re not caught out by under-provisioning. Additional hosts can come online within the limits you define, and older ones can be removed.
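To illustrate scaling "within the limits you define", here is a simplified proportional rule that sizes the fleet from average CPU load and then clamps the result between a minimum and a maximum. It is a sketch for reasoning about the idea, not the algorithm AWS uses; the function name, the CPU metric, and the thresholds are assumptions.

```python
import math


def desired_capacity(
    current_nodes: int,
    avg_cpu_percent: float,
    target_cpu_percent: float,
    min_nodes: int,
    max_nodes: int,
) -> int:
    """Size the fleet so average CPU moves toward the target, within limits."""
    if current_nodes < 1 or target_cpu_percent <= 0:
        raise ValueError("need at least one node and a positive CPU target")
    # Scale the node count in proportion to how far the load is from the target.
    raw = math.ceil(current_nodes * avg_cpu_percent / target_cpu_percent)
    # Never go below the minimum or above the maximum you have defined.
    return max(min_nodes, min(max_nodes, raw))


if __name__ == "__main__":
    print(desired_capacity(2, 90, 50, min_nodes=1, max_nodes=6))   # spike: scale out
    print(desired_capacity(4, 10, 50, min_nodes=2, max_nodes=6))   # quiet: scale in to the floor
    print(desired_capacity(4, 100, 50, min_nodes=1, max_nodes=6))  # capped at the ceiling
```

The maximum protects you from runaway spend, and the minimum protects availability when the workload goes quiet.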

This forces a few good behaviors. Firstly, no logs should be left on instances. They should all be egressed to appropriate log retention and access services. This removes the need for developers to access a host to see logs—these developers and support folk should already have access to the logs in escrow. They also should have read-only access to protect the integrity of the logs.

When Replacing the Underlying Operating System

Over time, your OS will need replacing. Linux and Windows versions come along every few years, and if you have already been using AutoScale, then the work required to move forward will be somewhat easier. The best-case scenario is that the same bootstrapping works with the new base image.

When Patching Becomes a Long Process

Think of a brand-new Operating System release to run your workload in EC2. Zero patches are outstanding. Let’s say that after a period, a patch is released that takes 10 seconds to apply to each freshly deployed instance. After a year, there are ten patches, taking 100 seconds to apply from the vanilla base image.

At some stage, the time to apply pending patches from a vanilla base image becomes too long, and it's time to create a custom base image (AMI) with the current set of available patches already applied. Since you have already automated the bootstrap, adjusting to a partially prepared image should be straightforward.
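The arithmetic behind that decision is simple enough to sketch. The 10 seconds per patch comes from the example in this section; the 60-second tolerance is a hypothetical threshold that you would set for your own workload.

```python
SECONDS_PER_PATCH = 10            # figure used in the example in this section
MAX_BOOTSTRAP_PATCH_SECONDS = 60  # hypothetical tolerance for patching at boot


def patch_seconds(pending_patches: int, seconds_per_patch: int = SECONDS_PER_PATCH) -> int:
    """Time spent applying pending patches on each fresh instance."""
    return pending_patches * seconds_per_patch


def should_bake_new_image(
    pending_patches: int, max_seconds: int = MAX_BOOTSTRAP_PATCH_SECONDS
) -> bool:
    """True once boot-time patching takes longer than you are willing to wait."""
    return patch_seconds(pending_patches) > max_seconds


if __name__ == "__main__":
    for patches in (1, 6, 7, 10):
        print(patches, patch_seconds(patches), should_bake_new_image(patches))
```

Every instance launched by a scale-out event pays this delay, so the threshold matters most for spiky workloads that need new capacity quickly.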

An Additional Pillar: Sustainability

With corporate social responsibility being an important part of the modern enterprise, a new pillar, Sustainability, was added to the Well-Architected Framework in December 2021.

This is closely related to Performance Efficiency and often aligns well with Cost Optimization. If you are using Serverless and Event-driven approaches to application architecture, you can significantly improve your solution's environmental impact.

Furthermore, some of your design objectives may benefit sustainability, for example, by reducing the frequency of your reporting, the detail or retention of data storage, or the complexity of your data model.

Zooming out to 20,000 ft

Taking these perspectives to guide the implementation and operation of your workloads brings multiple benefits:

Any interruption to service is minimized during a change being deployed;

You are more likely to make frequent minor improvements;

You have the flexibility to innovate as the cloud or your organization's requirements change;

Your time to value for these improvements is shortened, as you can get them deployed sooner.

A key reminder is that the future is uncharted territory. You don’t know exactly how things will change, but you know that new security vulnerabilities will be discovered. New ways of encrypting and securing data will happen, new protocols to move data will be used, existing services will evolve, and new services will be launched that may solve some technical requirements. You should be reviewing and contemplating if these changes suit your workload.

How to Learn the Well-Architected Principles

There are multiple (free) courses on AWS Skill Builder to learn this. However, I would recommend sitting down with the documentation and reading it cover to cover. While doing so, contemplate a workload you have architected and implemented yourself and how you would change it.

Once this is done, use the Well-Architected tool in the AWS Console to evaluate your first workload. Always date the evaluation—perhaps by year and month—so you can compare the current evaluation to one you will do in the future.

Tracking the rate of change between Well-Architected reviews is a useful metric. Are you addressing technical debt, or is the list of improvements getting away from you? What does this mean to your organization – is it critical, important, informational, or irrelevant?
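One way to track that rate of change is to record the number of open issues from each dated review and compare consecutive reviews. The sketch below assumes you copy the high and medium risk issue counts out of each review by hand; the `Review` class, the labels, and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    label: str               # year and month of the review, e.g. "2024-08"
    high_risk_issues: int
    medium_risk_issues: int

    def open_issues(self) -> int:
        return self.high_risk_issues + self.medium_risk_issues


def issue_trend(reviews: list[Review]) -> list[tuple[str, int]]:
    """Change in open issues between each review and the one before it."""
    # "YYYY-MM" labels sort chronologically as plain strings.
    ordered = sorted(reviews, key=lambda review: review.label)
    return [
        (later.label, later.open_issues() - earlier.open_issues())
        for earlier, later in zip(ordered, ordered[1:])
    ]


# Hypothetical review history for a single workload.
history = [
    Review("2024-08", high_risk_issues=9, medium_risk_issues=21),
    Review("2023-08", high_risk_issues=12, medium_risk_issues=20),
    Review("2024-02", high_risk_issues=8, medium_risk_issues=18),
]

if __name__ == "__main__":
    for label, change in issue_trend(history):
        print(f"{label}: {change:+d} open issues since the previous review")
```

A negative change means technical debt is being paid down; a run of positive changes suggests the list of improvements is getting away from you.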

My Personal Experience With AWS Well-Architected

In 2014, when Well-Architected was born, I was working as a Solution Architect at AWS. I participated in a few internal meetings around the initial concept. Key among the goals was ensuring that customers made the right choices so as not to hit issues during rare events such as an Availability Zone availability event (otherwise known inside AWS as a large-scale event).

As I started scaling the AWS practice and engineering at Akkodis, we used Well-Architected concepts to review workloads during initial design and deployment.

Several years later, while with Akkodis, I worked closely with some of the AWS team to contribute to the reference architecture for VPC design. I shared my reference VPC CloudFormation template, which implemented as many reliability, logging, and modernization options as were possible at the time, to define best practice.

Our focus on Well-Architected has meant that costs have stayed under control, and major incidents for others have been informational events for our clients.

Getting Help

Many AWS Partners in the AWS Cloud ecosystem offer Well-Architected Framework Reviews. These engagements are short (typically a week) and fixed cost, and they give you the benefit of experienced (and heavily certified) engineers providing insight into your cloud operations.

If you’re in the government/public sector, you may wish to have engineers who are national citizens, located in-country (and indeed, on-site with you), and with domestic security clearances.

Akkodis has been offering this service to our clients worldwide for some time, leveraging our expert consultant engineers from across the globe. You can read more about the Akkodis offerings and obtain an info sheet (English, Français, Deutsch, Italiano, Japanese/日本語).

These external reviews often clarify and reassure you that existing delivery teams can address technical debt and optimize your workloads.

If you have inherited a workload from a merger, had a technical team disappear, or just want a point-in-time set of fresh eyes to verify you’re keeping up, then reach out to Akkodis in your country (or fill in your details on the info sheet form in your preferred language above).

At the end of the review, you can have your existing teams implement the suggested remediations, or you can engage a consulting services organization, such as Akkodis, to guide and assist with the remediations (using on-site, on-shore, near-shore, or off-shore engineers; your choice).

Take the next step and contact us so we can help you optimize your AWS cloud solutions.