Bruce Haydon
8 min read · Jun 5, 2022


Designing Resilient Cloud Architectures

[DRAFT]

(1) Multi-Tiered Architecture

Solutions should be developed by evaluating access patterns, determining scaling strategies for each component, identifying database needs, and selecting appropriate compute and storage services in the cloud environment.

Access Patterns

Incorporating access patterns refers to designing with regard to who and what needs to access the particular architecture, and how they will access it. Consideration must be given to where connections and requests are coming from. Architecture builds can change drastically between access provided via the public internet, as with a public-facing web application, and requests made solely by on-premises users accessing through a virtual private network (VPN).

The temporal component must be a consideration as well. This involves determining or projecting what the access pattern looks like from hour to hour and day to day, and determining the shape of traffic over different periods. Another factor relates to consistency in traffic or transaction size, and whether the data involved in transactions varies or is mostly consistent. Does your application need to be ready for large volumes of unexpected traffic? Access patterns play a large part in determining how to design and update your architectures.

Scaling

A closely related topic is scaling, or more specifically, the consideration and determination of scaling strategies in terms of the component architecture. Decisions must be made about how scaling can occur at each level of a multi-tiered architecture, which may involve the use of both horizontal and vertical scaling methods. There are different methods used to implement scaling, and selection should also give consideration to:

  • how the various services handle elasticity,
  • methods to automate scaling needs (see the sketch after this list), and
  • service limitations of each — the number of instances, VPCs, etc. that can be concurrently deployed before performance or stability issues become identifiable.
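
As a minimal sketch of the automation point above (using Python and boto3, and assuming an existing Auto Scaling group named web-tier-asg and a 50% CPU target, both invented for this example), a target-tracking policy lets the fleet grow and shrink around a KPI without manual intervention. This is illustrative only, not a production configuration.

```python
# Hypothetical sketch: attach a target-tracking scaling policy to an
# existing Auto Scaling group so capacity follows average CPU load.
# The group name "web-tier-asg" and the 50% target are assumptions.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier-asg",        # assumed, pre-existing ASG
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,                    # scale to hold ~50% average CPU
    },
)
```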

Requirements-Driven Design

Naturally, any design or architecture should focus primarily on the actual requirements. The selection of database, storage, and compute services and implementations for the system being considered will change as requirements change. From this comes a number of questions:

  • will the data storage component use a database running on instances, or a managed service like Amazon Relational Database Service or Amazon DynamoDB?
  • if instances are chosen, how do the various instance families affect what server environment will be used, and how they will be deployed?

In the AWS ecosystem there are the options of instance-based storage as well as Amazon Elastic Block Store. In terms of storage services, there are also many options, including Amazon Simple Storage Service and Amazon S3 Glacier.

(2) Highly-Available and Fault-Tolerant Architectures

This area covers the evaluation of fault-tolerant resource needs, failure mitigation, and disaster recovery strategies, along with several other key areas.

Fault-Tolerant vs Highly-Available

It’s important to understand the distinction between fault-tolerant and highly available architectures. While there is definitely some overlap between the two concepts, they do vary slightly in definition.

  • Designing for high availability means designing for minimal downtime. This style focuses on reducing the negative impacts to the end user by prioritizing restoration of essential services when a component or application fails.
  • Designing for fault tolerance means having an architecture that allows for zero downtime and no service interruption when components fail. It generally means higher costs due to significantly more replication and redundancy. There is also considerably more complexity involved because of the requirement to manage data replication and component redundancy.

While there are differences between high availability and fault tolerance, know that they are not mutually exclusive.

Resource Requirements

Irrespective of the style of architecture, there are other considerations that need to be determined. One of those determinations is evaluating what resources are needed. This includes the redundancy of your normal operational components, such as servers and databases, and the use of the AWS global infrastructure. It also includes components needed to handle data replication, traffic management, failure detection, and everything else necessary for a highly available or a fault-tolerant architecture.

In the AWS world, there is a requirement to evaluate which AWS services you can use to improve the reliability of your architectures. This is especially important with legacy applications, where component migrations are not possible. If components or applications cannot be run in AWS, or you have reasons why the go-to services for something like data replication don’t fit your use case, you need to know how to meet reliability requirements.

Determine Single Points of Failure (SPOFs)

It is critical when designing highly available configurations to mitigate single points of failure, as well as to select appropriate disaster recovery strategies that meet your organization’s or business’s requirements. When trying to determine single points of failure, it often helps to work backwards from the failure scenario itself. One approach is to take each component and assess what happens when that component fails: what is the scenario when one of the application’s web servers fails, when one of the application servers goes down, and so on? This is a fundamental form of scenario analysis similar to what’s performed in the risk management field. The system architect must evaluate these failures, then determine how they will be mitigated.
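
As a purely illustrative sketch of that "work backwards from failure" exercise, the snippet below walks a hypothetical component inventory and flags anything without redundancy. The component names and replica counts are invented for the example.

```python
# Hypothetical component inventory: name -> number of independent replicas.
# Any component with a single replica is a single point of failure (SPOF).
components = {
    "load_balancer": 2,
    "web_server": 3,
    "app_server": 2,
    "primary_database": 1,   # only one copy -> SPOF
    "cache_node": 1,         # only one copy -> SPOF
}

def find_spofs(inventory: dict) -> list:
    """Return components that would take the system down if they failed."""
    return [name for name, replicas in inventory.items() if replicas < 2]

for spof in find_spofs(components):
    print(f"Mitigation needed: '{spof}' has no redundancy")
```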

Disaster Recovery Strategies

In determining the requirements for a disaster recovery strategy, it’s common to look at the disaster recovery objectives. The recovery time objective (RTO) and the recovery point objective (RPO) will inform the decision on which strategies will work best.

Two types of disaster recovery strategies are “active/active” and “active/passive”, and each impacts RPO and RTO requirements in different ways.

  • “active/passive”: backup & restore, pilot light, and warm standby solutions
  • “active/active”: multi-site active/active deployment solutions
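
As a rough, assumption-laden sketch of how RTO and RPO targets might map onto these strategies, the snippet below encodes one common rule of thumb. The hour thresholds are illustrative, not AWS guidance.

```python
# Illustrative mapping of recovery objectives to DR strategies.
# The thresholds below are assumptions for the example, not prescriptive values.
def suggest_dr_strategy(rto_hours: float, rpo_hours: float) -> str:
    if rto_hours < 0.25 and rpo_hours < 0.25:
        return "multi-site active/active"
    if rto_hours < 1:
        return "warm standby (active/passive)"
    if rto_hours < 8:
        return "pilot light (active/passive)"
    return "backup & restore (active/passive)"

print(suggest_dr_strategy(rto_hours=0.5, rpo_hours=0.1))   # warm standby
print(suggest_dr_strategy(rto_hours=24, rpo_hours=24))     # backup & restore
```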

Key Performance Indicators

Handling high availability and fault tolerance manually is nearly impossible; to automate the required actions, you need to know what triggers to use and which indicators will kick off those triggers. Understanding which key performance indicators (KPIs) give you the best visibility into the system architecture’s health is important to determining when to take action. To detect anomalies in an operating environment, monitoring and benchmarking are the primary ways to develop confidence intervals around what “normal operation” looks like.
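
As one minimal, hypothetical example of turning a KPI into a trigger, the sketch below creates a CloudWatch alarm on average EC2 CPU utilization and points it at an assumed SNS topic that could notify operators or kick off automation. The names, threshold, and topic ARN are placeholders, not values from this article.

```python
# Hypothetical sketch: alarm when the web tier's average CPU stays above 80%
# for two consecutive 5-minute periods. All names and ARNs are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="web-tier-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-tier-asg"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # assumed topic
)
```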

(3) Loose Coupling & Decoupling Mechanisms

Resilient architectures should include services to assist in achieving loose coupling of components, as well as the use of serverless technologies.

Decoupling refers to components remaining autonomous and unaware of each other as they complete their work as part of a larger system. This can be used to describe components within a simple application, or can be applied at a much larger scale.

Synchronous vs Asynchronous Decoupling

There are two fundamental decoupling techniques: synchronous and asynchronous integration.

Synchronous decoupling generally involves at least two components, and requires all components involved to always be available in order for things to function properly. An example could be using AWS Elastic Load Balancing to distribute traffic between instances across multiple Availability Zones. The instances are unaware of each other and do not depend on each other. For this type of configuration to properly function, there would need to be at least one instance operating in each of the applicable Availability Zones. All of the instances are doing the same type of work, but they’re doing it independently and generally do not need to communicate with each other. As long as the load balancer and the instances are all running, functionality will continue as expected.
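
As a small, assumption-heavy sketch of the load-balanced pattern described above, the snippet below registers two instances (assumed to live in different Availability Zones) with an existing Application Load Balancer target group. The target group ARN and instance IDs are placeholders.

```python
# Hypothetical sketch: register instances from two Availability Zones with an
# existing ALB target group so the load balancer can spread traffic across AZs.
import boto3

elbv2 = boto3.client("elbv2")

elbv2.register_targets(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:"
                   "targetgroup/web-tier/0123456789abcdef",   # assumed ARN
    Targets=[
        {"Id": "i-0aaa111example"},   # instance assumed to be in us-east-1a
        {"Id": "i-0bbb222example"},   # instance assumed to be in us-east-1b
    ],
)
```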

With asynchronous decoupling, there is a multipoint connectivity matrix between the components and mechanisms so that communication between components can still be achieved even if one goes down. An example of this could be an architecture where instances all process messages and utilize a queue, such as Amazon Simple Queue Service, to handle the interprocess communication (messaging) between the instances. If one of the instances were to go offline, the messages would persist until an instance was available to retrieve them.
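
A minimal sketch of that queue-based pattern, using Python and boto3 against an assumed existing SQS queue URL: one side enqueues work, the other polls for it, and messages wait in the queue if no consumer is currently running.

```python
# Hypothetical sketch: producer and consumer decoupled by an SQS queue.
# The queue URL is a placeholder for an existing queue.
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"  # assumed

# Producer: enqueue a unit of work; it persists even if no consumer is up.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": 42}')

# Consumer: long-poll for work, process it, then delete it from the queue.
response = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
)
for message in response.get("Messages", []):
    print("processing:", message["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```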

AWS Queue Service Selection

A few quick notes on basic characteristics of AWS Queuing solutions:

  • Amazon Simple Queue Service (SQS) Standard Queue: this type of queue addresses the requirement for message persistence, and the front-end and back-end instances would be able to put and pull from the queue as needed. However, it should be noted that it does not maintain message ordering.
  • Amazon Simple Queue Service (SQS) FIFO Queue: this implementation of SQS utilizes a FIFO (first-in, first-out) queue, which means it maintains the ordering of messages (see the sketch after this list).
  • Amazon Simple Notification Service (SNS): while this service is able to handle the transfer of messages from one layer to the next, there are a few things that make it a lacking solution when compared to the SQS standard queue. (a) SNS does not guarantee message delivery, (b) SNS does not ensure message ordering, and (c) SNS is not a queue and does not have message persistence. If a message is sent, and even if it is delivered in the proper order, the instance could go down and the message would be lost.
  • Amazon Kinesis: used for collecting, processing, and analyzing data, making it ideally suited for ingesting streaming, real-time data. As a decoupling component, it is generally not as cost effective as utilizing one of the SQS variations.
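
For the FIFO case specifically, here is a brief hypothetical sketch showing the two parameters that matter for ordering: the queue name must end in .fifo, and messages sharing a MessageGroupId are delivered in order. The queue URL and IDs are placeholders.

```python
# Hypothetical sketch: sending ordered messages to an SQS FIFO queue.
# Messages within the same MessageGroupId are delivered first-in, first-out.
import boto3

sqs = boto3.client("sqs")
fifo_queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo"  # assumed

for step in ("created", "paid", "shipped"):
    sqs.send_message(
        QueueUrl=fifo_queue_url,
        MessageBody=f'{{"order_id": 42, "status": "{step}"}}',
        MessageGroupId="order-42",                   # ordering preserved within this group
        MessageDeduplicationId=f"order-42-{step}",   # or enable content-based dedup
    )
```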

Decoupling Using AWS Serverless Tools

Typical AWS tools used in the implementation of decoupled architectures include SQS, Amazon API Gateway, Amazon DynamoDB, and the many other services within the serverless toolkits available not only in AWS but also in the GCP and Azure cloud environments.
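
As one hedged illustration of wiring those serverless pieces together, the Lambda handler below (assuming an API Gateway proxy integration and a DynamoDB table named "orders", both invented for this example) accepts a request and persists it without the caller ever talking to the storage tier directly.

```python
# Hypothetical sketch: a Lambda function behind API Gateway that writes the
# request payload to a DynamoDB table, decoupling the client from storage.
# The table name "orders" and the event shape are assumptions for the example.
import json
import boto3

table = boto3.resource("dynamodb").Table("orders")

def handler(event, context):
    payload = json.loads(event.get("body") or "{}")   # API Gateway proxy event body
    table.put_item(Item={"order_id": payload["order_id"], "status": "received"})
    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```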

(4) Storage Resilience

This topic involves the process of defining a data-durability strategy, identifying the impact of service consistency on operations, analyzing how access requirements direct the selection of data services, and the use of storage services with hybrid and non-cloud-native applications.

Determining Resiliency Requirements

In choosing the most appropriate resilient storage for a system architecture, there is a high-level process that should be followed.

  • (a) Define the strategy to ensure the durability of data.
    Here it is important to understand how the various storage services handle durability, and which scenarios each of the different services is suited towards.
    In the AWS world, there are several options to consider: when is it better to use Amazon Elastic Block Store (EBS) over the Amazon EC2 instance store? In what circumstances is ephemeral storage ideal? Can Amazon Simple Storage Service (S3) and Amazon S3 Glacier be used interchangeably? Understanding how each AWS service manages data durability, and understanding the strategy and requirements for the system architecture, will be important to finding the correct solutions (a small lifecycle sketch follows this list).
  • (b) Identify how data service consistency will affect operations. Here it’s important to understand how data consistency is handled by AWS services, but also how calls to the service can be impacted.
  • (c) Select data services that meet the access and other requirements of the application. It’s very important to understand the functionality of the cloud provider (AWS/GCP/Azure) services, as well as how the access-pattern requirements will be impacted by that functionality. For instance, a read-heavy workload versus one requiring fast read and write capabilities: each will have a different optimal solution. From an access perspective, another criterion is whether the data needs to be accessible by a global user base or limited to a single IP range. File size is another, as one service may favor working with a lot of small files while another is better suited to moving larger objects.
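
As a small sketch of one durability-and-cost decision mentioned above (S3 versus S3 Glacier), the snippet below attaches a lifecycle rule to an assumed bucket so that objects under an archive/ prefix transition to Glacier after 90 days. The bucket name, prefix, and timing are placeholders.

```python
# Hypothetical sketch: lifecycle rule moving older objects from S3 to Glacier.
# Bucket name, prefix, and the 90-day threshold are assumptions for the example.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-archive-bucket",   # assumed bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-to-glacier",
                "Status": "Enabled",
                "Filter": {"Prefix": "archive/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```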

Hybrid and Non-Cloud Storage Services: While in a perfect world all systems would be cloud native, the reality is that it’s important to be able to identify storage services that can be used with hybrid or non-cloud native applications. There are a number of needs that you might be asked to meet, and there’s often overlap in the services that can meet those needs. For data migration there are many different options within the AWS Snow family of services. For data transfer, there are a number of purpose-built services (such as AWS Storage Gateway solutions) for use when consistent and regular data transfer between on-premises and AWS environments is required.

The system architect must understand the functionality of the available cloud storage services, see how they fit various use cases, and ensure the needs of hybrid environments are also understood.

DRAFT — WIP v2 — Bruce Haydon, New York ©2021, 2022

Bruce Haydon works at the intersection of risk, technology, and the treasury function in the financial services industry. Any code samples referenced in this reading can be found in the following Repository
