We were facing issues with our existing AWS infrastructure when large I/O requests were received. Let's dive deep into why our existing architecture could not cope, and what changes we made to make our infrastructure more stable and available to our end customers.
Our IoT application's AWS infrastructure was designed using EC2, VPC, S3, CloudFormation, Kinesis, and Elastic Load Balancer for the microservices, web application, and database. Nearly 40 EC2 instances in the production environment communicate with each other for data transmission. Mesos DC/OS was used for microservices container orchestration, RabbitMQ for data configuration, and Datadog for monitoring.
For a few microservices, data transmission and stability were causing issues, specifically when a user tried to fetch a large amount of data over a long duration. Devices were able to push data to the database, but loading and displaying that data caused data loss or service failures. The higher I/O of these microservices drove up CPU and memory usage, which triggered the load balancer to scale out and ultimately increased our bill.
We decided to redesign our orchestration and find an alternative to Apache Mesos. Docker Swarm and Kubernetes are the leading, most widely used container orchestration tools for DevOps infrastructure management.
Before exploring Docker Swarm and Kubernetes, we brainstormed and documented how we were using Mesos.
Apache Mesos gives the ability to run both containerized and non-containerized services in a distributed manner. Mesos is designed around a distributed kernel, so programs can be written directly against datacenter-level APIs. In our case, Mesos DC/OS was configured as master/slave, and database requests were managed based on incoming requests. On service failure, the Mesos master never restarted services automatically, which increased application downtime.
Challenges with Mesos
Our existing infrastructure had frequent service failures, which caused unavailability for end users, data loss, and higher AWS billing.
· Existing Infrastructure and Orchestration
o Cloud: AWS
o CI/CD: Jenkins
o Programming Languages: Python, Java, C, C++, etc.
o Source Code: GitHub
o Deployment strategy: Automation + Manual
o Infrastructure Monitoring: Automation + Manual (execution of validation steps at regular intervals)
· Current Strategy and Tools
o EC2 Auto Scaling Groups
o Scaling based on CPU usage
o DC/OS microservices on EC2
o Notifications on Slack and via call/email
o Other tools/services: Splunk, Looker, HAProxy, S3, Graphite, Grafana
· Challenges
o CPU usage fluctuates based on customer and product usage
o Frequent failure of services even after auto scaling
o Frequent downtime
o Frequent patches
o End customers concerned about data loss due to stability and availability issues
o High AWS billing due to multiple EC2 instances
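For context, the CPU-based scaling mentioned above can be expressed as a target-tracking policy in CloudFormation. This is a minimal sketch; the `AppAutoScalingGroup` reference and the 60% target are illustrative, not our actual values:

```yaml
Resources:
  CpuScalingPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      # "AppAutoScalingGroup" is a hypothetical ASG defined elsewhere in the stack.
      AutoScalingGroupName: !Ref AppAutoScalingGroup
      PolicyType: TargetTrackingScaling
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        TargetValue: 60.0   # add/remove instances to hold average CPU near 60%
```

As the challenges list shows, this kind of purely CPU-driven scaling was not enough: services still failed under fluctuating load even after scaling out.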
Docker Swarm
Docker Swarm uses the Docker API and networking model, so it is easy to configure and use, and its architecture handles failures well. In Docker Swarm, new nodes can join an existing cluster as workers or managers. However, Docker Swarm does not allow integration of third-party logging tools, and compared to Kubernetes, easy integration with cloud service providers such as AWS, Azure, and Google Cloud is not available.
Kubernetes
Kubernetes is easy to configure and lightweight. In case of service failure, Kubernetes restarts and autoscales workloads to keep the service available. Kubernetes is versatile and widely used, and the major cloud providers offer managed control-plane (master) support for it.
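As a sketch of how Kubernetes keeps a service available: a Deployment with a liveness probe lets the kubelet restart a failed container automatically, while the Deployment controller keeps the desired replica count running. The service name, image, and port below are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: telemetry-api              # hypothetical microservice
spec:
  replicas: 2                      # Kubernetes keeps two pods running at all times
  selector:
    matchLabels:
      app: telemetry-api
  template:
    metadata:
      labels:
        app: telemetry-api
    spec:
      containers:
        - name: telemetry-api
          image: example.com/telemetry-api:1.0   # placeholder image
          livenessProbe:           # restart the container if this check fails
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
```

This is exactly the behavior our Mesos setup lacked: failed services were not restarted automatically.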
As AWS provides a managed platform for the Kubernetes control plane, we decided to go with EKS.
The Amazon EKS pricing model asks users to bear an additional cost of $0.20 per hour for each EKS cluster. This gave us pause, but when we compared the benefits, it was not as bad as it sounds: on a single cluster we could design and deploy multiple applications with different namespaces and VPC ranges.
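For example, two applications can share one cluster while staying logically isolated in separate namespaces (the names are illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: iot-ingest      # hypothetical application namespace
---
apiVersion: v1
kind: Namespace
metadata:
  name: iot-dashboard   # hypothetical application namespace
```

Each application's workloads are then deployed with `kubectl apply -n <namespace>`, so a single cluster (and a single $0.20/hour charge) serves multiple applications.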
We initiated the process with one cluster, migrated one service, and validated stability on both Docker Swarm and Amazon EKS. As the rest of the infrastructure was already on AWS, we found that the Docker Swarm configuration would be time-consuming and would require significant effort to monitor and manage. With EKS, we received support and guidance from Amazon on designing and deploying services and on reducing costs, so we decided to go with EKS.
Migrating to Kubernetes from Mesos
For environment creation, mapping, and deployment on EKS, we used CloudFormation (YAML) templates.
CloudFormation: AWS CloudFormation provides a graphical and YAML-based interface to create, manage, and modify a large number of AWS resources and map their dependencies. As CloudFormation is an AWS service, new AWS services become available to use in it quickly.
Alternatives such as Terraform, which is open source and supports the major cloud platforms for infrastructure as code, are available, but we chose CloudFormation as everything we have is on AWS.
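A minimal sketch of the EKS cluster resource in such a CloudFormation template — the cluster name, IAM role, and subnet references are hypothetical and would be defined elsewhere in the stack:

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Minimal EKS cluster sketch
Resources:
  EksCluster:
    Type: AWS::EKS::Cluster
    Properties:
      Name: iot-app-cluster                 # hypothetical cluster name
      RoleArn: !GetAtt EksServiceRole.Arn   # service IAM role defined elsewhere
      ResourcesVpcConfig:
        SubnetIds:
          - !Ref PrivateSubnetA             # subnets defined elsewhere
          - !Ref PrivateSubnetB
```

A full template would also declare the node groups, IAM roles, and VPC mappings alongside this resource, which is how we kept environment creation repeatable.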
· How EKS Helped:
o Reduced AWS billing
o Fewer EC2 instances
o Auto scaling using EKS
o EKS monitoring and alerting services
· New Infrastructure:
o Reduced EC2 instances from 15 medium to 3 large
o Removed Graphite
o Autoscaling using EKS
o Reduced Datadog and PagerDuty alert configuration cost and complexity
o Prometheus + Grafana based alert configuration
Datadog: We configured Datadog with the CloudWatch integration to monitor EC2 instances and connected AWS services. We installed the Datadog Agent on instances to collect system-level metrics at 15-second intervals for memory, CPU, storage, disk I/O, network, etc.
Prometheus + Grafana: For additional alerting and monitoring of the Kubernetes cluster, we configured Prometheus + Grafana.
Prometheus captures and retains data about pods, containers, systemd services, etc. We can use this data to analyze the stability and behavior of the application and environment.
Grafana uses the data stored by Prometheus and gives graphical presentations of statistics, along with alert configuration, for easy assessment.
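For instance, a Prometheus alerting rule like the following can flag pods that restart repeatedly; the thresholds are illustrative, and `kube_pod_container_status_restarts_total` is exposed by kube-state-metrics:

```yaml
groups:
  - name: pod-stability
    rules:
      - alert: PodRestartingFrequently
        # More than 3 container restarts in the last 15 minutes
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} in {{ $labels.namespace }} is restarting frequently"
```

Rules like this feed the Grafana dashboards and alert notifications described above, giving early warning of the kind of instability we used to discover only through customer escalations.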
Post Migration Best Practices
· Maintain MTTR (Mean Time to Respond/Resolve)
· List critical conditions and report them
· Immediate actions
· Incident reporting
· Root cause analysis
· Continuous improvement of defined processes
Strategy to Achieve:
· MANUALLY:
o Perform validation steps at regular intervals
o Debug when unexpected behavior is observed
o Follow the defined steps of the runbook
o Call or email the Dev support team if not resolved in the stipulated time
o Restart services if needed, after capturing logs of the existing failure
· AUTOMATION UTILITIES:
o Continuous execution of defined validation tools using Jenkins + Selenium/Dynatrace
o Enhancing the validation coverage of Python scripts
o Notifications on a Slack channel
o PagerDuty
· ACTIONS:
o Email if not resolved within 15 minutes
o Escalate to Level 4 if not resolved within one hour
o Escalate to Level 5 if still not resolved
o Get the environment up and running
· BEST PRACTICES:
o Observe the environment for a few hours
o Create a root cause analysis document
o Get approval of the identified root cause analysis from the Dev team
o Gather resolution information from the Dev team
o Gather immediate actions to minimize downtime if the same RCA is observed in the future
o Update the runbook for future reference
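The escalation and notification steps above can be sketched as a small Python utility of the kind our validation scripts use. Everything here is illustrative: the thresholds mirror the runbook timings, but the function names, the Level 5 cutoff, and the Slack webhook usage are hypothetical, not our actual tooling.

```python
import json
import urllib.request

# Hypothetical escalation thresholds in minutes, mirroring the runbook above.
ESCALATION_LEVELS = [
    (15, "email"),    # email if not resolved within 15 minutes
    (60, "level-4"),  # escalate to Level 4 after one hour
]

def escalation_action(minutes_unresolved: int) -> str:
    """Return the escalation action for an incident open this many minutes."""
    action = "monitor"
    for threshold, level in ESCALATION_LEVELS:
        if minutes_unresolved >= threshold:
            action = level
    # The runbook gives no fixed time for Level 5; two hours is an assumed cutoff.
    if minutes_unresolved >= 120:
        action = "level-5"
    return action

def notify_slack(webhook_url: str, message: str) -> None:
    """Post an incident notification to a Slack channel via an incoming webhook."""
    payload = json.dumps({"text": message}).encode()
    req = urllib.request.Request(
        webhook_url, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)
```

In practice a Jenkins job runs the validation checks on a schedule, calls something like `escalation_action()` on any open incident, and pushes the result to Slack and PagerDuty.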
Benefits and Applications
· AWS billing reduced by ~40% in our case, as the EC2 count dropped from 15 to 3
· Automatic service restarts based on the scaling configuration helped keep the application available
· Data loss and end-customer escalations reduced
· More advanced monitoring helped DevOps engineers identify root causes quickly
Conclusion
In our case, we found EKS more helpful: after the change in orchestration, our application became noticeably more stable. With EKS we observed service stability, auto scaling, and load balancing, which helped us retain product availability. That said, both Kubernetes and Mesos can deploy applications as containers in the cloud, and the right solution may vary with application needs.