Monday, January 4, 2021

Moving Microservices from Mesos DCOS to Kubernetes

We were facing issues with our existing AWS infrastructure when it received large I/O requests. Let's deep dive into why the existing architecture could not cope, and what changes we made to make our infrastructure more stable and available to our end customers.

Our AWS infrastructure for an IoT application was designed using EC2, VPC, S3, CloudFormation, Kinesis, and Elastic Load Balancer for microservices, a web application, and a database. Nearly 40 EC2 instances in the production environment communicated with each other for data transmission. Mesos DC/OS was used for microservice container orchestration, RabbitMQ for data configuration, and Datadog for monitoring.

For a few microservices, data transmission and stability were problematic, specifically when a user tried to fetch a large amount of data over a long duration. Devices could push data to the database, but loading and displaying that data caused data loss or service failures. The high I/O of these microservices drove up CPU and memory usage, which triggered the load balancer and ultimately increased our bill.

We decided to redesign our orchestration and find an alternative to Apache Mesos. Docker Swarm and Kubernetes are the leading, most widely used container orchestration tools for DevOps infrastructure management.

Before exploring Docker Swarm and Kubernetes, we brainstormed and documented how we were using Mesos.

Apache Mesos can run both containerized and non-containerized workloads in a distributed manner. Mesos is designed as a distributed kernel, so programs can be written directly against the datacenter through its APIs. In our case, Mesos DC/OS was configured as master/slave, and database requests were managed based on load. On service failure, the Mesos master never restarted services automatically, which increased application downtime.

Challenges with Mesos

Our existing infrastructure had frequent service failures, which caused downtime for end users, data loss, and higher AWS billing.

·        Existing Infrastructure and Orchestration

o   Cloud: AWS

o   CI/CD: Jenkins

o   Programming Languages: Python, Java, C, C++, etc.

o   Source Code: GitHub

o   Deployment strategy: Automation + Manual

o   Infrastructure Monitoring: Automation + Manual (execution of validation steps at regular intervals)

 

·        Current Strategy and Tools:

o   EC2 Auto Scaling Groups

o   Scaling based on CPU usage

o   DCOS Microservices on EC2

o   Notifications on Slack and via call/email

o   Other Tools/services: Splunk, Looker, HA Proxy, S3, Graphite, Grafana

 

·        Challenges

o   CPU usage fluctuates based on customer and product usage

o   Frequent failure of services even after auto scaling

o   Frequent Downtime

o   Frequent patches

o   End customers concerned about data loss due to stability and availability issues

o   High AWS Billing due to multiple EC2 Instances

 

Docker Swarm

Docker Swarm uses the Docker API and networking model, so it is easy to configure and use. Its architecture handles failures well, and new nodes can join an existing cluster as workers or managers. However, Docker Swarm does not allow third-party logging tools to be integrated, and its integration with cloud providers such as AWS, Azure, and Google Cloud is not as smooth as Kubernetes'.

Kubernetes

Kubernetes is easy to configure and lightweight. In case of service failure, Kubernetes restarts and autoscales workloads to keep services available. Kubernetes is versatile and widely used, and the major cloud providers offer managed control planes for it.
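As an illustration of the auto-restart behavior we relied on, here is a minimal Deployment sketch; the service name, image, and probe path below are hypothetical. The kubelet restarts any container whose liveness probe keeps failing:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: device-data-api          # hypothetical service name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: device-data-api
  template:
    metadata:
      labels:
        app: device-data-api
    spec:
      containers:
        - name: api
          image: registry.example.com/device-data-api:1.0   # placeholder image
          livenessProbe:          # repeated probe failures trigger a container restart
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
```

This is exactly the behavior we were missing on Mesos, where a failed service stayed down until someone intervened.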

As AWS provides a managed platform for the Kubernetes control plane, we decided to go with EKS.



Amazon EKS's pricing model adds a cost of $0.20 per hour for each EKS cluster. This gave us pause, but when we compared the benefits, it was not as bad as it sounds: we designed and deployed multiple applications with different namespaces and VPC ranges on a single cluster.
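To put the flat control-plane fee in perspective, here is a quick back-of-the-envelope calculation using the $0.20/hour figure above (instance costs excluded); sharing one cluster across applications amortizes the fee:

```python
# Rough cost sketch: the EKS control-plane fee quoted in the post is
# $0.20 per cluster-hour. Running several applications in separate
# namespaces on one cluster amortizes that flat fee.

EKS_HOURLY_FEE = 0.20   # USD per cluster per hour (figure from the post)
HOURS_PER_MONTH = 24 * 30

def monthly_control_plane_cost(clusters: int) -> float:
    """Flat EKS control-plane cost for a number of clusters."""
    return clusters * EKS_HOURLY_FEE * HOURS_PER_MONTH

def per_app_cost(apps_on_shared_cluster: int) -> float:
    """Control-plane cost per application when apps share one cluster."""
    return monthly_control_plane_cost(1) / apps_on_shared_cluster

print(monthly_control_plane_cost(1))  # 144.0 USD/month for one cluster
print(per_app_cost(4))                # 36.0 USD/month per app when 4 apps share it
```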

We initiated the process with one cluster, migrated one service, and validated stability on both Docker Swarm and Amazon EKS. As the rest of our infrastructure was already on AWS, we found that the Docker Swarm configuration would be time consuming and would require significant effort to monitor and manage.

With EKS, we received support and guidance from Amazon on designing and deploying services, along with advice on reducing costs, so we decided to go with EKS.

Migrating to Kubernetes from Mesos

For environment creation, mapping, and deployment on EKS we used CloudFormation (YAML) templates.

CloudFormation: AWS CloudFormation provides a graphical and YAML-based interface to create, manage, and modify large numbers of AWS resources and map their dependencies. As CloudFormation is an AWS service, new AWS services become available in it quickly.
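A minimal sketch of the kind of CloudFormation template used for cluster creation; the cluster name, role ARN, and subnet IDs below are placeholders:

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Minimal EKS cluster sketch (name, role, and subnets are placeholders)
Resources:
  AppCluster:
    Type: AWS::EKS::Cluster
    Properties:
      Name: iot-app-cluster                                      # hypothetical name
      RoleArn: arn:aws:iam::123456789012:role/eks-cluster-role   # placeholder role
      ResourcesVpcConfig:
        SubnetIds:
          - subnet-aaaa1111                                      # placeholder subnets
          - subnet-bbbb2222
```

A real template would also declare the node groups, VPC, and IAM resources the cluster depends on.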

Options such as Terraform, which is open source and supports all major cloud platforms for infrastructure as code, were available, but we used CloudFormation since everything we have runs on AWS.

·        How EKS Helped:

o   AWS billing can be reduced by using EKS

o   Fewer EC2 instances

o   Auto scaling using EKS

o   EKS monitoring and alerting services

 

·        New Infrastructure:

o   Reduced EC2 instances from 15 medium to 3 large

o   Removed Graphite

o   Autoscaling using EKS

o   Reduced Datadog and PagerDuty alert configuration cost and complexity

o   Prometheus + Grafana based Alert configuration
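The EKS-based autoscaling listed above can be expressed as a HorizontalPodAutoscaler. A sketch with illustrative names and thresholds, scaling a deployment between 2 and 10 replicas on CPU utilization:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: device-data-api-hpa      # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: device-data-api        # the deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU exceeds 70%
```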

Datadog: We configured Datadog as an extension of CloudWatch for monitoring EC2 instances and connected AWS services. We installed the Datadog Agent on the instances to collect system-level metrics at 15-second intervals for memory, CPU, storage, disk I/O, network, etc.
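The per-host agent setup amounts to a small configuration file (`datadog.yaml`); a minimal sketch, with the API key left as a placeholder and the tags purely illustrative:

```yaml
api_key: "<YOUR_DATADOG_API_KEY>"   # placeholder; never commit real keys
site: datadoghq.com
tags:
  - env:production                  # illustrative tags
  - team:devops
```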

Prometheus + Grafana: For additional alert and monitoring of the Kubernetes cluster, we configured Prometheus + Grafana.

Prometheus helps capture and retain data on pods, containers, systemd services, etc. We use these data to analyze the stability and behavior of the application and environment.

Grafana uses the data stored by Prometheus and provides graphical presentations of statistics and alert configuration for easy assessment.
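A Prometheus alerting rule of the kind we mean might look like this sketch; the group name and threshold are illustrative, and the metric comes from kube-state-metrics:

```yaml
groups:
  - name: pod-health               # illustrative rule group
    rules:
      - alert: PodRestartingTooOften
        # fire when a container restarted more than 3 times in 15 minutes
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} restarted more than 3 times in 15m"
```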

Post Migration Best Practices

·        Maintain MTTR (Mean time to Respond/Resolve)

·        List critical conditions and report them

·        Immediate actions

·        Incident Reporting

·        Root cause analysis

·        Continuous improvement of defined processes
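Tracking MTTR can be as simple as averaging resolution times over incident timestamps; a minimal Python sketch (the sample incidents are made up):

```python
from datetime import datetime

def mttr_minutes(incidents):
    """Mean time to resolve, in minutes, over (reported, resolved) pairs."""
    durations = [(resolved - reported).total_seconds() / 60
                 for reported, resolved in incidents]
    return sum(durations) / len(durations)

# Made-up incidents: resolved in 45 and 15 minutes respectively.
incidents = [
    (datetime(2021, 1, 4, 10, 0), datetime(2021, 1, 4, 10, 45)),
    (datetime(2021, 1, 5, 14, 0), datetime(2021, 1, 5, 14, 15)),
]
print(mttr_minutes(incidents))  # 30.0
```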

 

Strategy to Achieve:

·        MANUALLY:

·        Perform validation steps at regular intervals

·        Debug when unexpected behavior is observed

·        Follow the defined steps of the runbook

·        Call or Email Dev Support Team if not resolved in stipulated time

·        Restart services if needed, after capturing logs of the failure

 

·        AUTOMATION UTILITIES:             

·        Continuous execution of defined validation tools using Jenkins + Selenium/Dynatrace

·        Enhancing validation steps coverage of Python scripts

·        Notification on Slack channel

·        PagerDuty alerts
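Slack notifications from validation jobs can go through an incoming webhook; a minimal Python sketch, where the webhook URL and service name are placeholders:

```python
import json
from urllib import request

# Placeholder URL: replace with a real Slack incoming-webhook URL.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def build_alert_payload(service: str, status: str) -> dict:
    """Slack incoming-webhook payload for a validation result."""
    return {"text": f":rotating_light: {service} validation {status}"}

def notify(service: str, status: str) -> None:
    """POST the payload to the webhook (requires a real URL to actually send)."""
    data = json.dumps(build_alert_payload(service, status)).encode()
    req = request.Request(SLACK_WEBHOOK_URL, data=data,
                          headers={"Content-Type": "application/json"})
    request.urlopen(req)

payload = build_alert_payload("device-data-api", "FAILED")
print(payload["text"])
```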

 

·        ACTIONS:

·        Email if not resolved within 15 min

·        Escalate to Level 4 if not resolved within one hour

·        Escalate to Level 5 if not resolved

·        Get the environment up and running

 

·        BEST PRACTICES

·        Observe the environment for a few hours

·        Create a root cause analysis document

·        Get Approval of identified root cause analysis from the Dev team

·        Gather resolution information from Dev Team

·        Document immediate actions to take if the same root cause is observed again, to minimize downtime

·        Update runbook for future reference

Benefits and Applications

·        AWS billing reduced by ~40% in our case, as the EC2 instance count dropped from 15 to 3

·        Automatic service restarts based on the scaling configuration improved application availability

·        Data loss and end-customer escalations reduced

·        More advanced monitoring, which helped DevOps engineers identify root causes quickly
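As a rough sanity check on the savings, here is a sketch with assumed on-demand rates for medium and large instances (actual rates vary by family and region). Under these assumed rates, instance consolidation plus the EKS fee cuts this line by about 28%; the rest of the ~40% reduction came from items like removing Graphite and simplifying Datadog/PagerDuty alerting:

```python
# Back-of-the-envelope cost check. Rates below are ASSUMED on-demand
# prices for illustration only (actual prices vary by family/region).
RATE_MEDIUM = 0.0416  # USD/hr per medium instance (assumed)
RATE_LARGE = 0.0832   # USD/hr per large instance (assumed)
EKS_FEE = 0.20        # USD/hr per EKS cluster (figure from the post)
HOURS = 24 * 30       # hours in a ~30-day month

before = 15 * RATE_MEDIUM * HOURS            # 15 medium instances under Mesos
after = (3 * RATE_LARGE + EKS_FEE) * HOURS   # 3 large instances + EKS fee
savings = (before - after) / before          # ~28% with these assumed rates

print(f"before ${before:.0f}/mo, after ${after:.0f}/mo, saved {savings:.0%}")
```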

Conclusion

 

In our case, we found EKS more helpful: the application became noticeably more stable after the change in orchestration. With EKS, we observed service stability, auto scaling, and load balancing, which helped us maintain product availability. Both Kubernetes and Mesos can deploy applications as containers in the cloud; depending on the needs of the application, the right solution may vary.
