Monday, January 4, 2021

Moving Microservices from Mesos DC/OS to Kubernetes

We were facing issues with our existing AWS infrastructure whenever large I/O requests came in. Let's dive deep into why the existing architecture could not cope, and what changes we made to make our infrastructure more stable and available to our end customers.

Our AWS infrastructure for an IoT application was designed using EC2, VPC, S3, CloudFormation, Kinesis, and Elastic Load Balancer for the microservices, web application, and database. Nearly 40 EC2 instances in the production environment communicated with each other for data transmission. Mesos DC/OS was used for microservice container orchestration, RabbitMQ for data configuration, and Datadog for monitoring.

For a few microservices, data transmission and stability were causing issues, specifically when a user tried to fetch a large amount of data over a long duration. Devices were able to push data to the database, but loading and displaying that data caused data loss or service failures. The higher I/O of these microservices drove higher CPU and memory usage, which kept engaging the load balancer and ultimately caused higher billing.

We decided to redesign our orchestration and find an alternative to Apache Mesos. Docker Swarm and Kubernetes are the leading and most widely used container orchestration tools for DevOps infrastructure management.

Before exploring Docker Swarm and Kubernetes, we brainstormed and defined how we were using Mesos.

Apache Mesos gives the ability to run both containerized and non-containerized services in a distributed manner. Mesos is designed as a distributed kernel, so APIs can be programmed against the datacenter directly. In our case, Mesos DC/OS was configured as master/slave, and database requests were managed per incoming request. On service failure, the Mesos master never restarted services automatically, which increased application downtime.

Challenges with Mesos

The existing infrastructure had frequent service failures, which caused unavailability for end users, data loss, and higher AWS billing.

·        Existing Infrastructure and Orchestration

o   Cloud: AWS

o   CI/CD: Jenkins

o   Programming Languages: Python, Java, C, C++, etc.

o   Source Code: GitHub

o   Deployment strategy: Automation + Manual

o   Infrastructure Monitoring: Automation + Manual (execution of validation steps at regular intervals)

 

·        Current Strategy and Tools:

o   EC2 Auto Scaling Groups

o   Scaling based on CPU usage

o   DCOS Microservices on EC2

o   Notifications on Slack and via call/email

o   Other Tools/services: Splunk, Looker, HAProxy, S3, Graphite, Grafana

 

·        Challenges

o   CPU usage fluctuates based on customer and product usage

o   Frequent failure of services even after auto scaling

o   Frequent Downtime

o   Frequent patches

o   End customers concerned about data loss because of stability and availability issues

o   High AWS Billing due to multiple EC2 Instances

 

Docker Swarm

Docker Swarm uses the Docker API and networking concepts, so we can configure and use it easily. Its architecture handles failures robustly, and new nodes can join an existing cluster as workers or managers. However, Docker Swarm does not allow integration of third-party logging tools, and easy integration with cloud service providers such as AWS, Azure, and Google Cloud is not available compared to Kubernetes.

Kubernetes

Kubernetes is easy to configure and lightweight. In case of service failure, Kubernetes autoscales and keeps the service available. Kubernetes is versatile and widely used, and the major cloud providers offer managed Kubernetes master (control plane) support.
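To make the self-healing concrete, below is a minimal sketch of the kind of Kubernetes Deployment that gives this behavior; the service name, image, and health endpoint are hypothetical placeholders, not our actual manifests:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: data-api                          # hypothetical service name
spec:
  replicas: 3                             # Kubernetes keeps 3 copies running at all times
  selector:
    matchLabels:
      app: data-api
  template:
    metadata:
      labels:
        app: data-api
    spec:
      containers:
        - name: data-api
          image: example.com/data-api:1.0   # placeholder image
          ports:
            - containerPort: 8080
          livenessProbe:                    # a failing probe triggers an automatic restart
            httpGet:
              path: /healthz                # hypothetical health endpoint
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15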

As AWS provides a managed platform for the Kubernetes master, we decided to go with Amazon EKS.



The Amazon EKS pricing model asks users to bear an additional cost of 0.20 dollars per hour for each EKS cluster, which works out to roughly 0.20 × 24 × 30 ≈ 144 dollars per month per cluster. This gave us pause, but when we compared the benefits, it was not as bad as it sounds: as a user, we could design and deploy multiple applications with different namespaces and VPC ranges on a single cluster.

We initiated the process with one cluster, migrated one service, and validated stability on both Docker Swarm and Amazon EKS. As the rest of the infrastructure was already on AWS, we found that the Docker Swarm configuration would be time-consuming and would require much effort to monitor and manage.

With EKS, we received support and guidance from Amazon on designing and deploying services, along with ways to reduce costs, so we decided to go with EKS.

Migrating to Kubernetes from Mesos

For environment creation, mapping, and deployment on EKS we used CloudFormation (YAML) templates.

CloudFormation: AWS CloudFormation provides a graphical and YAML-based interface to create, manage, and modify large numbers of AWS resources and to map their dependencies. As CloudFormation is a native AWS service, any new AWS service becomes available to use in it.
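As an illustrative sketch (not our production template), here is a CloudFormation fragment that declares an EKS cluster; the logical name, cluster name, role ARN, Kubernetes version, and subnet IDs are placeholders:

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  IotEksCluster:                              # hypothetical logical name
    Type: AWS::EKS::Cluster
    Properties:
      Name: iot-prod-cluster                  # placeholder cluster name
      Version: '1.18'                         # a Kubernetes version available at the time
      RoleArn: arn:aws:iam::123456789012:role/eks-cluster-role   # placeholder role
      ResourcesVpcConfig:
        SubnetIds:
          - subnet-0aaaaaaaaaaaaaaaa          # placeholder subnet IDs
          - subnet-0bbbbbbbbbbbbbbbb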

Options such as Terraform, which is open source and supports the major cloud platforms for infrastructure as code, are available, but we used CloudFormation as we have everything on AWS.

·        How EKS Helped:

o   AWS billing can be reduced by using EKS

o   Fewer EC2 instances

o   Auto scaling using EKS

o   EKS monitoring and alerting services

 

·        New Infrastructure:

o   Reduced EC2 instances from 15 medium to 3 large

o   Removed Graphite

o   Autoscaling using EKS (see the autoscaler sketch after this list)

o   Reduced Datadog and PagerDuty alert configuration cost and complexity

o   Prometheus + Grafana-based alert configuration
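Since CPU-driven scaling was our main pain point on Mesos, here is the autoscaler sketch referenced above: a minimal HorizontalPodAutoscaler of the kind EKS runs for you. The names and thresholds are illustrative, and it targets the hypothetical Deployment sketched earlier:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: data-api-hpa                  # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: data-api                    # the hypothetical Deployment above
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70  # scale out when average CPU crosses 70%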

Datadog: We configured Datadog as an extension of CloudWatch for monitoring EC2 instances and connected AWS services. We installed the Datadog Agent on the instances, enabling it to collect system-level metrics at 15-second intervals for memory, CPU, storage, disk I/O, network, etc.
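For flavor (not our exact configuration), a fragment of a Datadog Agent check configuration showing where the collection interval lives; the disk check is just one example, and 15 seconds matches the Agent's default minimum collection interval:

# conf.d/disk.d/conf.yaml (illustrative check; other Agent checks take the same key)
init_config:

instances:
  - min_collection_interval: 15   # seconds between collections for this check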

Prometheus + Grafana: For additional alerting and monitoring of the Kubernetes cluster, we configured Prometheus + Grafana.

Prometheus helps capture and retain data about pods, containers, systemd services, etc. We can use this data to analyze the stability and behavior of the application and the environment.

Grafana uses the data stored by Prometheus and provides graphical presentations of statistics and alert configuration for easy assessment.
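As a minimal sketch of what such an alert looks like, here is a Prometheus alerting rule that fires when a scrape target disappears; the group name, duration, and labels are illustrative:

groups:
  - name: availability                # illustrative group name
    rules:
      - alert: TargetDown
        expr: up == 0                 # the target stopped answering scrapes
        for: 5m                       # must stay down 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been down for 5 minutes"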

Post Migration Best Practices

·        Maintain MTTR (Mean time to Respond/Resolve)

·        List critical conditions and report them

·        Immediate actions

·        Incident Reporting

·        Root cause analysis

·        Continuous improvement of defined processes

 

Strategy to Achieve:

·        MANUALLY:

o   Perform validation steps at regular intervals

o   Debug when unexpected behavior is observed

o   Follow the defined steps of the runbook

o   Call or email the dev support team if not resolved within the stipulated time

o   Restart services if needed, after capturing logs of the existing failure

 

·        AUTOMATION UTILITIES:

o   Continuous execution of defined validation tools using Jenkins + Selenium/Dynatrace

o   Enhancing validation coverage of the Python scripts

o   Notifications on the Slack channel

o   PagerDuty

 

·        ACTIONS:

o   Email if not resolved within 15 minutes

o   Escalate to Level 4 if not resolved within one hour

o   Escalate to Level 5 if still not resolved

o   Get the environment up and running

 

·        BEST PRACTICES:

o   Observe the environment for a few hours

o   Create a root cause analysis (RCA) document

o   Get the identified root cause approved by the dev team

o   Gather resolution information from the dev team

o   Gather immediate actions to take if the same RCA is observed in the future, to minimize downtime

o   Update the runbook for future reference

Benefits and Applications

·        AWS billing reduced by ~40% in our case, as the EC2 count dropped from 15 to 3

·        Automatic service restarts based on the scaling configuration improved application availability

·        Data loss and end-customer escalations reduced

·        More advanced monitoring, which helped DevOps engineers identify root causes quickly

Conclusion

 

In our case, we found EKS more helpful: the application became noticeably more stable after the change in orchestration. With EKS, we observed service stability, autoscaling, and load balancing, which helped us retain product availability. It is also true that both Kubernetes and Mesos provide facilities for deploying applications as containers on the cloud; based on different application needs, the right solution may vary.

Wednesday, December 30, 2020

Run Selenium Code on Linux Using Headless Google Chrome, and How to Install Python 3.7 on a Virtual Instance/Amazon EC2 When Python 2.7 Is Already Installed

Execute Selenium code on a cloud virtual instance/Amazon EC2 and remove the dependency on local physical desktops. With the same setup, we will also:

Install Python 3.7 on a virtual instance/Amazon EC2 when Python 2.7 is already installed.

We followed these high-level steps.

  1. Create a free-tier Google virtual machine or Amazon EC2 instance
  2. Make sure you can access it from your Windows desktop/laptop
Please refer
Document: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html 
Video: https://www.youtube.com/watch?v=bi7ow5NGC-U&ab_channel=LinuxAcademy 

Once you can access the cloud instance, please follow the steps below.


Check the OS version
The OS version will help when searching for references later, because Ubuntu commands will not work on CentOS.
For example: apt vs. yum
Please refer
Document: https://cmdref.net/os/linux/note/centos-vs-rhel 

For our case, execute below command/s:
cat /etc/os-release

Once the OS is confirmed as CentOS, check whether Python is installed. By default, many Linux virtual instances depend on Python and have Python 2.7.5 preinstalled.

Execute below command/s:
python -V or python --version (will give the Python version as 2.7.5)
which python (will give the location as /usr/bin/python)

Now install Python 3.7.


Please do not install using yum install -y python3 - it will install Python 3.6.x.


Please refer: 
Document: https://tecadmin.net/install-python-3-7-on-centos/ 

Now the technical challenges start:
python --version or python3 --version will not give the required version.

The reason is that the default python path still links to Python 2.7.
If you search Stack Overflow, everyone suggests moving the python location using commands like the ones below.

Try adding:
Execute below command/s:
export PATH=$PATH:/usr/local/bin    (add the directory, not the binary, to PATH; python3.7 installs to /usr/local/bin)

This will help confirm that Python 3 is installed:
Execute below command/s:
python3.7 --version

Please do not execute the below commands yet, as yum will stop working (yum on CentOS 7 depends on Python 2). Switch the default python only after all yum-based installs are done, as shown at the end of this setup.

mv /usr/bin/python2.7 /usr/bin/python_old
sudo ln -s /usr/local/bin/python3.7 /usr/bin/python


First, install pip and the required libraries.

Execute below command/s:
Using yum:
sudo yum install epel-release
sudo yum -y update
sudo yum -y install python-pip
pip -V
Or using curl:
curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
sudo python get-pip.py
pip -V

Move requirement.txt onto the Linux server.
Sample requirement.txt (copy everything below into Notepad and save as requirement.txt):

Execute below command/s:
pip install -Ur requirement.txt

astroid==2.4.2
atomicwrites==1.4.0
attrs==20.3.0
cachetools==4.1.1
certifi==2020.11.8
chardet==3.0.4
colorama==0.4.4
dateutils==0.6.12
docutils==0.14
et-xmlfile==1.0.1
google-api-core==1.23.0
google-api-python-client==1.12.8
google-auth==1.23.0
google-auth-httplib2==0.0.4
google-auth-oauthlib==0.4.2
googleapis-common-protos==1.52.0
httplib2==0.18.1
idna==2.10
importlib-metadata==3.1.0
isort==4.3.21
jdcal==1.4.1
lazy-object-proxy==1.4.3
mccabe==0.6.1
more-itertools==8.6.0
multi-key-dict==2.0.3
numpy==1.19.4
oauth2client==4.1.3
oauthlib==3.1.0
openpyxl==3.0.5
pandas==1.1.4
pbr==5.5.1
pluggy==0.13.1
protobuf==3.14.0
py==1.9.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pylint==2.3.1
pyserial==3.4
pytest==4.4.0
pytest-html==1.17.0
pytest-metadata==1.11.0
pytest-ordering==0.6
pytest-parallel==0.0.5
python-dateutil==2.8.1
python-jenkins==1.0.0
pytz==2020.4
requests==2.25.0
requests-oauthlib==1.3.0
rsa==4.6
selenium==3.141.0
six==1.15.0
typed-ast==1.4.1
uritemplate==3.0.1
urllib3==1.26.2
wrapt==1.12.1
xlrd==1.2.0
zipp==3.4.0


Now you need to install Google Chrome and ChromeDriver on Linux.


(As we are using PuTTY) we will not be able to launch Chrome and validate the UI - we need to validate details using commands, mainly the Chrome version:

Execute below command/s:
wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
sudo yum localinstall google-chrome-stable_current_x86_64.rpm
yum info google-chrome-stable

You also need to install chromedriver:

Execute below command/s:
cd /home/seleniumtest (if the folder is not created, please create it)
wget https://chromedriver.storage.googleapis.com/2.40/chromedriver_linux64.zip
unzip chromedriver_linux64.zip

This will unzip chromedriver into the seleniumtest folder.

Setup is done, so now we can switch the default python to Python 3.7.

Execute below command/s:
cd /usr/bin
ls -la | grep "python"
- This will display python2.7 with a symbolic reference link named python, or a python binary with no reference link
- If the reference link python -> python2.7 is present, we need to remove it

Execute below command/s:
sudo su
cd /usr/bin
ls -la | grep "python"
rm python (the prompt "rm: remove symbolic link 'python'?" will be displayed)
Type yes and press Enter.
ls -la | grep "python"
The symbolic reference link for python is now removed.

Execute below command/s:
mv /usr/bin/python2.7 /usr/bin/python_old (a few systems have the binary named python instead of python2.7)
sudo ln -s /usr/local/bin/python3.7 /usr/bin/python (point the default python at the Python 3.7 install)

Check the python version:
python --version (it should now show 3.7.x)


Now, run Selenium code on Linux using headless Google Chrome.

cd /home/seleniumtest (the same location where chromedriver was downloaded)
vi abc.py
Press i and paste the code below.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Chrome must run headless because there is no display over PuTTY/SSH
options = Options()
options.add_argument('--no-sandbox')             # needed when running as root
options.add_argument('--disable-dev-shm-usage')  # avoid crashes from a small /dev/shm
options.add_argument('--headless')               # run without a UI
options.add_argument('--disable-gpu')

# Use the chromedriver binary unzipped into the current folder;
# selenium 3.141 accepts options= (chrome_options= is deprecated)
driver = webdriver.Chrome(executable_path='./chromedriver', options=options)
driver.get('https://github.com/')
print(driver.title)  # confirms headless Chrome fetched the page
driver.quit()
Type :wq to write and save abc.py.

Execute:
python abc.py
Output:
GitHub: Where the world builds software · GitHub



A few errors we resolved during this entire setup:


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/selenium/webdriver/chrome/webdriver.py", line 73, in __init__
    self.service.start()
  File "/usr/lib/python2.7/site-packages/selenium/webdriver/common/service.py", line 98, in start
    self.assert_process_still_running()
  File "/usr/lib/python2.7/site-packages/selenium/webdriver/common/service.py", line 111, in assert_process_still_running
    % (self.path, return_code)
selenium.common.exceptions.WebDriverException: Message: Service /bin/google-chrome unexpectedly exited. Status code was: 1

Downloading packages:
  File "/usr/libexec/urlgrabber-ext-down", line 28
    except OSError, e:
                  ^
SyntaxError: invalid syntax

(This yum error appears once the default python has been switched to Python 3; it is why the default python must not be changed until all yum installs are done.)

 

import 'genericpath' # <_frozen_importlib_external.SourceFileLoader object at 0x7fe78c85ec10>
import 'posixpath' # <_frozen_importlib_external.SourceFileLoader object at 0x7fe78c844dd0>


(Driver info: chromedriver=2.40.565383, platform=Linux 3.1


selenium.common.exceptions.WebDriverException: unknown error: DevToolsActivePort file doesn't exist
(resolved by the --no-sandbox and --disable-dev-shm-usage options used in the script above)