As all companies become software-driven, DevOps is becoming an important practice in enterprises and startups across the world.
DevOps is about bringing velocity to delivering tech products and services, so you can delight customers and meet business goals. To achieve this velocity, development (dev) and operations (ops) teams work closely together across the software lifecycle - from planning to release.
This has led to a new role in engineering teams: the DevOps Engineer.
To understand what goes into the day-to-day of the role, we analyzed 29 DevOps Engineer job postings from major tech companies like Apple, TikTok, Airbnb and more. Here are the top responsibilities of a DevOps Engineer and the percentage of job postings that mention each.
- Manage Infrastructure (79% of job postings)
- Build and Maintain the CI/CD Pipeline (69% of job postings)
- Availability and Reliability of Services (34% of job postings)
- Security and Compliance (43% of job postings)
- Monitoring and Alerts (37% of job postings)
- Incident Management and On-call (31% of job postings)
- Production Troubleshooting (20% of job postings)
- Automation and Tools (41% of job postings)
- Consult the Engineering Team (31% of job postings)
Manage Infrastructure
As a DevOps engineer, one of your core responsibilities is owning the tech infrastructure behind all products, services, apps, APIs and more. You will be involved in the design and delivery of staging and production environments that meet the performance and reliability requirements of the services deployed on the infrastructure.
You will use automation and “infrastructure as code” to achieve this at scale, and should be comfortable using popular tools like Terraform and SaltStack. This essay from Emily Wood is a good read on the topic.
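To make the "infrastructure as code" idea concrete, here is a minimal Python sketch of the declare-then-reconcile loop that such tools are built around. It is not how Terraform or SaltStack work internally; the `Server` type and the `plan` function are hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """Hypothetical resource description; real tools track many more fields."""
    name: str
    size: str


def plan(desired: list[Server], actual: list[Server]) -> dict[str, list[str]]:
    """Compare the declared state with what is running and list the changes needed."""
    want = {s.name: s for s in desired}
    have = {s.name: s for s in actual}
    return {
        # Declared in code but not running yet.
        "create": sorted(want.keys() - have.keys()),
        # Running, but with settings that differ from the declaration.
        "update": sorted(n for n in want.keys() & have.keys() if want[n] != have[n]),
        # Running but no longer declared in code.
        "delete": sorted(have.keys() - want.keys()),
    }


if __name__ == "__main__":
    desired = [Server("web-1", "large"), Server("web-2", "large")]
    actual = [Server("web-1", "small"), Server("db-old", "medium")]
    print(plan(desired, actual))
```

The point of the sketch is that the environment is described as data kept in version control, and the tool works out the changes, rather than an engineer applying them by hand.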
Build and Maintain the CI/CD Pipeline
Along with the infrastructure, building and managing the CI/CD pipeline will be your key responsibility. CI/CD consists of continuous integration, continuous delivery, and continuous deployment of code and was created to solve the problems of integrating new code into existing systems. These problems are solved by introducing automation and monitoring into this part of the software lifecycle, and the steps together are referred to as the CI/CD pipeline. The main pipeline stages are Build, Test, Release, Deploy and Validation.
You will use popular tools like CircleCI, Semaphore CI, Travis CI (and many others) to stay on top of this task. You can dive deeper into continuous delivery with this essay.
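The stage order described above can be sketched in a few lines of Python. This is only an illustration of the idea that each stage gates the next; it does not reflect the configuration format of CircleCI, Semaphore CI or Travis CI, and `run_pipeline` and the placeholder stage functions are hypothetical.

```python
from typing import Callable

# A stage is a name plus a callable that returns True on success.
Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> list[str]:
    """Run stages in order and stop at the first failure. Returns a log of outcomes."""
    log = []
    for name, step in stages:
        ok = step()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            # Later stages never run, so a failing test cannot reach Deploy.
            break
    return log


if __name__ == "__main__":
    # Placeholder steps; a real pipeline would call build and test tooling here.
    stages = [
        ("Build", lambda: True),
        ("Test", lambda: True),
        ("Release", lambda: True),
        ("Deploy", lambda: True),
        ("Validation", lambda: True),
    ]
    for line in run_pipeline(stages):
        print(line)
```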
Availability and Reliability of Services
As part of the DevOps team, you will be tasked with maintaining the availability and reliability of all the tech services. Some of the activities involved are listed below:
- You will hold discussions with the product team to understand the availability goals of a feature, product or service, and do demand forecasting and capacity planning to achieve these goals.
- You may be involved in creating a disaster recovery plan and testing it, so you can bring services back online after a major outage.
- You may be in charge of critical services which have a high availability requirement (with close to zero downtime). In that case, you will have to make sure to check off items for high availability.
- You may also be required to define and maintain SLOs, SLIs and SLAs to meet the availability and reliability requirements.
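An availability target becomes easier to discuss once it is translated into minutes of allowed downtime. The sketch below does that arithmetic in Python, assuming a 30-day window by default; the function names are illustrative and not taken from any particular tool.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over the window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the downtime budget still unused; negative means the SLO is breached."""
    budget = allowed_downtime_minutes(slo_percent, window_days)
    return 1 - downtime_minutes / budget


if __name__ == "__main__":
    print(allowed_downtime_minutes(99.9))
    print(budget_remaining(99.9, downtime_minutes=10))
```

For example, a 99.9% target over 30 days leaves roughly 43 minutes of downtime, which is a useful number to bring to the conversation with the product team.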
Security and Compliance
Securing the important data of your business and users will also partly become your responsibility. You will be tasked with following the security best practices for the infrastructure and development. This may include things like running security audits and managing user roles on cloud platforms to restrict access.
If your company has certifications like SOC 2 or ISO 27001, your team will have to meet the requirements to maintain those certifications. In addition to this, your team may have to drive privacy, compliance and security initiatives within the organization.
Monitoring and Alerts
To achieve the availability goals and honor SLAs, DevOps engineers need to continuously monitor the tech infrastructure, apps, APIs etc. Depending on the complexity of requirements, you can choose between commercially available or open-source tools, and in some cases can build the tools yourself.
You should use the four golden signals of monitoring to establish a baseline and keep the systems healthy.
- Latency - The time it takes the system to respond to and serve a request. Track failed requests separately from successful ones, since errors that return quickly can make latency look better than it is.
- Traffic - How much demand is being placed on the system. The more demand that’s put on the system, the more stressed it becomes, so you should be aware of how much load your system can handle.
- Errors - There will always be errors. However, both the frequency and seriousness of errors should be monitored and evaluated to see if there’s a bigger issue to address.
- Saturation - How much can the system handle? How much traffic is too much? If the system’s resources are maxed out, the service will suffer.
There are many popular monitoring tools like Datadog, Prometheus and Grafana. To know more about metrics, you can read this article from DigitalOcean.
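As a rough illustration of the four signals, the Python sketch below computes each one from a batch of request records. The `Request` fields, the 95th-percentile choice and the use of traffic divided by a known capacity as a stand-in for saturation are all assumptions made for the example; in practice saturation is usually read from resource metrics such as CPU, memory or queue depth, and a tool like Prometheus or Datadog does this work for you.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request; field names are illustrative."""
    duration_ms: float
    ok: bool


def golden_signals(requests: list[Request], window_seconds: float, capacity_rps: float) -> dict[str, float]:
    """Summarize latency, traffic, errors and saturation for one time window."""
    if not requests:
        raise ValueError("need at least one request")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, using integer math for ceil(0.95 * n).
    rank = (95 * len(durations) + 99) // 100
    traffic = len(requests) / window_seconds
    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": traffic,
        "error_rate": sum(not r.ok for r in requests) / len(requests),
        # Assumption: capacity_rps is a load the service is known to sustain.
        "saturation": traffic / capacity_rps,
    }


if __name__ == "__main__":
    sample = [Request(duration_ms=float(i), ok=(i > 2)) for i in range(1, 21)]
    print(golden_signals(sample, window_seconds=10, capacity_rps=4))
```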
Incident Management and On-call
As a DevOps engineer, you are sometimes asked to go on-call as a first responder. In this capacity, you’ll address any tech issues that arise during your shift and either resolve them or escalate them to the larger team. You will have to run incident drills, set up on-call schedules for teams, and educate them on processes for handling critical incidents.
After an incident, you may be involved in writing the post-mortem as well. Traditional post-mortem practices involve sharing the first draft internally for peer review and having it proofread by senior engineers. It’s important to collect all key incident data, assess the impact the error had on the system, identify the root cause, and prioritize the fixes appropriately. Here is an example of a detailed post-mortem from the AWS team.
You can check out Spike.sh or PagerDuty for incident management and on-call.
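Setting up an on-call schedule is, at its simplest, a round-robin over the team. The Python sketch below assumes weekly shifts and a fixed list of engineers; the names are placeholders, and it does not use the Spike.sh or PagerDuty APIs, which handle overrides, escalation and time zones on top of this.

```python
from datetime import date, timedelta


def on_call_schedule(engineers: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
    """Assign one engineer per week, cycling through the list in order."""
    if not engineers:
        raise ValueError("need at least one engineer")
    return [
        (start + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]


if __name__ == "__main__":
    # Placeholder names and start date.
    for shift_start, engineer in on_call_schedule(["ana", "ben", "chen"], date(2024, 1, 1), 5):
        print(shift_start.isoformat(), engineer)
```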
Production Troubleshooting
Just as you do when writing a post-mortem, you will collaborate with peers to troubleshoot production issues, especially when the issues are related to performance, reliability, and scale.
Automation and Tools
Engineering teams are constantly working with limited resources to solve big problems, so automation plays a pivotal role in helping a DevOps team achieve the engineering organization’s goals. You will have to find bottlenecks across the development lifecycle and automate tasks to improve consistency and free up dev teams to focus on shipping product features. You will also have to build self-serve tools so that engineering teams can get their DevOps-related requests satisfied faster. This is a good starting point for automation in DevOps.
Consult the Engineering Team
And finally, in your role as a DevOps engineer, you will be an evangelist for best practices across all engineering functions like development, QA and ops. You will guide others in using DevOps principles like automation and reuse in their workflows.
In order to offer solid and actionable advice, you should keep up to date with tech trends, cutting edge tools and modern practices. You should be plugged into the community to learn from your peers and share lessons from your experiences. Ultimately, your knowledge will prove indispensable when making decisions about key technologies, next-generation processes, and adopting DevOps-first principles.
The mandate of ensuring that tech systems are running smoothly and securely involves many different tasks. A DevOps engineer acts as a jack-of-all-trades, maintaining enough knowledge to dip into various parts of the system and ensure that everything is working as well as it can be. Beyond keeping your system running like a well-oiled machine, engineers should constantly be looking for ways to make it run even better.
- We did a similar post by analyzing SRE roles at 30 major companies - you can read it here.
- A giant visual roadmap for DevOps.
- We analyzed job postings from Apple, Microsoft, TikTok, Slack, Zoom, Dropbox, Airtable, PayPal, Ripple, Drizly, Demandbase, OneLogin, Foursquare, App Annie, Roku, SeatGeek, Box, Braze, ChowNow, Wealthfront, ClickFunnels, Mastercard, CrowdStrike, TaskRabbit, Drift, Sensor Tower, CloudBees, Flexport and Sprig.