In this video Alexis Moussine-Pouchkine, Developer Advocate at Google’s Cloud organization, walks through SRE (Site Reliability Engineering) best practices and how they apply to a Serverless architecture.
At 1:02 he begins by explaining Serverless: the absence of server management, fully managed security, and a ‘pay only for usage’ model.
Serverless Compute on Google Cloud
At 5:46 Alexis elaborates on the Serverless Compute options on Google Cloud. The first is Cloud Functions, a functions-as-a-service offering where you deploy a small amount of code, its dependencies, and the events that trigger it.
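Cloud Functions deployments in Python typically take this shape: a single function that receives the trigger’s payload (for HTTP triggers, a Flask request object). The function name and response below are illustrative, a minimal sketch rather than code from the talk:

```python
def hello_http(request):
    """HTTP-triggered Cloud Function sketch.

    `request` is the incoming Flask Request object; Cloud Functions
    passes it in automatically when the trigger fires.
    """
    # Read an optional query parameter, falling back to a default.
    name = request.args.get("name", "world")
    return f"Hello, {name}!"
```

The trigger (HTTP, Pub/Sub message, storage event, and so on) is declared at deploy time; the code itself only handles the payload.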
From 6:10 he explains that App Engine is better suited to front-end applications with multiple services and modules. With the flexibility and freedom of container-based applications, you can choose your own stack while still enjoying the Serverless benefits.
At 10:25 Alexis switches to the Cloud Console, where he demonstrates Serverless Compute on Google Cloud, showing the execution times of different programs and other comparative details in the logs. At 17:59 he states that SREs are responsible for on-call, blameless post-mortems, incident management, testing and actionable alerting.
Managing the Error Budget
From 20:50 Alexis defines the critical dynamic central to SRE: the “error budget”.
This is calculated by subtracting the Service Level Objective (SLO) from 100%. What many may find surprising is that Google does not aim for 100% availability, since that would leave no room for new feature innovation.
Optimal SRE is therefore a process of managing this budget: finding the balance between never using it, meaning you aren’t innovating enough, and burning through it, meaning you’re shipping too much at the expense of reliability.
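The arithmetic behind an error budget can be sketched directly. Assuming a hypothetical 99.9% availability SLO (the specific target is not from the talk), the budget is the remaining 0.1%, which converts into allowable downtime per month:

```python
# Error budget = 100% - SLO, expressed as fractions.
SLO = 0.999                       # 99.9% availability target (assumed)
error_budget = 1.0 - SLO          # fraction of time/requests allowed to fail

# Convert the budget into minutes of downtime over a 30-day month.
minutes_per_month = 30 * 24 * 60  # 43,200 minutes
downtime_budget = error_budget * minutes_per_month

print(f"Allowed downtime: {downtime_budget:.1f} minutes/month")  # ~43.2
```

Shipping risky changes spends this budget; an empty budget means slowing down releases until reliability recovers.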
From 24:05 he demonstrates the various Stackdriver tools across Cloud Functions, App Engine and the new Serverless offerings. He then elaborates on logging, traces and error reporting, and how these are performed in a Serverless world.
Managing SLOs at Datadog
At 28:16 he introduces Daniel Langer, Product Manager at Datadog, highlighting that Datadog is a SaaS platform with over 7500 customers.
At 34:04 Daniel points out how the widget being launched calculates Datadog Uptime, which helps track SLOs. He also uses application latency as an example to show how a service level indicator (SLI) works in this regard.
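A latency SLI of this kind is usually the fraction of requests served faster than a threshold, compared against an SLO target. A minimal sketch, where the sample latencies, threshold and target are made up rather than taken from the demo:

```python
# Made-up request latencies in milliseconds for illustration.
latencies_ms = [120, 95, 300, 80, 110, 450, 70, 130, 90, 100]

THRESHOLD_MS = 250   # a request under this is "good" (assumed cutoff)
SLO_TARGET = 0.75    # e.g. 75% of requests must be good (assumed target)

# SLI = good requests / total requests.
good = sum(1 for latency in latencies_ms if latency < THRESHOLD_MS)
sli = good / len(latencies_ms)

print(f"SLI = {sli:.0%}, SLO met: {sli >= SLO_TARGET}")
```

In practice the latencies would come from a monitoring backend rather than a hard-coded list, but the SLI/SLO comparison is the same.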
At 43:49 Alexis explains how to successfully implement DevOps for Serverless workloads, elaborating on user-facing functionality, data storage and event-driven logic.
At 44:36 he summarizes the steps for implementing DevOps in a Serverless scenario. The first step is to focus completely on the user and to precisely define your SLIs and SLOs as you move to Serverless.
The second step is to use the best available Google Cloud Serverless tools. Since functions drive the implementation, it is important to use them wisely. The Serverless environment is a great opportunity to keep a steady focus on users’ needs instead of only the lower levels of infrastructure.
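The first step above, defining SLIs and SLOs precisely, can be sketched as explicit definitions that the rest of the tooling consumes. The service names and targets here are illustrative, not from the talk:

```python
# Illustrative SLO definitions: each entry names a user-facing SLI
# and the target the measured value must meet.
slos = {
    "availability": {"sli": "good_requests / total_requests", "target": 0.999},
    "latency": {"sli": "requests_under_300ms / total_requests", "target": 0.95},
}

def slo_met(name: str, measured_sli: float) -> bool:
    """Check a measured SLI value against its defined SLO target."""
    return measured_sli >= slos[name]["target"]
```

Writing the definitions down this explicitly is what makes the later steps (dashboards, alerts, error-budget tracking) possible.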
With proper SRE practices, error budgets and the Stackdriver tools, you can implement DevOps for Serverless workloads with ease.