How we do Alerting and Escalation

When I first joined Hippo in late 2019, if there was an outage in one of our systems, our engineering team would normally hear about it from our production support team. One of our agents would be using the app and it would stop working, or they would discover a bug. This led to longer lead times for fixes and less trust in our in-house systems.

Last year (2021) we were finally in a position to automate our outage detection in a way that would meaningfully benefit our business. We implemented a set of organization-wide alerts that could notify our engineering on-call through PagerDuty pages to cell phones or through a Slack channel. We set up an alerts channel and defined a small number of P0 and P1 alerts to tell us if something was wrong before one of our users noticed it.

It was imperative that we kept the number of alerts small, to build trust in alerting and monitoring without too much noise. We settled on 10 important alerts of each severity for the organization-wide P0/P1 set. Teams could still implement their own team-level alerts if they wanted to watch a service or system more closely.

Now we detect almost every technical problem or outage long before our business does, and we are usually able to respond and take care of it quickly. This is the story of what we did and how we did it.

To alert our team about problems with our systems, we need some way to monitor them in real time. We had been using a log ingestion system for a while, but the purpose of logs and the purpose of real-time monitoring are different. Logs provide us with detailed information about problems. Some log management products also offer aggregation and alerting, but as our system grew we outgrew the logging system's ability to monitor our services on its own.

We started to use Prometheus for real-time metrics. Prometheus provides a time-series database that makes it easy to query rates and trends. Most HTTP services expose an endpoint for Prometheus to scrape, collecting metrics using a “pull” method. Many Prometheus libraries and plugins provide default metrics for HTTP services. They also make it easy to create or instrument metrics for events.

For example, let's say we want to capture an event every time a customer performs a certain action on our site, like getting a quote for insurance. We have a back-end API that issues the quote. We could instrument that API to record a quoted event in real time through Prometheus. The pseudo-code would look something like this:
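A minimal sketch in Python, using the prometheus_client library; the metric name, label, and quoting stub below are placeholders for illustration, not our actual code:

```python
from prometheus_client import Counter, start_http_server

# Hypothetical counter for quote events; the metric name and label are assumptions.
QUOTES_ISSUED = Counter(
    "quotes_issued_total",
    "Number of insurance quotes issued",
    ["product"],
)

def issue_quote(customer_id: str, product: str) -> dict:
    # Placeholder for the real quoting logic in the back-end API.
    quote = {"customer_id": customer_id, "product": product, "premium": 100.0}
    # Record the event; Prometheus reads the aggregated counter when it scrapes.
    QUOTES_ISSUED.labels(product=product).inc()
    return quote

if __name__ == "__main__":
    # Expose a /metrics endpoint on port 8000 for Prometheus to pull from.
    start_http_server(8000)
    issue_quote("cust-123", "home")
```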

One more thing to note is that our real-time metrics system is not meant to store data forever. We allocate enough storage space for roughly a 14- or 30-day window, and when the system runs out of space, older data is deleted. Some organizations might need to retain metrics for longer, so your retention window could vary.
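In Prometheus, that kind of window is set with retention flags on the server; the values here are just examples:

```sh
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB
```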

Below is a simple summary of our monitoring and alerting systems. We have real-time alerts as described above. We also use a system for business alerting, which notifies us when business metrics change, driven by queries against our data warehouse. Combined, the two let us watch both our distributed system and the health of the business at once:

A simplified summary of our monitoring and alerting at Hippo
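To give a rough sense of the business-alerting half, here is a sketch of the pattern: poll a data-warehouse query on a schedule and post to Slack when a metric drops below a floor. The query, table, threshold, and webhook URL are all hypothetical:

```python
import requests

# Hypothetical values for illustration only.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."
QUOTES_FLOOR = 50  # minimum quotes expected per hour

def hourly_quote_count(warehouse_conn) -> int:
    # Any DB-API-compatible warehouse connection would work here.
    cur = warehouse_conn.cursor()
    cur.execute(
        "SELECT COUNT(*) FROM quotes "
        "WHERE created_at > NOW() - INTERVAL '1 hour'"
    )
    count = cur.fetchone()[0]
    cur.close()
    return count

def check_business_metric(warehouse_conn) -> None:
    count = hourly_quote_count(warehouse_conn)
    if count < QUOTES_FLOOR:
        # Post a simple message to the alerts Slack channel.
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f"Business alert: only {count} quotes in the last hour"},
        )
```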

The first thing we needed to do was figure out what we wanted to monitor. Some of this was easy, since for every service we run — internal or external — we want to know, at a minimum, its error rate and its latency.

Most scalable, well-constructed APIs should have an error rate of 0.1% or less, depending on what kind of dependencies they have. I believe it's OK to start with much higher alert thresholds and bring them down over time. For latency, I like the guideline of no more than 1 second at the 95th percentile, but latency can be much more variable than error rate, depending on what your service does.
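As a concrete illustration, a pair of Prometheus alerting rules along those lines could look like the following; the metric names, job label, and thresholds are assumptions rather than our production rules:

```yaml
groups:
  - name: service-slos
    rules:
      - alert: HighErrorRate
        # Fraction of requests returning 5xx over the last 5 minutes.
        expr: |
          sum(rate(http_requests_total{job="quote-api", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="quote-api"}[5m])) > 0.01
        for: 10m
        labels:
          severity: P1
      - alert: HighP95Latency
        # 95th-percentile request latency, computed from a histogram.
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="quote-api"}[5m])) by (le)
          ) > 1
        for: 10m
        labels:
          severity: P1
```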

Some more examples of metrics we might watch:

I'll introduce a very simple example of tuning an error rate alert. Your service and error rate may vary, and you will probably also want to gather a longer history of data than I'm using in this example. Let's say we have a service error rate graph over the last 12 hours that looks like this:

Sample service error rate graph

We can draw a few conclusions to tune an alert:

We normally have this alert post to a Slack channel, then watch it for a few days. If it fires false alarms, we consider fixing the underlying service, or raising the alert threshold if that can't be done.

If we're not seeing it fire at all and we're not satisfied with the service performance, we may adjust both the rate and time thresholds down a little to detect outages more consistently.
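For reference, routing alerts to a Slack channel like this is usually configured in Alertmanager; a minimal version might look like the following, with the webhook URL and channel as placeholders:

```yaml
route:
  receiver: alerts-slack

receivers:
  - name: alerts-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/...  # placeholder webhook
        channel: '#eng-alerts'
        send_resolved: true
```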

Sometimes we have a service that usually has a zero or very low error rate, where any spike in error rate could indicate an outage. The example below has a short, high spike in error rate. Since it lasts only a few minutes, we probably want to ignore it as a blip or network hiccup and set the alert duration longer than the spike:
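In rule form, that just means requiring the non-zero error rate to persist longer than the typical blip; the names and durations here are illustrative:

```yaml
- alert: ErrorsOnQuietService
  # Any sustained 5xx errors on a normally error-free service.
  expr: sum(rate(http_requests_total{job="quiet-service", status=~"5.."}[5m])) > 0
  # Longer than the few-minute spike we want to ignore.
  for: 15m
  labels:
    severity: P1
```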

Once we had automated alerts firing, we needed to define an escalation process. The basic idea of an escalation process is a script that one of our on-call engineers can follow to notify the right people and fix the problem as quickly as possible. When an unfamiliar issue pops up, it's easy to panic and not know who to contact next for help. The escalation process removes that doubt and gives clarity to our on-call.

We set up a simple set of steps to escalate:

These simple steps give the on-call instructions on what to do when an alert shows up. They don’t need to know how to fix it, only what to do when they see it and how to contact someone to help fix it. Our team resolves most issues within a couple of hours.

We have other paths of escalation, such as user complaints about problems with the system, which get injected into this process in other ways but are probably beyond the scope of this short post.

Not every organization has a 150-strong engineering team. Maybe you don’t even have 25. What can you do with a smaller team to achieve some of the same results?

Most log management systems have an alerting feature. With a small volume of traffic, it's easy to set up alerts and thresholds all in one place. We use Loggly for processing logs, which has alerting functions that let you route alerts to e-mail or Slack channels. That's perfectly adequate for a 10-person team to set up some basic alerting. It's also not that difficult to set up a Prometheus/Grafana combo and enable metric scraping from a running service. A good dev/devops team can prototype this in less than a day.
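To give a sense of scale, a minimal Prometheus scrape configuration for one service is only a few lines; the job name and target are placeholders:

```yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: quote-api
    static_configs:
      - targets: ['localhost:8000']  # the /metrics endpoint exposed by the service
```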

