In November 2022, we decided to split the database that our services write to. We initially had everything on just one service, and that was a problem. If, for example, that central database went down, it would affect everything. If we had a slow query on the database, everybody was automatically affected, and it wasn’t a good experience for our users.
So we took that insight and decided that we had to split our services. During the split, we deployed at night because there was less traffic, and at that time, everything looked fine. But the next morning, by 5:30 am, there was a problem.
The scale of our products at Moniepoint and the delicacy of financial services mean that there is little to no room for error. Keeping the trust of our users, merchants who depend on us for a complete business banking experience, requires that our products and services remain active. But the world isn’t perfect, and that’s where site reliability engineering comes in.
What is Site Reliability?
Simply put, site reliability is the discipline or structure put in place to act as a supporting system for software, applications or products.
Site reliability is what happens when you ask a technical team to design an operations team. This means that, unlike traditional technical engineers, devs and other tech-related roles, site reliability engineers (SREs) apply some software principles to their day-to-day work to ensure the proper flow of our services.
In the incident from November, our SRE team helped identify the problem we had early on. The person on duty noticed that we started having slow queries on the database. We saw the slow query, and connected it to the deployment the night before.
In our language, we call a deployment like this “deploying beans”. If you’ve ever had to prepare beans, then you know that cooking beans is very difficult, and it doesn’t always come out right. What had happened was that when the split was done, a vital column was left out.
So we all had to get on a call: site reliability, the enterprise architects and the head of infrastructure. We resolved the issue before 7:30 am that morning.
There are four primary responsibilities of our Site Reliability Engineering team:
Monitoring: A foundational requirement for every SRE, monitoring involves collecting, processing, aggregating, and displaying real-time quantitative data about a system. This could include query counts and types, error counts and types, processing times, server lifetimes and other things that need to be monitored. We use tools to observe for any abnormalities in the system, so we can get ahead of them before they disrupt our services. Active monitoring was why we were able to detect the problem with our database, and get ahead of it before it began to affect our users.
Availability: We are responsible for the availability of the services we support. After all, if services are unavailable, end users are disrupted, which can cause serious damage to our organisation's credibility. We ensure that every service we render remains available.
Performance: Our service needs to be not only available, but also highly performant. For example, how useful is a website that takes 20 seconds to move from one page to another, or a transfer that takes one week to go through? Beyond being available, our services must also perform at a level that provides a seamless experience for our users.
Incident management: SREs manage the response to unplanned disruptions that impact customers, such as outages, service degradation, or interruptions to business operations. We identify the problem, work to ensure that the services are back up and running, and coordinate with the necessary stakeholders to make this happen.
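To make the monitoring responsibility more concrete, here is a minimal sketch of the kind of check that can surface slow queries like the ones in our November incident. It is an illustration only, not our production tooling: the function names, the 95th-percentile measure and the 2x threshold are all assumptions chosen for the example.

```python
from statistics import quantiles


def p95(samples):
    """Return the 95th percentile of a list of durations (needs at least 2 samples)."""
    # quantiles(..., n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(samples, n=20, method="inclusive")[-1]


def is_slow(baseline_ms, recent_ms, factor=2.0):
    """Flag the recent window if its p95 exceeds `factor` times the baseline p95.

    `factor` is a hypothetical threshold; a real system would tune it per query type.
    """
    return p95(recent_ms) > factor * p95(baseline_ms)


# Hypothetical query durations in milliseconds, before and after a deployment.
before_deploy = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11]
after_deploy = [10, 12, 11, 300, 280, 310, 12, 11, 290, 305]

if is_slow(before_deploy, after_deploy):
    print("Slow queries detected: check the most recent deployment")
```

Comparing against a baseline, rather than a fixed number, is one way to notice that something changed after a deployment even when the absolute values would not look alarming on their own.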
How do we maintain site reliability?
The SRE role is a diverse one, with many responsibilities. An SRE must be able to identify an issue quickly, troubleshoot, and mitigate it with minimal disruption to operations.
A partial list of the tasks we typically undertake includes:
Working with the Devs, Enterprise Architect, Infrastructure or System Admin to solve problems using software - whether it’s code-related, a quick fix or infrastructural upgrade.
Being on call to detect any service abnormalities. This is not the most attractive part of being an SRE, but it is essential. We must be available to attend to errors, even outside of typical work hours.
Facilitating discussions of strategy and execution during incident management. We call this “leading a war room”.
Performing postmortems to identify processes that can be put in place to avoid further disruption in the future. We fix every issue we encounter, and then put systems in place to ensure that it never occurs again.
Automating: SRE work can be tedious and boring if everything is done manually. Automation not only saves time but also reduces failures due to human error. Spending some time automating tasks can have a strong return on investment.
Holding stand-ups to discuss and implement best practices across several areas of service management.
How we designed an effective on-call system
An on-call management system streamlines the process of adding members of the SRE team into after-hours or weekend call schedules, assigning them equitable responsibility for managing alerts outside of traditional work hours or on holidays.
In some cases, an organisation might designate on-call SREs around the clock. In our case, we have a team of application monitoring and technical engineers available before working hours, during working hours and late at night, who are the first point of contact when issues arise. We use on-call schedules to make sure that someone's always there to respond to major bugs, capacity issues, or product downtime.
If they can't fix the problem on their own, they're also responsible for escalating the issue. For SRE teams like ours, who run services for which customers expect 24/7/365, 99.999% uptime and availability, the on-call SRE has multiple duties:
Protecting production systems: The SRE on call is a guardian to all production services they are required to support.
Responding to emergencies within an acceptable time: We have automated monitoring and alerting solutions that empower our SREs to respond immediately to any interruptions to service availability.
Involving team members and escalating issues: The on-call SRE is responsible for identifying and calling in the right team members to address specific problems.
Tackling non-emergency issues: We have secondary on-call engineers scheduled to handle non-emergencies.
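To put the 99.999% figure in perspective, the short calculation below converts an availability target into the downtime it allows. This is plain arithmetic rather than anything specific to our systems, and the function name is made up for the example.

```python
def allowed_downtime_minutes(availability_pct, period_days=365):
    """Return how many minutes of downtime an availability target permits in a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# "Five nines" over a year leaves only a little over five minutes of downtime.
print(round(allowed_downtime_minutes(99.999), 2))      # 5.26 minutes per year
print(round(allowed_downtime_minutes(99.9), 1))        # 525.6 minutes per year
```

With a budget that small, a person cannot be expected to notice and react to every problem unaided, which is why automated alerting and a clear escalation path matter so much.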
Having an on-call system ensures that you’re able to detect problems at off-hours, and this is especially important for financial services. Having a remote SRE team also increases our efficiency. We’re available round the clock to attend to issues and are able to gather on a call and fix issues, no matter where we are.
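As a rough illustration of sharing on-call responsibility equitably, the sketch below builds a simple round-robin rotation. It is not a description of our actual scheduling system: the weekly shift length, the function name and the engineer names are assumptions made for the example.

```python
from datetime import date, timedelta
from itertools import cycle


def build_rotation(engineers, start, weeks):
    """Assign one engineer per week, cycling through the team in order.

    Returns a list of (week_start_date, engineer) pairs.
    """
    rota = cycle(engineers)
    return [(start + timedelta(weeks=i), next(rota)) for i in range(weeks)]


# Hypothetical team of three, scheduled for six weeks from a Monday.
for week_start, engineer in build_rotation(["Ada", "Bayo", "Chi"], date(2023, 1, 2), 6):
    print(week_start.isoformat(), engineer)
```

A plain rotation like this gives everyone the same number of shifts over time. A real schedule would also need to handle leave, holidays, swaps and the secondary on-call role.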
An SRE team you can rely on
Businesses sometimes deprioritise site reliability, leaving the load for just one person to handle. But the advantage of having a team, instead of just one person, is that there’s always someone available to detect any issues that might arise.
With our transactions per minute hitting new milestones every day, site reliability has become more important for us than ever. Our team grew, and our system made everyone’s jobs clear and fluid. Part of running effectively is a culture that lets our team work efficiently. For our teams, we prioritise:
A focus on engineering
Balanced workload to avoid burnout
Positive and safe environment
Beyond this, every member of the team has a couple of characteristics that contribute to effective site reliability engineering. Some of them include:
Attention to detail: Our team refers to this as being paranoid. It could be just a small spike and something someone else could ignore, but finding out exactly what is happening (and why) is important. If there’s a spike or reduction in the values you’re monitoring, what caused it? Is it a normal pattern, or is something wrong? That paranoia helps a lot. We’ve sometimes detected bugs just by being attentive to changes in our system that might’ve otherwise been ignored.
Willingness to learn: As a site reliability engineer, you don't need to know everything, but you should come close. This is important because you’ll need to interact with different teams to solve the different problems that might arise. So you want to be versatile enough to work with them and find a solution that works.
Taking responsibility: As a site reliability engineer, it’s important to understand that it’s your responsibility to ensure that services stay up. This sense of responsibility keeps you attentive to processes, and with good attention to detail, keeps you ahead of downtimes.