Resolving our platform stability issues

We found that we were continuously firefighting to keep the system running smoothly. The support team ended up with a giant red button on their floor that was pressed numerous times per week. Our reputation for delivering amazing customer service was being hurt. The teams who had to deal with frustrated customers weren’t able to do their jobs, that is, to help salons utilise our platform to grow their business.

We needed to take serious action and fix the stability issues, rethinking our values and decision-making process when an incident happened. We could no longer just “redeploy it” or “make x bigger” to solve any production issues.

Every incident that occurred needed to have an outage report. This allowed us to have clear actions and solutions for every type of incident.

Description

Digestible one-liner of what happened.

Outage time

hh:mm

Number of support tickets raised

Detailed numbers of impact on the support team

Affected functionality

Description of the functions of the system affected by the outage

Explanation of the problem

A clear technical description of what happened

The report ensures we have a clear understanding of what actually happened.

Investigations

Some details of where the engineer looked, how they came to fix the issue and how long it took, along with some screenshots of metrics or logs from the issue.

Preventative measures and actions

What are we going to do to stop this from happening again?

The minimum expectation here would be an alert to help us pre-empt the issue. Each action needed to be tracked in Jira.
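To make the template concrete, here is a minimal sketch of how a report like this could be modelled in code. It is an illustration only, not the tooling we used: the class names, field names and the `ticket_key` format are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import List


@dataclass
class Action:
    """One preventative measure (hypothetical structure)."""
    summary: str
    ticket_key: str = ""  # assumed tracker key, e.g. "OPS-1"; empty means untracked


@dataclass
class OutageReport:
    """Mirrors the report fields listed above; names are illustrative."""
    description: str             # digestible one-liner of what happened
    outage_time: str             # "hh:mm"
    support_tickets: int         # number of support tickets raised
    affected_functionality: str  # functions of the system affected
    explanation: str             # clear technical description
    investigations: str          # where the engineer looked, how it was fixed
    actions: List[Action] = field(default_factory=list)

    def outage_duration(self) -> timedelta:
        """Parse the "hh:mm" outage time into a timedelta."""
        hours, minutes = self.outage_time.split(":")
        return timedelta(hours=int(hours), minutes=int(minutes))

    def problems(self) -> List[str]:
        """Return reasons the report is incomplete (empty list means complete)."""
        issues = []
        if not self.actions:
            issues.append("no preventative action recorded")
        for action in self.actions:
            if not action.ticket_key:
                issues.append(f"action not tracked: {action.summary}")
        return issues
```

A report with no preventative action, or with an action that has no ticket, comes back from `problems()` with a reason attached, which is one way to enforce the "every action is tracked" rule before a report is signed off.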

When analysing the data, we could clearly see how much the instability was hindering our product development. Engineers were constantly being pulled in different directions to firefight. That instability in velocity and delivery meant we couldn’t accurately predict when new features or improvements could be delivered. Two of our core values are growth and thinking long term, so we knew it was time to fix these issues and evolve our platform.

The effort to fix everything was too large for a small engineering team that was also continuing product development work. We had to make a big decision: halt all product development work and take on a large piece of engineering effort to fix the problems. This had large knock-on effects, as we had made business commitments and had expectations to meet.

The goal was clear: to improve the stability of our system while helping it scale as we grow our customer base. We called this engineering effort project Darwin as it was about the evolution of our system. From the engineering side it was extremely difficult to know when we would be done, but we broke it down into small measurable increments.

Some of the major pieces of work we took on were:

While it was painful to stop feature development and fix the issues, we can safely say our stability problems are gone. There is no more firefighting and the red button on the support floor is thankfully gathering dust. By using our long-term values as guidance, we took on project Darwin to attain platform stability, fault tolerance and elasticity.

So that we never have to fall back into this big bang approach to fixing things, we have adopted a continuous improvement mindset, which is now a core part of our engineering values. We take periodic breaks in our development sprints to work on our technical backlog: fixing niggling issues, upgrading areas of the system, answering the unknowns and always making the system better.

On a more personal note, this was one of the hardest engineering challenges I have ever faced. It wouldn’t have been possible without the talented engineers and the support of the team at Phorest.
