I just messed up prod

It happened. It finally happened. After almost 4 years working in the IT industry I managed to screw up the production environment.

I was almost starting to worry about my lack of failure stories to tell.

What happened?

It all started when I was working on an application that consumes some messages from an SQS queue that was subscribed to a couple of SNS topics. Unfortunately, we misdesigned the app upfront and we didn’t account for the fact that the app needed to receive and process the messages in order, so we were now dealing with concurrency problems.

After some analysis of the concurrency issue, we decided to replace the SQS queue our app listens to with a FIFO queue, which preserves the order of messages. The only problem is that FIFO queues can’t be subscribed to regular SNS topics, so we would also have to turn two topics into FIFO topics to fit our requirements. The plan was settled: we’d stop emitting messages to the regular SNS topics and start emitting to the new FIFO ones.

In other words, every subscriber of the regular SNS topics would have to subscribe to the new FIFO topics in order to keep receiving their messages.
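That requirement is easy to state and also easy to check mechanically. Below is a minimal sketch of such a check, not what we actually ran. It assumes each subscription is a plain dict with `Protocol` and `Endpoint` keys (the shape boto3's `list_subscriptions_by_topic` returns, for example), and that some endpoints are renamed during the move, which the hypothetical `renames` mapping covers. All ARNs are made up.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# A subscription as a plain dict with at least "Protocol" and "Endpoint" keys.
Subscription = Dict[str, str]


def missing_subscriptions(
    old_subs: Iterable[Subscription],
    new_subs: Iterable[Subscription],
    renames: Optional[Dict[str, str]] = None,
) -> List[Tuple[str, str]]:
    """Return (protocol, endpoint) pairs expected on the new topic but absent.

    `renames` maps an old endpoint to the endpoint that replaces it, for
    subscribers that also moved (for example to a FIFO queue).
    """
    renames = renames or {}
    present = {(s["Protocol"], s["Endpoint"]) for s in new_subs}
    expected = {
        (s["Protocol"], renames.get(s["Endpoint"], s["Endpoint"]))
        for s in old_subs
    }
    return sorted(expected - present)


# Hypothetical ARNs, for illustration only.
OLD = [
    {"Protocol": "sqs", "Endpoint": "arn:aws:sqs:us-east-1:123456789012:orders"},
    {"Protocol": "sqs", "Endpoint": "arn:aws:sqs:us-east-1:123456789012:billing"},
]
NEW = [
    {"Protocol": "sqs", "Endpoint": "arn:aws:sqs:us-east-1:123456789012:orders.fifo"},
]
RENAMES = {
    "arn:aws:sqs:us-east-1:123456789012:orders": "arn:aws:sqs:us-east-1:123456789012:orders.fifo",
    "arn:aws:sqs:us-east-1:123456789012:billing": "arn:aws:sqs:us-east-1:123456789012:billing.fifo",
}

if __name__ == "__main__":
    for protocol, endpoint in missing_subscriptions(OLD, NEW, RENAMES):
        print(f"missing on new topic: {protocol} -> {endpoint}")
```

Run once per migrated topic before switching the publishers over. An empty result means every old subscriber has a counterpart on the new topic.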

And that’s where I fucked up. I forgot to create one of the subscriptions when making the changes, which resulted in approximately 10 hours of messages that were never delivered or processed. That’s about 45K messages I had to dig out of logs, parse and feed through a script to fix the state of the system.

I may talk about the solution, which was technically interesting to implement, but that’s not the point here.

Lessons learned

At design time, check your assumptions about input and output, especially if the application needs to keep and change state. I remember talking with my team about it, but we clearly didn’t pay as much attention to it as we should have, so now we’re spending time fixing a problem that shouldn’t exist at all.

If possible, have a standard process for changing infrastructure, and make the changes in code so they are reproducible across staging and prod environments. This should also reduce the chance of human error. If you are stuck making important changes by hand, keep a checklist with you. No matter how small the change is, it can save you from yourself.
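To make "do it in code" a bit more concrete, here is a rough sketch of the idea rather than a real deployment tool. The topology is declared once, rendered for each environment, and compared with what actually exists. The topic and queue names, the `env-` prefix convention, and the `actual` set (which would have to come from the cloud provider's API) are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (topic name, queue name)


@dataclass(frozen=True)
class SubscriptionSpec:
    topic: str  # logical topic name, without environment prefix
    queue: str  # logical queue name, without environment prefix


# Single source of truth for every environment (hypothetical names).
SPEC = [
    SubscriptionSpec("order-events.fifo", "order-processor.fifo"),
    SubscriptionSpec("order-events.fifo", "billing.fifo"),
    SubscriptionSpec("customer-events.fifo", "order-processor.fifo"),
]


def render(spec: Iterable[SubscriptionSpec], env: str) -> List[Pair]:
    """Expand the logical spec into concrete names for one environment."""
    return [(f"{env}-{s.topic}", f"{env}-{s.queue}") for s in spec]


def plan(spec: Iterable[SubscriptionSpec], env: str, actual: Set[Pair]) -> List[Pair]:
    """List the subscriptions that still need to be created in `env`."""
    return [pair for pair in render(spec, env) if pair not in actual]


if __name__ == "__main__":
    # Pretend prod only has the first subscription so far.
    actual_prod = {("prod-order-events.fifo", "prod-order-processor.fifo")}
    for topic, queue in plan(SPEC, "prod", actual_prod):
        print(f"to create: {queue} -> {topic}")
```

Because staging and prod are rendered from the same `SPEC`, a subscription that exists in one and is missing in the other shows up as a non-empty plan instead of as lost messages.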

Try not to make important changes alone. An extra pair of eyes will most likely be useful to check if things are working as expected.

Don’t rush things. It is hard to choose between buying into the excitement of finishing something and being more conservative. When dealing with infrastructure, always be more conservative. Don’t make changes when you haven’t slept enough, don’t make changes when you haven’t eaten properly, and avoid making changes when you’re overworked. It’s better to ship late than to cause a big mess you’ll have to clean up the next day.

Shit happens. Going into desperate mode will only make the troubleshooting process more error-prone. Step back and take a deep breath before acting on the problem. If the situation is truly critical, it demands good thinking, and we can’t do that if we’re too busy being overwhelmed by it.

The impact on me

This happened at a time when our team was going to get some extra benefits if we managed to deliver the product within the current week, so after performing the biggest screw-up of my entire career in that very same week, I was miserable. I felt really guilty. I really wanted my team to earn those benefits, so I just kept working harder, which only led me to produce a sequence of even more shameful mistakes (one involved running an update without a where clause in production, but it was thankfully inside a transaction).

My manager was super cool about it though. He’s been through a plethora of different problems with the production environment and managed to calmly assess the situation, focusing on the solution to the problem and its side effects. He even joked about it, which was crucial for keeping me somewhat productive that day and keeping everyone as calm as possible. Damn, I don’t even think the guy thought about it; his experience just took the wheel.

I know my mistakes don’t erase the good things I’ve managed to deliver in the past, but boy does it hurt to fuck things up.

As someone who values himself mostly for his intelligence, accuracy, reliability and performance, I was crushed. I’m always proud of saying that I’m someone who complains, but also someone who delivers, and now I feel like I have less moral standing to complain about things. I feel this every time I make a mistake, like some of my value was taken from me, like I’m literally less now.

I have to be careful with this feeling and the thoughts it brings me. I can’t let them deny me anything I deserve. I gotta stay calm, not overreact and be conscious of what it means to make a mistake. Do I really believe that I’m good, or is part of it ego?

I’m learning more and more about handling this feeling though, and the more mistakes I make, the more I understand myself and my limits, and thus, the further I can go.