Engineering Onboarding
Handling Downtime
📋 What You Need Before Starting
Make sure these are ready:
- Incident.io Setup: For managing incidents.
- Grafana & Loki: For checking logs and errors.
- Checkly Debugging: For testing and monitoring.
🚨 Stay Calm and Take Action
Don’t panic! Follow these steps to fix the issue.
- Tell Your Users:
  - Let your users know there’s an issue. Post on Community and Discord.
  - Example message: “We’re looking into a problem with our services. Thanks for your patience!”
- Find Out What’s Wrong:
  - Gather details. What’s not working? When did it start?
- Update the Status Page:
  - Use Incident.io to update the status page. Set it to “Investigating” or “Partial Outage”.
🔍 Check for Infrastructure Problems
- Look at DigitalOcean:
  - Check if the CPU, memory, or disk usage is too high.
  - If it is:
    - Increase the machine size temporarily to relieve the load.
    - Keep looking for the root cause.
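
The check above can be sketched as a small helper. This is only an illustration: the threshold percentages, the `Usage` record, and the function names are assumptions, not values taken from DigitalOcean or from this runbook. Read the real numbers off the droplet graphs and tune the limits to your own machines.

```python
from dataclasses import dataclass

# Hypothetical limits in percent; adjust them to your own droplets.
THRESHOLDS = {"cpu": 85.0, "memory": 90.0, "disk": 90.0}


@dataclass
class Usage:
    """Resource usage in percent (0-100), as read from the monitoring graphs."""
    cpu: float
    memory: float
    disk: float


def overloaded_resources(usage: Usage) -> list[str]:
    """Return the names of resources at or above their threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if getattr(usage, name) >= limit]


def next_step(usage: Usage) -> str:
    """Turn the readings into the action described in the list above."""
    hot = overloaded_resources(usage)
    if hot:
        return (f"Resize temporarily ({', '.join(hot)} too high), "
                "then keep looking for the root cause.")
    return "Resources look healthy; move on to logs and errors."
```

For example, `next_step(Usage(cpu=95, memory=40, disk=50))` recommends a temporary resize because only the CPU reading crosses its limit.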
📜 Check Logs and Errors
- Use Grafana & Loki:
  - Search for recent errors in the logs.
  - Look for anything unusual or repeating.
- Check Sentry:
  - Look for grouped errors (errors that happen a lot).
  - Try to reproduce the error and fix it if possible.
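
As a rough sketch of the log search above, the snippet below builds a request URL for Loki's `query_range` HTTP endpoint and counts repeating messages in the lines you get back. The base URL and the `app="api"` label are placeholders; use whatever host and labels your own Loki setup exposes. The code only builds the URL and does not send the request.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def loki_query_url(base_url, logql, minutes=15, limit=200, now=None):
    """Build a /loki/api/v1/query_range URL covering the last `minutes`."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(minutes=minutes)
    params = {
        "query": logql,
        # Loki accepts start/end as Unix epoch timestamps in nanoseconds.
        "start": int(start.timestamp()) * 1_000_000_000,
        "end": int(now.timestamp()) * 1_000_000_000,
        "limit": limit,
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


def repeating_messages(lines, min_count=2):
    """Return (message, count) pairs for messages seen at least `min_count` times."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return [(msg, n) for msg, n in counts.most_common() if n >= min_count]


# Hypothetical host and label; the LogQL filter keeps lines containing "error".
url = loki_query_url("https://loki.example.internal", '{app="api"} |= "error"')
```

The same LogQL expression can be pasted into Grafana's Explore view if you prefer the UI.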
🛠️ Debugging with Checkly
- Check Checkly Logs:
  - Watch the video recordings of failed checks to see what went wrong.
  - If the issue is a timeout, it might mean there’s a bigger performance problem.
  - If it’s an E2E test failure due to UI changes, it’s likely not urgent.
    - Update the test to match the new UI and the check will pass again.
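
The triage rules above can be written down as a simple heuristic. This is a sketch only: the `kind` labels, the keyword hints, and the returned action names are all made up for illustration and are not part of Checkly. Always confirm by watching the recording of the failed check.

```python
# Hypothetical keywords that suggest a browser test broke because the UI changed.
UI_CHANGE_HINTS = ("selector", "locator", "element not found")


def triage_check_failure(kind: str, error_message: str) -> str:
    """Suggest a next action for a failed check.

    kind: "api" or "browser" (labels assumed for this sketch).
    """
    message = error_message.lower()
    if "timeout" in message or "timed out" in message:
        # Timeouts may point to a bigger performance problem.
        return "investigate-performance"
    if kind == "browser" and any(hint in message for hint in UI_CHANGE_HINTS):
        # Likely an E2E test that no longer matches the UI; not urgent.
        return "fix-test"
    return "investigate"
```

For instance, a browser check failing with "Timeout 30000ms exceeded" is routed to `investigate-performance`, not `fix-test`, because the timeout rule is checked first.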
🚨 When Should You Ask for Help?
Ask for help right away if:
- Flows are failing.
- The whole platform is down.
- There’s a lot of data loss or corruption.
- You’re not sure what is causing the issue.
- You’ve spent more than 5 minutes and still don’t know what’s wrong.
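
The criteria above amount to a checklist where any single match is enough to escalate. A minimal sketch, with field and function names invented for illustration:

```python
from dataclasses import dataclass

TIME_LIMIT_MINUTES = 5  # taken from the list above


@dataclass
class IncidentState:
    flows_failing: bool = False
    platform_down: bool = False
    major_data_loss: bool = False
    unsure_of_cause: bool = False
    minutes_without_diagnosis: float = 0.0


def escalation_reasons(state: IncidentState) -> list[str]:
    """Return every reason from the checklist that currently applies."""
    reasons = []
    if state.flows_failing:
        reasons.append("flows are failing")
    if state.platform_down:
        reasons.append("whole platform is down")
    if state.major_data_loss:
        reasons.append("significant data loss or corruption")
    if state.unsure_of_cause:
        reasons.append("cause is unclear")
    if state.minutes_without_diagnosis > TIME_LIMIT_MINUTES:
        reasons.append("more than 5 minutes without a diagnosis")
    return reasons


def should_escalate(state: IncidentState) -> bool:
    """Any single reason is enough to ask for help."""
    return bool(escalation_reasons(state))
```

Returning the list of reasons, not just a boolean, gives you ready-made text for the escalation message.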
💡 How to Ask for Help:
- Use Incident.io to create a critical alert.
- Go to the Slack incident channel and escalate the issue to the engineering team.
If you’re unsure, ask for help! It’s better to be safe than sorry.
💡 Helpful Tips
- Stay Organized:
  - Keep a list of steps to follow during downtime.
  - Write down everything you do so you can refer to it later.
- Communicate Clearly:
  - Keep your team and users updated.
  - Use simple language in your updates.
- Take Care of Yourself:
  - If you feel stressed, take a short break. Grab a coffee ☕, take a deep breath, and tackle the problem step by step.
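
For the "write down everything you do" tip, a timestamped log is often enough. The sketch below is one possible shape for it; the `IncidentLog` class is hypothetical, and a shared document or the incident channel works just as well.

```python
from datetime import datetime, timezone


class IncidentLog:
    """Keeps a timestamped list of actions taken during an incident."""

    def __init__(self):
        self.entries = []

    def note(self, action, when=None):
        """Record an action; defaults to the current UTC time."""
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, action))

    def render(self):
        """Return the timeline as plain text, one action per line."""
        return "\n".join(f"{when:%H:%M} UTC  {action}"
                         for when, action in self.entries)


log = IncidentLog()
log.note("Posted notice on Community and Discord")
log.note("Set status page to Investigating")
print(log.render())
```

The rendered timeline can be pasted into the incident channel or used later when writing up what happened.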