📋 What You Need Before Starting

Make sure these are ready:

  • Incident.io Setup: For managing incidents.
  • Grafana & Loki: For checking logs and errors.
  • Checkly Debugging: For testing and monitoring.

🚨 Stay Calm and Take Action

Don’t panic! Follow these steps to fix the issue.

  1. Tell Your Users:

    • Let your users know there’s an issue. Post on Community and Discord.
    • Example message: “We’re looking into a problem with our services. Thanks for your patience!”
  2. Find Out What’s Wrong:

    • Gather details. What’s not working? When did it start?
  3. Update the Status Page:

    • Use Incident.io to update the status page. Set it to “Investigating” or “Partial Outage”.

🔍 Check for Infrastructure Problems

  1. Look at DigitalOcean:
    • Check whether CPU, memory, or disk usage is at or near its limit.
    • If it is:
      • Temporarily increase the machine size to relieve the pressure.
      • Keep looking for the root cause; a bigger machine only buys time.
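
As a rough illustration of the resource check above, the sketch below flags any resource whose usage crosses a threshold. The 85% threshold, the metric names, and the `overloaded_resources` helper are assumptions made for this example, not DigitalOcean settings; substitute the limits your team uses and read the real values from the DigitalOcean monitoring graphs.

```python
# Hypothetical threshold: not a DigitalOcean default, pick your own.
USAGE_THRESHOLD_PERCENT = 85.0


def overloaded_resources(usage_percent, threshold=USAGE_THRESHOLD_PERCENT):
    """Return the sorted names of resources at or above the threshold."""
    return sorted(
        name for name, percent in usage_percent.items() if percent >= threshold
    )


if __name__ == "__main__":
    # Made-up readings, as if copied by hand from the monitoring graphs.
    readings = {"cpu": 97.0, "memory": 62.5, "disk": 88.0}
    hot = overloaded_resources(readings)
    if hot:
        print("Consider a temporary resize; overloaded:", ", ".join(hot))
    else:
        print("Resources look fine; keep investigating elsewhere.")
```

If nothing is flagged, the problem is probably not raw capacity, so move on to the logs.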

📜 Check Logs and Errors

  1. Use Grafana & Loki:

    • Search for recent errors in the logs.
    • Look for anything unusual or repeating.
  2. Check Sentry:

    • Look for grouped errors (the same error occurring many times).
    • Try to reproduce the error and fix it if possible.
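
To make "unusual or repeating" concrete, here is a minimal sketch that counts repeated error messages in log lines you have exported as plain text. It assumes each line starts with a timestamp followed by a space, and that error lines contain the word "error"; both are assumptions about your log format, and `normalize` and `top_repeating_errors` are names invented for this example.

```python
import re
from collections import Counter


def normalize(line):
    """Strip the parts that vary between occurrences of the same error."""
    line = re.sub(r"^\S+\s+", "", line)  # drop the leading timestamp (assumed format)
    line = re.sub(r"\d+", "<n>", line)   # collapse numbers such as IDs and durations
    return line.strip()


def top_repeating_errors(lines, limit=3):
    """Return the most common normalized error messages with their counts."""
    errors = (normalize(line) for line in lines if "error" in line.lower())
    return Counter(errors).most_common(limit)


if __name__ == "__main__":
    sample = [
        "2024-05-01T10:00:01Z ERROR timeout after 30s for flow 12",
        "2024-05-01T10:00:02Z INFO flow 13 finished",
        "2024-05-01T10:00:03Z ERROR timeout after 31s for flow 14",
        "2024-05-01T10:00:04Z ERROR database connection refused",
    ]
    for message, count in top_repeating_errors(sample):
        print(count, message)
```

Collapsing numbers lets the two timeout lines count as one recurring error, which is roughly what error grouping in a tool like Sentry does for you automatically.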

🛠️ Debugging with Checkly

  1. Check Checkly Logs:
    • Watch the video recordings of failed checks to see what went wrong.
    • If the issue is a timeout, it might mean there’s a bigger performance problem.
    • If it’s an E2E test failure due to UI changes, it’s likely not urgent.
      • Update the test to match the new UI and the check will pass again.
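
The triage rules above can be written down as a small decision function. This is only a sketch: `triage_failed_check` and its two flags are hypothetical and are not fields from the Checkly API; you would set them yourself after watching the recording of the failed check.

```python
def triage_failed_check(is_timeout, caused_by_ui_change):
    """Map what you saw in a failed check to a rough urgency level."""
    if is_timeout:
        # Timeouts may be a symptom of a wider performance problem.
        return "urgent: possible performance problem"
    if caused_by_ui_change:
        # The product works; only the test is out of date.
        return "not urgent: update the E2E test"
    return "unclear: keep investigating"


if __name__ == "__main__":
    print(triage_failed_check(is_timeout=True, caused_by_ui_change=False))
    print(triage_failed_check(is_timeout=False, caused_by_ui_change=True))
```

The timeout case is checked first on purpose, so a slow platform is never dismissed as a stale test.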

🚨 When Should You Ask for Help?

Ask for help right away if:

  • Flows are failing.
  • The whole platform is down.
  • There’s significant data loss or corruption.
  • You’re not sure what is causing the issue.
  • You’ve spent more than 5 minutes and still don’t know what’s wrong.
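
The checklist above, expressed as a minimal sketch. The `IncidentState` fields and the `escalation_reasons` helper are names made up for this example; only the conditions and the 5-minute limit come from the list itself.

```python
from dataclasses import dataclass

MAX_SOLO_MINUTES = 5  # the time limit from the checklist above


@dataclass
class IncidentState:
    flows_failing: bool = False
    platform_down: bool = False
    major_data_loss: bool = False
    cause_known: bool = True
    minutes_spent: float = 0.0


def escalation_reasons(state):
    """Return every reason to ask for help; an empty list means carry on."""
    reasons = []
    if state.flows_failing:
        reasons.append("flows are failing")
    if state.platform_down:
        reasons.append("platform is down")
    if state.major_data_loss:
        reasons.append("data loss or corruption")
    if not state.cause_known:
        reasons.append("cause unknown")
        if state.minutes_spent > MAX_SOLO_MINUTES:
            reasons.append("more than 5 minutes without a diagnosis")
    return reasons


if __name__ == "__main__":
    state = IncidentState(cause_known=False, minutes_spent=6)
    print(escalation_reasons(state) or "no need to escalate yet")
```

Any single reason is enough to escalate; returning the full list simply gives you something to paste into the alert.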

💡 How to Ask for Help:

  • Use Incident.io to create a critical alert.
  • Go to the Slack incident channel and escalate the issue to the engineering team.

If you’re unsure, ask for help! It’s better to be safe than sorry.

💡 Helpful Tips

  1. Stay Organized:

    • Keep a list of steps to follow during downtime.
    • Write down everything you do so you can refer to it later.
  2. Communicate Clearly:

    • Keep your team and users updated.
    • Use simple language in your updates.
  3. Take Care of Yourself:

    • If you feel stressed, take a short break. Grab a coffee ☕, take a deep breath, and tackle the problem step by step.