Leveraging an outage to build community and consensus

We had our first extended outage at RadioSpiral this weekend, and I’m writing about it here to point out how a production incident can help bring a team together not only technically, but as a group.

The timeline

On Sunday evening, about an hour before Tony Gerber’s Sunday show, RadioSpiral went offline. Normally, the AirTime installation handles playing tracks from the station library when there’s no show, and it played a track…and then stopped. Dead air.

The station has been growing; we’ve added two new DJs, doubling the number of folks who are familiar with servers, Linux, etc. Everyone who was available (pretty much everyone but our primary sysadmin, who set everything up and who is in the UK) jumped in to try to see what was up. We were able to log in to AirTime and see that it was offline, but not why; we tried restarting the streaming service, and the server itself, but couldn’t get back online.

We did figure out that we could connect to the master streaming port so that Tony could do his show, but after that, we were off the air for almost 12 hours, until our primary sysadmin was up, awake, and had finished his work day.

A couple hours of investigation on his part did finally determine that LetsEncrypt had added a RewriteRule to the Airtime configuration that forced all URLs to HTTPS; unfortunately it needs HTTP for its internal APIs and that switchover broke it. Removing the rule and restarting the server got us back on line, and our very patient and faithful listeners trickled back in over the day.

Now what?

While we’d not been able to diagnose and fix the problem, we had been chatting in the staff channel on the RadioSpiral Discord server, and considering the larger issues.

RadioSpiral is expected to be up 24/7, but we’re really running it more like a hobby than a business. This is reasonable, because it’s definitely not making any of us money, at least not directly. (Things like sales of albums by our DJs, etc., are their business and not part of the station’s remit.) This means that we can have situations like this one, where the station could be offline for an extended amount of time without recourse.

Secondarily, RadioSpiral is small. We have three folks who are the core of actual station operations, and their contributions are very much siloed. If something should happen to any one of the three of us, it would currently be a scramble to replace them and could possibly end up with an extended loss of that function, whether broadcast operations, the website, or community outreach and the app.

So we started looking at this situation, and figuring out who currently owned what, and how we could start fixing the single points of failure:

  • Station operations are on an ancient Linux release
  • We’re running an unsupported and unmaintained version of Airtime. It can’t even properly reset passwords, a major problem in an outage if someone can’t get in.
  • The MacOS/iOS app is handled by one developer; if that person becomes unavailable, the app could end up deleted from the store if it’s not maintained.
  • The website is being managed by one person, and that person becomes unavailable…well, the site will probably be fine until the next time the hosting bill isn’t paid, but if there were any issues, we’d be SOL.
  • We do have documentation, but we don’t have playbooks or process for problem solving.
  • We don’t have anywhere that is a gathering point when there’s a problem.
  • We don’t have project tracking so we can know who’s doing what, who their backup or backups are, and where things are in process.
  • We don’t have an easily-maintained central repository of documentation.

What we’re doing

I took point on starting to get this all organized. Fixing all of the things above is going to take time and some sustained effort to accomplish, and we’re going to want to make sure that we have everything properly set up so that we minimize the number of failure points. Having everyone onboard is critical.

  • We’re going to move operations to a newer, faster, and cheaper server running a current LTS Ubuntu.
  • We’re going to upgrade from the old unsupported AirTime to the community-supported LibreTime.
  • We’re figuring out who could get up to speed on MacOS/iOS development and be ready to take over the app if something should happen that I couldn’t continue maintaining it. At the moment, we’re looking at setting up a process to bump the build number, rebuild with the most current Xcode, and re-release every six months or so to keep the app refreshed. Long-term we’ll need a second developer (at least) who can build and release the app, and hopefully maintain it.
  • We haven’t yet discussed what to do about the website; it is already a managed WordPress installation, so it should be possible to add one or more additional maintainers.
  • We are going to need to collect the docs we have somewhere that they can be maintained more easily. This could be in a shared set of Google docs, or a wiki; we’re currently leaning toward a wiki.
  • We need project tracking; there’s no need for a full-up ticketing process, at least yet. We think that Trello should do well enough for us.

We have set up some new Discord channels to keep this conversation open: #production-incidents, to make tracking any new problems easier, and #the-great-migration, to keep channels open as we move forward in the migration to our new setup.

Everyone is on board and enthusiastic about getting things in better shape, which is the best one could want. It looks good for RadioSpiral’s future. Admittedly we should have done this before a failure, but we’re getting it in gear, and that’s better than ignoring it!

Reply