Tech debt in the cloud

Published in rude mechanicals, Jan 19, 2018

“We have too much tech debt.”
“That code is full of tech debt.”

We talk about “technical debt” — tech debt — as a single, unitary thing.

In fact, cloud software development today is a fight against four types of tech debt (at least), and they’re quite different: some debilitate, others merely annoy. We should treat them as separate things, rather than use the blanket generalization of “tech debt”.

Second, over the course of writing this post, I realized the term “tech debt” has run its course. In addition to being a “false aggregate” (talking about dissimilar things as a group), “tech debt” isn’t really “debt” at all; it’s a poor attempt at optimization, enabled by bad measurement systems (emphasis on delivered features), poor engineering practices, and upstream indecision in the development train. And its costs (lost productivity, longer time to market, and reduced morale/retention) are rarely worth it.

Type 1: Poor line-level code quality/“sloppy code”

Sloppy code is the canonical form of “tech debt”; it’s sometimes called “code debt”. Code debt is a concrete thing experienced developers can easily call to mind. We’ve all seen it, but to clarify, I include:

Poor naming: files, modules, methods, variables; names like “Process”, “Data”, “Execute”, “DoThing”. My personal favorite: “data1”, “data2”, and “data4” (what happened to “data3”?)
Inconsistent style: sloppy bracing, parentheses, etc.
Poor balance: single files with double-digit percentages of a system’s code, classes with one long, enormous method, single methods with lots of branches spanning hundreds of lines
Control flow problems: Deep nesting, high cyclomatic complexity (root cause of a serious SSL bug shipped by Apple)

Overall, this is easy-fix territory: open the editor, and have at it (no magic here). Use refactoring tools (e.g. ReSharper) if you can; if necessary, purchase them. Observation: I frequently find defects (boundary case handling comes to mind) when cleaning up sloppy code. “Cute” names like “DoThing” are indicators the author didn’t know what they were trying to do. Be vigilant.
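To make that concrete, here is a minimal before/after sketch in Python. The function names, the parameters, and the 10% member discount are all hypothetical, invented purely for illustration. The "before" version has the vague names and deep nesting described above; the "after" version does the same work with descriptive names and guard clauses, which puts the boundary cases (missing or non-positive price) in plain sight.

```python
# Before: vague names and nested conditionals bury the boundary cases.
def process(data1, data2):
    result = 0
    if data1 is not None:
        if data1 > 0:
            if data2:
                result = data1 * 0.9
            else:
                result = data1
    return result


# After: same behavior, but the names say what the code is for and
# guard clauses handle the boundary cases up front.
MEMBER_DISCOUNT_FACTOR = 0.9  # hypothetical business rule for this sketch


def discounted_price(price, is_member):
    """Return the price after any member discount; 0 for a missing or non-positive price."""
    if price is None or price <= 0:
        return 0
    if is_member:
        return price * MEMBER_DISCOUNT_FACTOR
    return price
```

Neither version is clever. The second one just lets a reviewer see, at a glance, what happens when the price is missing.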

While we’re at it: spend time getting names right. Code is read a lot more than it’s written; optimize for reading.

Mandatory code review prevents a lot of this by forcing code to be evaluated for both functionality and readability prior to merge. It also instills good habits, because people are always on better behavior when others are watching. Getting code review right is a real art, but for starters: (a) focus on the work, not the person, (b) use reviews to transmit knowledge about other areas of the code, language features, and other 1% improvements (make the team better), and (c) make reviews a tiny bit adversarial, enough to shake out “gotchas”, but stopping shy of inducing dread or hostility.

A thought: huge arguments happen when small arguments fester. Small things don’t fester when they’re addressed in reviews.

Type 2: Test debt

It’s hard to know how much testing a project needs. Medical, financial, and nuclear plant control systems need a lot. Platforms need a lot. Projects with less interaction, ones higher up in the stack (fewer dependencies), or things that won’t last long might need less.

However much is needed, most codebases I’ve worked on don’t have enough testing. That happens because the team doesn’t appreciate the golden rule of tests:

Over time, good tests accelerate development.

Good tests make change easier, catch problems before deployment, and provide a first line of defense against unknown unknowns.

Teams pick up test debt because:

Many (most?) developers don’t know how to write them. It’s one of the industry’s dirty secrets: everyone loves to talk about tests, but few actually know how to write good ones.
Developers are “given credit” for shipping features/story points without tests. Good, hands-on dev managers know better.
Tests require ongoing investment. If there’s no coverage in a codebase, getting started can become “an impossible task”: setting up automation, building code doubles for everything, making the doubles injectable — so big it never gets done. If you find yourself here, crack open a copy of Feathers and remember how he defines legacy (bad) code: “code not under test”.
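For anyone who hasn't seen "injectable doubles" in practice, here is a minimal sketch in Python. Every name in it (Checkout, FakeGateway, purchase, charge) is hypothetical and made up for this example. The idea is that the code under test receives its dependency as a constructor argument, so a test can hand it a fake instead of something that talks to the network.

```python
class FakeGateway:
    """Test double: records charges instead of calling a real payment service."""

    def __init__(self):
        self.charges = []

    def charge(self, customer_id, amount_cents):
        self.charges.append((customer_id, amount_cents))
        return True


class Checkout:
    """Code under test. The gateway is injected rather than created internally."""

    def __init__(self, gateway):
        self.gateway = gateway

    def purchase(self, customer_id, amount_cents):
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        return self.gateway.charge(customer_id, amount_cents)


def test_purchase_charges_the_gateway_once():
    # Plain-assert style test; a runner such as pytest would collect it by name.
    gateway = FakeGateway()
    checkout = Checkout(gateway)

    assert checkout.purchase("cust-1", 500) is True
    assert gateway.charges == [("cust-1", 500)]
```

The first double is the expensive one. Once the seam exists, the next test against Checkout costs a few lines.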

Fix it by:

Measuring and displaying code coverage on a monitor. Coverage measurements aren’t everything, but they’re a start.
Ensuring that reviewers reject code without tests.
Prioritizing coverage in the engineering culture. What usually goes wrong here is first-level management, almost always freshly-promoted, gets too divorced from day-to-day reality, and loses control of the development process. First-level management must include the quality of work product as a performance-review item, and that should include test coverage. The politics of getting all of this right are tricky; issues like this separate the wheat from the chaff of engineering management. (If first-line management isn’t former developers, give up.)

Type 3: Ops debt

If you’re in the 1% of developers making software without cloud services, congratulations; (a) you aren’t long for this world and (b) you don’t have to wrangle operations — the wonderful world of keeping the product up and running.

I think Google was the first company to apply a software engineering mindset toward infrastructure. They invented a role called Site Reliability Engineer — SRE — an expert at running something like Google. And when you handle their load — pagers going off because storage systems only have a few petabytes of capacity remaining — automation is the only game in town.

Ops debt is the most varied. Examples:

Deployments don’t work: deployments are scary things that take a long time, cause outages, and don’t reliably deliver the deployed version of software to the production environment
Things aren’t monitored; you don’t know traffic levels, service times, or other critical health metrics
You don’t have enough capacity; disks are filling up, databases are overflowing, you have too much network traffic
Critical infrastructure components like databases, operating systems, etc. don’t have patches applied
You don’t have backups
You’re trying to handle complexity in infrastructure rather than software. Things feel complicated: load-balancer rules, DNS, HTTP rewriting, traffic going all over the place, etc.
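On the monitoring point, even a crude measurement beats none. The sketch below is a toy, assuming an in-process Python service: the decorator name, the two dictionaries, and handle_request are all hypothetical, and a real system would ship these numbers to a metrics backend rather than keep them in memory. It only shows the shape of the idea: count calls and accumulate service time at the boundary of each handler.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-ins for a real metrics backend (assumption for this sketch).
call_counts = defaultdict(int)
total_seconds = defaultdict(float)


def monitored(func):
    """Record how often func is called and how long it takes in total."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether func returned or raised, so failures are counted too.
            call_counts[func.__name__] += 1
            total_seconds[func.__name__] += time.perf_counter() - started

    return wrapper


@monitored
def handle_request(payload):
    # Hypothetical request handler used only to exercise the decorator.
    return {"ok": True, "size": len(payload)}
```

After a few calls, call_counts["handle_request"] gives a traffic level and total_seconds["handle_request"] divided by that count gives a mean service time.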

Warning: Many ops problems are actually software problems in disguise. We wasted years learning that lesson at Crittercism.

Some pointers for managing ops debt:

Engage ops early; they write big checks and have to wait on vendors. They need to know well in advance when things are changing, doubly so in companies building “ops heavy” products requiring tons of networking/storage/compute, or intense compliance requirements (HIPAA, PCI-DSS).
Given the choice, always push complexity into software. Example: if you have customers with specific data-tenancy requirements (e.g. data has to be in Europe), make one version of the software that’s aware of the data tenancy. When faced with this situation, we tried running two separate instances of the software, with different code artifacts — big mistake.
Unless you’re starting a hosting company, get out of this business as much as possible. Use a PaaS like Heroku/App Engine. Hire one ops engineer and rethink when you reach $40-50k/month in spend. When you cross that, it might be time to start managing your own virtual machines. Don’t even think about physical infrastructure (owning or leasing) until you’re spending millions/year. Spectre/Meltdown were announced last week, and I’m delighted someone at AWS is managing that mess for me; I’ll spend the time focused on customers. If cost control is the priority, you’re better off (a) changing the product or (b) building better software. Don’t let the tail wag the dog.
Deploy often. If you can’t, figure out why not and fix it. Once/day at the absolute minimum. You’ll thank me when a critical issue needs attention.
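To illustrate the data-tenancy pointer above, here is a minimal sketch of what "one version of the software that's aware of the data tenancy" might look like in Python. The region codes, the database URLs, and the Customer fields are all hypothetical placeholders. The point is only that tenancy becomes a lookup inside a single code artifact instead of a second deployment.

```python
from dataclasses import dataclass

# Hypothetical region-to-database mapping; in practice this would come from config.
REGION_DATABASES = {
    "us": "postgres://db.us.example.internal/app",
    "eu": "postgres://db.eu.example.internal/app",
}


@dataclass(frozen=True)
class Customer:
    customer_id: str
    data_region: str  # where this customer's data must live, e.g. "eu"


def database_url_for(customer):
    """Pick the database for a customer based on their tenancy requirement."""
    try:
        return REGION_DATABASES[customer.data_region]
    except KeyError:
        # Fail loudly rather than silently writing data to the wrong region.
        raise ValueError(f"no database configured for region {customer.data_region!r}") from None
```

Adding a region is then a configuration change plus a test, not a new fork of the codebase to keep in sync.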

Type 4: Architecture debt

Productive dev teams have a shared understanding of “how things are done”. Creating this understanding is what people mean when they say a team has “gelled”. Architecture debt is the absence of such shared understanding.

Not gonna lie, if you have this, you’re in a world of pain. Try not to get here. And if you’re interviewing for a position in product development (engineering, product management), be on the lookout for this stuff. If you see it, negotiate a mandate to fix it as a term of getting hired, or find another place to work.

Symptoms of architecture debt:

Too many tools in use: programming languages, databases, etc. (exact number varies, but if the team has chosen to use Python and Ruby, or Java and C#, especially in close proximity, that’s a problem; likewise for MySQL/PostgreSQL)
Duplication: multiple pieces of code in the same codebase that do almost the same thing, multiple services/components in a larger system with too much functional overlap
Inconsistent naming in code, e.g. “PageRenderer” and “PagePainter”, especially if they do the same thing; multiple terms used interchangeably to mean the same thing, e.g. “App” and “Program”
Poorly-designed code: inconsistent layering, inconsistent separation of concerns, large amounts of unused code
Balkanized teams: little awareness of what others are doing (on their own team, or others), poor cross-team dynamics

Architecture debt happens due to team dynamics. The two leading causes:

Org structure/leadership problems, specifically, (a) lack of an identified leader whose decision is final, and/or (b) inability to enforce compliance with the leader’s decision. I’ve come around to the idea that companies shouldn’t act like democracies.
Communication pathologies. Teams who don’t talk because they don’t get along/aren’t friends/work in different offices. Again, closely related to org structure. Ben Horowitz nailed it: “Perhaps the CEO’s most important operational responsibility is designing and implementing the communication architecture for her company.”

To fix architecture debt, fix the organizational problems. You can’t really do this unless you’re hired to manage. Fixing this stuff takes a long time (months) and it’s going to slow down development; good luck, you’re fighting the good fight.

Non-debt

Sometimes you just have to rebuild, and there’s no way around it. There are two cases: one is sort-of debt, the other isn’t.

The sort-of debt case is gaining a better understanding of the problem domain. In a given problem domain (hotel software, photo sharing, analytics) there are certain “hard problems”. People who have experience building these systems know what’s hard, and what to do about it. You may not be able to hire them, especially early on. But let’s stop pretending everything we do is so new and novel. It isn’t. There’s a reason Facebook hired Schroepfer and Google got Eric Schmidt.

The non-debt case is scale. Don’t overdesign. If you have a scaled customer base (important qualification), what works today will probably break along the journey to 10x today’s volume (traffic, users, whatever). Don’t worry about it. If you think scaling a technology system is hard, what’s even harder is predicting what’s going to break. Ask me about the time we hit a packets-per-second limit (packets, not bytes) on an EC2 virtual NIC between a Redis machine and a Python process; that was not the problem I thought I’d be solving that day. Plan to build a new system at 10x.

“Technical debt”: a failed metaphor

Revisiting what I said earlier, “tech debt” is a flawed metaphor.

At its core, financial debt allows exchanging cash today for cash tomorrow. If the metaphor holds, tech debt allows shortening implementation time today in exchange for “repayment” at some indeterminate future point.

But that future rarely comes. In a fast-moving industry, you can’t afford to get behind, because it’s extremely rare to “come back from the dead” in technology.

I think what’s changed is that we don’t “ship” anymore. In days of old, Microsoft worked for three years on Windows, and then “shipped” it. Missing the holiday ship cycle could be death, so it was worth some future pain. In today’s world, where cloud-delivered software is less of a sprint and more 20 miles per day, every day, the game is long-term productivity maximization. Tech debt hurts productivity. You have to avoid it like the plague.

So before you’re tempted to cut a corner, consider that you’re pouring the foundation today for what you build tomorrow. Getting this right is like planning a city, or a garden. You start from a small, carefully-scoped inner core, then grow outward, maintaining quality every step of the way.

What you can’t do is build a shanty town, and expect to rebuild it. “Plan to throw one away” is yesterday’s advice. You’ll never find the time.