The development train

David Albrecht
Published in rude mechanicals
Jan 10, 2018

Before shortbar, I was engineer #4 at Crittercism. It’s no secret this place crashed and burned; even so, I had a great time working there, learned a lot, and even made a few friends.

First lesson: a software product organization is a train: product strategy pulls development, which pulls operations. “Technical operations” problems (availability, outages) are often, at their core, due to poor development practices; development problems, in turn, are often due to problems with product strategy. When things go really wrong, the tendency is to point fingers at first-order causes and to call for “heads to roll”. Strong leaders resist this tendency, looking deeper to find the root causes of trouble.

An exciting dumpster fire

When I started (May 2012), excitement was high. We’d just landed The Big Customer — Netflix — which boosted morale, and critically, credibility. Those parts were good. What wasn’t good was the tsunami of traffic getting dumped on us: event data every time someone opened Netflix, even more when it crashed. The product was down for days, it was a disaster, and everybody knew it. It was time to get serious about reliability.

I got to work right away trying to put out the fire. Big problem #1: no monitoring. Nothing was measured. Ask any experienced cloud services engineer about monitoring, and you’ll get an earful. It’s a huge topic and there is an arsenal of tools, from commercial offerings like PagerDuty, CloudWatch, and New Relic to do-it-yourself solutions like Nagios, Graphite, and Sensu. We had none of this, just 100+ manually-provisioned Ubuntu VMs on EC2. I was particularly bothered by this state of affairs because I’d just come off running my own company, and we measured everything. We knew how many people visited us each day, how many had signed up, 7- and 14-day retention numbers, all of it. This was going to be my first task. I was going to know whether our product was operating, dammit, how much load we were taking, and how those numbers fluctuated over time.

Lesson #1: You have to measure. The founding team must instill a culture of measurement. At any point, everyone in the company should know what’s important, and how it’s being measured. You must budget for the systems and processes required to collect this data.

A few days later, I’d learned enough Chef to be dangerous, rigged up some netcat scripts, and got a graphite collector online. I made a simple HTML page that refreshed once/minute with the graphs on it, and put it on a 24" monitor. That gave some idea what was going on, but also made a strong statement to the team: this is important.
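For readers who haven’t fed a metric into Graphite before, the mechanics are small. The following is a minimal sketch in Python, not the netcat scripts described above: it assumes a Graphite (carbon) collector listening for its plaintext protocol on the default port 2003, where each data point is one line of the form "path value timestamp". The hostname and metric path are made up for illustration.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="graphite.example.internal", port=2003, timestamp=None):
    """Open a TCP connection to the collector and send a single data point.

    The default host is a placeholder; 2003 is carbon's default plaintext port.
    """
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
    return line


if __name__ == "__main__":
    # Hypothetical metric name; in practice this would run from cron or a loop.
    print(format_metric("servers.web01.requests_per_min", 1250), end="")
```

Piping the same line into netcat pointed at the collector has the same effect, which is why a few shell scripts were enough to get the first graphs on the wall.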

Lesson #2: Measurements must be communicated. Well-placed video displays are a good tool.

When you display something on a monitor, in a prominent place, people notice it, and they pay attention. Walk through any tech office in San Francisco, and the first thing you’ll notice is that there are displays everywhere. I’m surprised how few people outside of high tech have figured out how unreasonably effective this practice is. Whatever a company’s goals, whether customer satisfaction, revenue growth, wait times, whatever, they need to be on a huge LCD placed prominently, where people will notice it, e.g. on the way to the restroom.

The money problem

A couple months passed and we got past daily outages. Work became an informal cat-and-mouse between engineering, who kept adding capacity, and sales, who kept adding load. It was fun, and productive.

At the time, my girlfriend (now wife) lived in a different city, so I spent a lot of time at 21st Amendment with my friend Felix. He’d just moved from Germany, and had remarked how “web-obsessed” SF’s engineering culture was, like we’d forgotten how to build any other kind of software (e.g. games, industrial control, phone firmware).

Looking back, Felix was right; we were doing analytics at Crittercism, but you’d never know it from our technical architecture. It was classic three-tiered “web”: load balancers, application servers, synchronous database writes. It scaled, but manually, and very expensively. We were spending something like $80k/month (roughly $960k/year) for a 15-person company making maybe two million/year in recurring revenue, so infrastructure alone ate close to half of what we brought in. We started to have serious discussions about profitability and whether our current offering was commercially viable. The circumstances led to two developments. I opposed one and supported the other. Both ended poorly.

The first: we decided to charge a lot more, and sell using a sales team. I’m not sure this was something the founders ever wanted to do, but it was the only option given our cash burn and lack of organic distribution. This decision came with all the usual shenanigans, including confusing hidden pricing, and mandatory conversations with sales reps. It also brought the complexity of a sales team into the company: much longer cycles and higher deal cost, arguments over commissions and expense controls, territory fights, and demands for product customization. The office also got a lot louder.

(Sidenote: San Francisco founders are notoriously averse to hiring sales teams, preferring high-velocity, self-service customer acquisition. I think the culture has become too anti-sales, but only a little.)

The second: infrastructure, stability, and cost optimization became the dominant theme of the engineering organization, rather than customer needs, growth, or building features. We ended up with a VP of Engineering, selected largely due to his prior experience running a hosting company. “Predictability”, “efficiency”, and “cost control” became the watchwords of engineering.

It was only later that we realized we were fighting the wrong battle, but it was already too late.

Going deeper: the software problem

About a year later, the decision came down that we were going to hire a more hands-on director of engineering. I had little visibility into this decision because, by this point, I’d gone from almost-nightly dinners with the founders to being “just a programmer”, a grunt laboring under two levels of middle management. (A strange outcome, given I was nominated to an investor’s president’s club for very high performance at the company.) On the other hand, it must’ve been pretty clear to the exec team that things were heading sideways: in the months after the director hire, we laid off 30 people (out of a 100-person company), the VP of Engineering was fired, and the CEO was replaced. We also lost several lighthouse customers, and I’d spent way too much time fighting outages. Hell, we even had an office burglary in there somewhere. I’d put on about 10 pounds from stress eating and lack of exercise. Our VPE tried to get me anger management counseling.

“Startups”.

Lesson #3: When a company is failing, individual performance isn’t recognized. There’s no scope for career growth at a failing company. Eric Schmidt was right — get on a rocket ship if you want to grow.

Anyway, our director of engineering, Kevan Dunsmore. Well-liked and competent, he pointed out that our “website” was the real problem; what we needed to build was a high-volume “big data” system with services, batch processing, etc. This would be a wholesale rewrite of a big piece of the system, violating one of Joel Spolsky’s commandments: never rewrite from scratch.

It was only at this point that we finally had a collective realization: ops (the cost, the complexity of managing hundreds of VMs, all of it) was the wrong problem to solve. We needed better software. Better software was the true lever to drive down cost, increase performance, and ultimately, deliver more customer delight.

The development train

That’s when it hit me — software drives ops. If you want to solve your ops problems, focus on building better software.

Coinbase just had a big outage last week. They’re growing really fast, adding something like 300,000 accounts/week. By number of accounts, they’re bigger than Schwab. I wonder if they’ll figure this out.

Looking back even further, the real problem with Crittercism wasn’t even a software problem. We were pretty good at selling things and passable at building them, but the real failure was even deeper: we never quite knew what we wanted to build. Were we going to be a user-facing helpdesk product like UserVoice or Zendesk? Maybe a machine performance analytics company like New Relic? Or a customer/user engagement platform like Mixpanel, Amplitude, or KISSmetrics? I’m not convinced we ever knew.

You never know 100% on day 1. But you have to figure it out, decide, and commit. Because if you don’t even know what you’re trying to build, you’re never going to get anywhere.
