Monday, October 20, 2008

The engineering manager's lament

I was inspired to write The product manager's lament while meeting with a startup struggling to figure out what had gone wrong with their product development process. When a process is not working, there's usually somebody who feels the pain most acutely, and in that company it was the product manager. Last week, I found myself in a similar situation, but this time talking to the engineering manager. I thought I'd share a little bit about his story.

This engineering manager is a smart guy, and very experienced. He has a good team, and they've shipped a working product to many customers. He's working harder than ever, and so is his team. Yet, they feel like they are falling further and further behind with every release. The more features they ship, the more things that can go wrong. Systems seem to randomly fail, and as soon as they are fixed, a new one falls over.

Even worse, when it comes time to "fix it right" the team gets pushback from the business leaders, who want more features. If engineers want more time to spend making their old code more pretty, they are invited to do so on the weekends.

Unfortunately, the weekends are already taken, because features that used to ship on a Friday now routinely cause collateral damage somewhere else on the platform, and so the team is having to work nights and weekends cleaning up after each launch.

Every few months, the situation comes to a head, where the engineers finally scream "enough!" and force the whole company to accept a rewrite of some key system. The idea is that once we move to the new system (or coding standard, or API, or ...) then all the problems will be solved. The current code is spaghetti, but the new code will be elegant. Sometimes the engineers win the argument, and sometimes they are overruled. But it doesn't seem to matter. The rewrite seldom works, and even when it does, a few months later they are back in the same dilemma, finding their now-old code falling apart. It's become "legacy code" and part of the problem.

It's tempting to blame the business leaders ("MBA-types") for this mess. And in a mess this big, there is certainly blame to go around. But they are pushing for the things that matter to customers - features. And they are cognizant that their funding is limited, and if they don't find out which features are absolutely critical for their customers soon, they won't be able to survive. So they are legitimately suspicious when, instead of working on adding monetization to their product, some engineer wants to take a few weeks off to go polish code that is supposed to be already done.

What's wrong with this picture? And why is the engineering manager suffering so badly? I can relate to his experience all too well. When I was working my first programming jobs, I was introduced to the following maxim: "time, quality, money - pick two." That was the watchword of our profession, and I was taught to treat with disdain those "suits" who were constantly asking for all three. We treated them like they had some kind of brain defect. If they wanted a high-quality product done fast, why didn't they want to pay for it? And if money was so tight, why were they surprised when we cut corners to get the product out fast? Or went past the deadline to get it done right?

I really believed this mantra, for a time. But it started to smell bad. In one company, we had a QA team as large as our engineering team, dozens of people who worked all day every day to find and report bugs in our product. And this was a huge product, which took years to develop. It was constantly slipping, because we had a hard time adding new features while trying to fix all the bugs that the QA team kept finding. And yet, it was incredibly expensive to have all these QA testers on staff, too. I couldn't see that we were managing to pick even one. Other, more veteran programmers told me they had seen this in many companies too. They just assumed it was the way software companies worked.

Suffice to say, I no longer believe this.

In teams that follow the "pick two" agenda, which two has to be resolved via a power play. In companies with a strong engineering culture, the engineers pick quality. It's their professional pride on the line, after all. So they insist on having the final say on when a feature is "done" enough to show to customers. Business people may want to speed things up by spending more money, but enough people have read The Mythical Man-Month to know that doesn't work.

In teams that have a business culture, the MBAs pick time. After all, our startup is on a fixed budget. They set deadlines, schedules, and launch plans, and expect the engineering team to do what it takes to hit them. If quality suffers, that's just the way it is. Or, if they care a lot about quality, they will replace anyone who ships without quality. Unfortunately, threats work a lot better at incentivizing people to CYA than getting them to write quality software.


A situation where one faction "wins" at another's expense is seldom conducive to business success. As I evolved my thinking, I started to frame the problem this way: How can we devise a product development process that allows the business leaders to take responsibility for the outcome by making conscious trade-offs?

When I first encountered agile software techniques, in the form of extreme programming, I thought I had found the answer. I explained it to people this way: agile lets you make the trade-offs visible to the whole company, so that they can make informed choices. How? By shipping software early, you give them continuous feedback about how well it's working. They can use the software themselves, since every iteration produces working (if incomplete) code. And if they want to invest in higher quality, they can. But, if they want to invest in more experiments (features), they can do that too. But in neither case should they be surprised by the result. Sound good?

It didn't work. The business leaders I've run this system with ran into the same traps as I had in previous jobs. I had just passed the burden on to them. But of course they didn't feel responsible for the outcome - that was my job. So I wound up with the worst of both worlds: handing the steering wheel over to someone else, but then still being blamed for the bad results.

Even worse, agile wasn't really helping me ship higher quality software. We were using it to get features to market faster, and that was working well. But we were cutting corners in the development methodology as well as in the code, in the name of increased speed. But because we had to spend more and more time fixing things, we started slowing down, even as we tried to speed up. That's the same pain the engineering manager I met with was experiencing. As the situation deteriorates, he's got to work harder and harder just to keep the product from regressing.

It was my own failure to ship quality software in the early days of IMVU that really got me thinking about this problem in a new way. I now believe that the "pick two" concept is fundamentally flawed, and that lean startups can achieve all three simultaneously: quickly bring high-quality software to market at low cost. Here's why.

First of all, it's a myth that cutting corners saves time. When we ship software with defects, we wind up having to waste time later dealing with those defects. If each new feature contains a few recurring problems, then over time we'll become swamped with the overhead of fixing and won't be able to ship any new features.

So far, that sounds like just another argument for "doing things right" the first time, no matter how long they take. But that's problematic too. The biggest form of waste is building software nobody wants. The second biggest form of waste is fixing bugs in software nobody wants. If we defer fixing bugs in order to bring a product to market sooner, and this allows us to find out if we're on the right track or not, then it was worthwhile to ship with those bugs.

Here's how I've resolved the paradox in my own thinking. There are two kinds of bugs:
  • One kind are what I call defects: situations where the software doesn't behave in a predictable way. Examples: intermittently failing code, obfuscated code that's hard to use properly, code that's not under test coverage (and so you don't know what it does), bad failure handling, etc.

  • The second kind of bugs are the type your traditional QA tester will find: situations where the software doesn't do what the customer expects. For example, you click on a button labeled "Quit" and in response the software erases your hard drive. That's a bug. But if the software reliably erases your hard drive every time, that's not a defect.
The resolution to the paradox is to realize that only defects cause you future headaches, and cannot be deferred. That's why we need continuous integration and test-driven development. Whenever we add a feature or fix a bug, we need to make sure that code never goes bad, never mysteriously stops working. Those are the kinds of indefinite costs that make our team grind to a halt over time. Traditional bugs don't - you can choose to fix them or not, depending on what your team is trying to accomplish.
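
To make the distinction concrete, here is a minimal Python sketch. Everything in it is hypothetical and invented for illustration (the functions quit_reliably_wrong and quit_flaky, the helper is_deterministic, and the list standing in for a hard drive); it is one way to picture the difference, not a description of any real codebase.

```python
import random

def quit_reliably_wrong(disk):
    """A 'bug': does the wrong thing, but does it every single time."""
    disk.clear()  # the customer expected the app to quit, not wipe data
    return "erased"

def quit_flaky(disk, rng):
    """A 'defect': the outcome depends on something the caller can't see."""
    if rng.random() < 0.5:
        disk.clear()
        return "erased"
    return "quit"

def is_deterministic(action, runs=50):
    """Run the action repeatedly on fresh input; True if every run agrees."""
    outcomes = {action(["file_a", "file_b"]) for _ in range(runs)}
    return len(outcomes) == 1

rng = random.Random(1234)  # hidden state standing in for timing, load, etc.
bug_is_predictable = is_deterministic(quit_reliably_wrong)
defect_is_predictable = is_deterministic(lambda disk: quit_flaky(disk, rng))
```

The first function can be pinned by a test exactly as it is, and the team can decide later whether the behavior is worth changing. The second cannot be pinned at all until the hidden source of variation is removed, which is why it can't be deferred.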

Defects are what make refactoring difficult. When you improve code, you always test it at least a little bit. If what you see is what will ultimately make it into production, it's pretty easy to make sure you did a good job. But code that is riddled with defects is a chameleon - one moment it works, the next it doesn't anymore. That leads to fear of refactoring, which leads to spaghetti code.

By shipping code without defects, the team actually speeds up over time. Why? Because we never have to revisit old code to see why it stopped working. We can add new team members, and they can get started right away (as an aside, new engineers at IMVU were always required to ship something to customers on their first or second day). And the whole team is getting smarter and learning, so our effectiveness increases. Plus, we get the benefit of code reuse, and all the great libraries and tools we've built. Every iteration, we get a little more done.

So how can I help the engineering manager in pain? Here's my diagnosis of his problem:
  • He has some automated tests, but his team doesn't have a continuous integration server or practice TDD. Hence, the tests tend to go stale, or are themselves intermittent.
  • No amount of fixing is making any difference, because the fixes aren't pinned in place by tests, so they get dwarfed by the new defects being introduced with new features. It's a treadmill situation - they have to run faster and faster just to stay at the level of quality/features they're at today.
  • The team can't get permission from the business leaders to get "extra time" for fixing. This is because the engineers are constantly telling them that features are done as soon as they can see them in the product. Because there are no tests for new features (or operational alerts for the production code), the code that supports those new features could go bad at any moment. If the business leaders were told "this feature is done, but only for an indeterminate amount of time, after which it may stop working suddenly" they would not be so eager to move on to the next new feature.
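
As a rough illustration of what "pinned in place by tests" can look like, here is a small sketch using Python's standard unittest module. The function parse_quantity and the bug it once had are hypothetical, made up for this example.

```python
import unittest

def parse_quantity(text):
    """Hypothetical function that once crashed on blank input; the fix returns 0."""
    text = text.strip()
    return int(text) if text else 0

class ParseQuantityRegression(unittest.TestCase):
    def test_blank_input_returns_zero(self):
        # Pins the fix: if the crash is ever reintroduced, this test fails.
        self.assertEqual(parse_quantity("   "), 0)

    def test_normal_input_still_parses(self):
        # Guards the behavior that already worked before the fix.
        self.assertEqual(parse_quantity(" 42 "), 42)

# Run the two tests programmatically so the result can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseQuantityRegression)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Once a test like this runs on every checkin, the fix stays fixed, and a later feature that breaks it is caught at checkin rather than after launch.
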
Here's what I've counseled him to try:
  • Get started with continuous integration. Start with just one test, a good one that runs reliably, and make sure it gets run on every checkin.
  • Tie the continuous integration server in with source control, so that nobody can check in while the tests are failing.
  • Practice Five Whys to get to the root cause of future problems. Use those opportunities to add tests or alerts that would have prevented that problem. Make the investment proportional to the problem caused, so that everyone (even the business leaders) feels good about it.
My prediction is that these three practices will quickly change the feeling around the office, because the most important code will wind up under test quite soon (after all, it's the code that people care the most about, and so when it fails, the team notices and fixes it right away). With the most important code under test, the level of panic when something goes wrong will start to decrease (because it will tend to be in less important parts of the product). And as the tension goes down, it will be easier to get the whole team (including the MBAs) to embrace TDD and other good practices as further refinements.
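
For a sense of how small the first step can be, here is a minimal sketch of a checkin gate in Python. It is a local stand-in, not the continuous integration server described above: it assumes the tests can be run with the standard library's unittest discovery, and the names checkin_allowed and main are invented for this example.

```python
import subprocess
import sys

def checkin_allowed(test_command):
    """Return True only if the test command exits with status 0."""
    result = subprocess.run(test_command, capture_output=True, text=True)
    return result.returncode == 0

def main():
    # Assumed test runner: unittest discovery from the current directory.
    command = [sys.executable, "-m", "unittest", "discover"]
    if checkin_allowed(command):
        return 0
    print("Tests are failing; fix the build before checking in.", file=sys.stderr)
    return 1
```

A pre-commit hook (or whatever mechanism the source control system offers) could call sys.exit(main()), so that a non-zero exit status blocks the checkin. A real CI server does the same thing centrally, for everyone at once.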

Good luck, engineering manager. May your team, one day soon, refactor with pride.

22 comments:

  1. Laughed out loud at this post, because it is so applicable to what I'm doing in my new gig.

    Our team's goal right now is increasing our system's scalability. What is the first thing I'm having the team do? Write and automate tests to reduce the defects preventing us from getting a successful load test run on each attempt. Why? Our quality and engineering services folks were burning more hours patching over intermittent failures in our load test environment than our development engineers were improving the scalability.

    It can be a little awkward to ask a team used to code-and-fix (and-fix, and-fix) to stop muscling out code and instead stop and fix the defects preventing success. But, it is so worth it.

  2. Great stuff! A couple thoughts:

    ONE

    Time, quality, and money are three dimensions.

    Scope is the fourth. Everyone forgets about scope. Reducing scope lets you deliver on the first three dimensions. Not fixing bugs is a type of scope reduction but there are many other ways to reduce scope.

    Reducing scope is the theoretical basis for the conventional wisdom that startups should do "one thing really well."

    TWO

    See page 199 of Poppendieck and Poppendieck's new(er) book for a finer analysis of defects and bugs:

    Unit tests prove that the code does what the developers want it to do, i.e. there are no defects.

    Acceptance tests determine that the code does what the customer wants it to do, i.e. the features are sufficient.

    Exploratory tests (traditional QA) determine that the code doesn't do what the customer and engineers don't want it to do, i.e. there are no bugs.

  3. Eric: you say "new engineers at IMVU were always required to ship something to customers on their first or second day".

    This sounds a little crazy to me. Would you recommend this practice for other companies?

  4. @nathan, I know it sounds crazy, and I hope to talk about it in detail in a future post. I do recommend it, but not for its own sake. It requires two things:

    1. a completely automated and easy to setup development environment. Debugging problems with the dev environment itself is a common form of waste that this practice eliminates.

    2. a robust set of defense mechanisms against a bad change making it to production. A new employee is more likely to make a mistake than an experienced one, but everyone makes mistakes. If you have confidence that your tools make it easy to understand the impact of a change, make it easy to prevent a bad change from making it to production, and make it easy to recover from a bad change that gets through your defenses - then how much damage can a new employee cause?

    Of course, we wouldn't have a new employee write a huge new feature on their first day - usually we'd start with a simple cosmetic change or bug fix. But that kept the pressure on us to make sure we were keeping our tools simple and easy to use.

  5. There is one assumption in your (in general well thought-out) post that is problematic.

    I would posit that what matters to customers is NOT features -- and that in fact this assumption is a large cause of much of the pain you're describing.

  6. Thanks for the post. I was a manager for 7+ years.

    In my view, it's not time, quality, or money. It's always: fixed time, chase quality, and money is tight.

    My solution has been to change the question to: time or features. Quality is never negotiable to me.

    That's a more realistic approach. A deadline is either rock solid or it is flexible. When you are hitting the limit, it's best to prioritize features and focus on the most important ones.

  7. @chris - I couldn't agree more. Anything that doesn't contribute to figuring out what customers want is waste, even if it's a cool feature.

  8. @larry - well said. I think where we get into trouble is we have many definitions of quality. If we take the broadest possible interpretation, and we insist on letting engineers unilaterally decide that they can't ship without "quality," we could get in trouble.

  9. *All* of the factors involved in delivering software are affected by starting with baseline requirements that are constrained to what the system should do and the process artifacts desired (say 5 forms reading and writing a database store).

    Too often developers don't start with requirements per se but fuzzy wish lists that creep specifications into *how* to implement the thing.

    Now that changes the relationship of what is actually required to what is desired in a business fantasy fest.

    All the commitment of time, money, and quality angst can never overcome the fishing expedition effect of bad requirements.

    Excepting for cosmically co-incidental success stories, the fuzzy requirement stuff never congeals as a holistic engineering exercise. More likely these systems resemble a pastiche of Rube Goldberg components that are never quite logically symmetric.

    The choices you make are inconsequential unless you start with a clear idea of sufficiency.

  10. Software Eng. here who spent a couple years in Eng. Mgt. then five years in Product Mgt. I also have to include this caveat: we weren't purely a software company - we built telecom network devices, hw and sw.

    Keeping in mind that caveat, in our world the biggest factor in total project ROI was time to market. Also, our stuff was pretty much useless to our customers if it wasn't full featured AND reliable. Of course, the sales folks had new features as their #1 priority. Top management's #1 was usually ROI. Engineering's #1 of course, was quality. What a freaking mess. Everybody was constantly mad at everybody and finger pointing took both hands. I spent more money on gin...

    I/we came up with various "fixes" but nothing worked, not for long anyway. There was an underlying problem which we weren't addressing. I think your solution misses the underlying problem as well.

    We ended up rethinking our entire product development process, top to bottom, start to finish. It was painful for a lot of people. But the results were excellent - customer satisfaction, sales, profits, quality, attitudes all improved dramatically.

    The major change was never starting any project without sales, marketing, manufacturing, field support, purchasing, etc. ALL signing off on the (detailed, specific, traceable and testable) requirements and schedule. For many projects, customer approval and sign off was required as well. Second was not changing the requirements without full sign-off.

    Those two process improvements alone made a huge difference. Most of the other process changes, such as mandatory design reviews (preliminary, critical, etc.) and documenting all our procedures, were to support those two factors.

    Summing up my $0.03, eliminating the "silo effect" where people can throw things over the wall by bringing people from every business function into the development process from the start is the way to permanently address the underlying problem.

  11. I'm keen on the two-kinds-of-bugs thing. It might be more precise to categorize them as two kinds of flaws: flaws in implementation, and flaws in design. Implementation flaws are "defects" as you define them, and I agree that you really don't want any defects. Flaws in design are a different story, and shouldn't be treated the same way; a user can live with software that doesn't have one of the features advertised, but he can't live with software where a feature sometimes does one thing, and sometimes another.

  12. @peter - on the silo issue, you might want to take a look at the companion piece to this one, The product manager's lament. I strongly believe in building cross-functional teams.

    One thing I will add, though, that is unique to the startup experience: writing a detailed spec is usually impossible. The reason is that, unlike your traditional agile situation, you don't know what problem you're trying to solve. All you have is a set of hypotheses (which should be written down). Each iteration should help you validate or refute some of those hypotheses. But the product at the end of several iterations generally bears little resemblance to what was originally "required."

  13. The time, money and quality dimensions are referred to, in the project management world, as the triple constraints. Any serious PM has to have a healthy and practical understanding of this trio.

    In PM land, we call them Work (money), Duration (time), and Utilization (scope). Some people refer to scope as effectiveness, units, availability, or quality.

    I think this blog clearly shows that having a PM that can influence development and management teams can have a tremendous positive impact on the project's success and ultimately the company as well.

  14. @Eric On Startups:

    To me the problem is the concept of *detailed* specs. In far too many companies, massaging the same frikkin' paragraph for a month is considered requirements gathering.

    The only details an engineer might need is: Create a form with these fields, a, b, c.

    a is name, b is address, and c is tax percentage.

    Instead the engineer gets a twenty page document that took six months to cut and paste together containing the signatures of PMs, VPs, foreign dignitaries and gatekeeper of the treasury.

    And the 20 page thingy is more confusing than the sentence of what's sufficient.

    @situmam On triple constraints.

    You're mixing up development concepts with project management concepts. These pairs of triplets are orthogonal not alike.

    You are confusing the label for the thing.

    For example, a PM time unit is far different from a development one. A project usually has an absolute duration and budget whereas the time and money dedicated to development within the project is where the tradeoffs are made.

    And projects don't have quality per se. They have deliverables that have quality constraints.

    The PM's relationship to quality is a step removed from the development team that is directly affected by the quality every day.

  15. @Caretaker, this point of yours makes a lot of sense:
    Too often developers don't start with requirements per se but fuzzy wish lists that creep specifications into *how* to implement the thing.

    I often see PMs walking up to engineers saying "let us add another column to this db that will help me get info on foobar" and engineers saying "that is a multi-valued attribute and so we need to create a new table with references". Granted that this is an extreme case, but a description of the requirements purely in terms of implementation causes a lot of trouble.

  16. Philip B. Crosby made this argument in his book "Quality Is Free: The Art of Making Quality Certain" (Mentor, 1980)

  17. Although I haven't read Crosby's book and I don't doubt the veracity of his insights, I think the problem is far more complex these days.

    In 1980, the transformation of business was largely a transition from manual processes to rudimentary computing solutions. This was the age of Alvin Toffler's 'Future Shock'. Computing was a closed society.

    Today, the transformations are often from migrating all those rudimentary legacy systems into a technological windtunnel described by Ray Kurzweil in 'The Singularity is Near'.

    It is not uncommon to be working on a project with a fixed budget of three to six months, a development window of 2-5 months, and ZERO wiggle room.

    Indubitably, the project will be saddled with a set of business analysts whose richest experience with an advanced system has been to browse some web pages or order something online. *They* will consume, in one form of obfuscation or sheer ignorance, 90% of the time, resources, patience, and goodwill toward getting the project done. In their wake will be a tortured/retarded design, techno-tribal one-upsmanship, and a desperate to-hell-with-any-and-all-best-practices, get-the-damned-thing-out-the-door stampede.

    One need look no further than the hedge-fund calamity as a shining example of systems and shysters presumably generating enormous wealth. I would ask fellow professionals what these systems were doing to make this kind of thing possible - no one had any idea except to believe that these systems were so sophisticated at playing/manipulating the market that they were the ones consuming all or most of Wall Street's magic profits. In other words, these systems were built fast and cheap to establish a set of smoke and mirrors no one would ever be able to penetrate quickly enough to realize or abate the damage. And no one got paid who asked questions - outsourcing provided a perfect petri dish for complacency, conformity, and duplicity.

    In that stew of seemingly limitless wealth creation that no one dared question, a cult of 'extreme' and 'agile' practices flourished that at face value asserted that businesses "didn't need any stinkin" design more complex than that that immediately satisfied the gratification urge.

    IMO, this sharp right turn in our industry toward compromising the integrity of process [where quality matters] toward eliminating the cost of integrity has both incapacitated project development AND created and enabled a shadow profession to corrupt the profession.

    Our profession is in trouble. It is as if the staffing and management of projects is run by international syndicates more interested in controlling job territories than in assembling vital, dynamic teams who are empowered to succeed.

    - krasicki

  18. @Eric,

    Is this post about Architecture? I define Architecture (capital A) as "the advocacy for human values in a process" (not little a architecture, over-arching structural approach, just design-in-the-large).

    Cost and time are effectively absolutes (The Caretaker's high finance shenanigans and 20th century Physics aside). Quality is an abstraction of value, and value of service (or changes) is situated in human contexts. Value of commodities is not situated in a human context (perhaps this is the source of the confusion) - the market determines a price for Grade A rice vs. Grade B rice (or whatever) just as efficiently as it determines a price between Grade A rice and Gold - but quality/value of services are effectively orthogonal to cost and time, because different contexts have different value drivers.

    Actually, "the market" and "commodities" are just a way of saying "shared context", but that's just semantics...

    The story of the engineering manager is one of an Architectural challenge - positive value expectations in one context (e.g. new features please, suit) are conflicting limiting factors in another (e.g. less pain please, scruff). Most often conflicting value drivers are resolved with a balanced compromise decision (either deliberately, or via some economic method). E.g. as many features as possible and as much pain as is bearable.

    But sometimes an Architect finds a way to respond to challenge by redefining the process (re-aligning it to values) to fit the humans; instead of trading-off values against each other. That's value-creating, it's why they get to wear bow ties.

  19. @The Caretaker, there is no such thing as more (or less) complex; there is complex, and there is not complex.

    I'm sure we all agree the technology landscape has changed in the last 30 years, but are you blaming agile development for the global financial crisis? wow.

  20. Excellent post Eric. I'm a die hard believer in TDD. Sometimes I find myself working on a project where unit tests weren't written let alone in a TDD fashion. The resulting code is such that it's very hard to add in tests. It seems that a project can get to a point where the "technical debt" is so high that the project is bankrupt meaning it would be "cheaper" to rewrite than try to add tests while you are chasing bugs. Of course it's hard to completely throw away code the company has invested in. I see big companies throw code away all the time but it seems harder for smaller companies to do it.

    Any advice on how the decision to rewrite may change for lean startups?

  21. @anonymous

    We do not live in a black and white world.
    Complexity is a rich set of understandings that need to be assimilated in part and whole. The intellectual capacity of the beholder introduces granularities of complexity.

    Secondly, I don't think I blamed the economic meltdown on agile development. OTOH, I do think there's a connection between macro business process requirements being so myopically and expediently defined at a granular level that the resulting macro process ignores both the risk and the consequence of such lightweight analysis.

    After all, the entangled nightmare we watched implode was a story that should have been factored in, no?
