DevProd

The Challenge of the Build Engineer

The role of the build engineer is undergoing a transformation. Build engineers are becoming the key to success for any software producing company. But this transition is not happening without a lot of pain.

Traditionally the responsibility of a build engineer was to create and pass on artifacts from development to production. More and more the focus is shifting to improving developer productivity and enabling fast and frequent shipping of software.

Companies that rely on software to run their business might find themselves at a competitive disadvantage (and even out of business) if they do not aggressively optimize developer productivity. But across the industry, developer productivity is much lower than what it should be.

Software stacks are becoming more complex and code bases are growing. Left unaddressed, this slows down your builds and makes build failures harder and harder to debug. At the same time, there is great pressure to become more agile and to release fast and often, which requires fast feedback cycles and thus fast builds. The resulting friction and frustration start to hurt the team. Why?

Over the past several years as founder and CEO of Gradle, I have met with hundreds of build teams. In this post, I want to discuss root causes of this pain — both organizational issues and how technology gaps work against the build engineers and entire development team.

Common organizational anti-patterns

Build teams are understaffed and considered second-class engineering

In many organizations build engineers are in a vulnerable position, without the resources to do a good job.

Traditionally, build engineering has been considered a less important engineering discipline than application development. The build team is often made a scapegoat for developer productivity problems, frequently for behavior that is not its responsibility.

Developers often change the behavior of the build and yet the build engineers are made responsible for the result. The build is slow — but did the build engineers introduce the language with a very slow compiler? Who wrote the slow tests and who publishes dependencies that always break downstream teams? Not the build team!

But in this case, the build engineers don’t have a strong standing organizationally and also don’t have sufficient data to explore and show what impacts the build and who is accountable. How the builds behave on local machines is a complete black box.

Build engineers don’t care about developer needs

In other organizations, build engineers are in a more influential position. They do not report to the development leadership but are part of the operations team. Instead of developer productivity, their primary interest is a stable build pipeline.

This sole focus on stability creates a risk-averse process and very slow rollout of anything that could impact the stability of the build — such as new features that make the build faster, new build technology, or code refactorings.

Such an environment decreases the productivity of developers and is deeply frustrating for them, often to the point that they start leaving the company.

Nobody is responsible for the build

Often you find that there is no clear responsibility for the build.

In larger organizations, you may find build teams that are responsible for the overall build infrastructure across all teams. They are also somewhat responsible for the individual team builds, while those teams are somewhat responsible for them as well. And as no one wants to have that responsibility, the only thing that happens is finger pointing, and the build deteriorates.

A variation of the above is that there is no centralized build team, but still, no one in the team is taking ownership of the build. In both cases, the build technology is often blamed for the sad state of the developer productivity. But builds are not a free lunch regardless of the technology you are using.

Many organizations are somewhere in between these extremes. But I haven’t seen an organization where there is not a good amount of friction between build engineers and developers. And often enough, the relationship becomes completely dysfunctional. All this costs an organization dearly in employee churn, real R&D dollars and time to market.

What would a healthy culture look like?

I think it is crucial that build engineers see it as their primary role to make the life of the developers easier and more productive. At the same time, we need a culture where the work of build engineers is appreciated.

In a healthy culture, developers and engineering leaders understand that the domain of building and integrating software is just as complex and challenging as the domain of application development.

In this healthy culture, developers are constructively helping the build engineer to achieve their goals. The huge economic impact of the builds is understood and this leads to proper resourcing.

The benefits for an organization are enormous. The most successful organizations build hundreds of thousands of times a day. Why so many builds? Their successful build process enables them to build in smaller increments, to ship more updates at a higher reliability and quality. This saves millions in lost R&D and opportunity cost.

What do we need to do to get there? The necessary cultural transformation requires better technology.

Build engineers need more insight

I have visited hundreds of build and development teams in the last few years. The most striking thing to me is always how little insight the build team has about how developers actually experience the build — and how the build affects developer productivity. Maybe for the CI build the data is slightly better, but it is still far from what it should be.

Being able to get quick answers to the questions below is essential for maintaining a high performing build:

  • How many local and CI builds are run every day? (almost every team I meet with underestimates this metric, often by an order of magnitude)
  • How often do the builds fail, and for what reasons?
  • What are the most common problems?
  • What are the longest running tasks?
  • What are the most frequently executed tasks?
  • Is the build behaving differently for different team locations?
  • How is the build behaving for people who work from home?
  • How good is the incrementality and cacheability of the build?
  • How often and when is the build unreliable?
  • Are developers using development tools effectively (e.g. avoiding a clean when it is not necessary)?
  • What is the economic benefit of investing in certain improvements?

Yet, only a very few companies are able to answer those questions.
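To make the questions above more concrete, here is a minimal sketch of how a few of them could be answered once a record exists for every build. The `BuildRecord` structure, its field names and the sample values are all hypothetical assumptions for illustration; they are not the data model of any particular product.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BuildRecord:
    """One captured build. Every field name here is a hypothetical assumption."""
    source: str                              # "local" or "ci"
    succeeded: bool
    failure_reason: Optional[str] = None
    task_seconds: dict = field(default_factory=dict)  # task name -> duration in seconds


def summarize(builds):
    """Answer a few of the questions above from a list of build records."""
    failures = [b for b in builds if not b.succeeded]

    total_seconds = defaultdict(float)   # total time spent per task
    executions = Counter()               # how often each task ran
    for build in builds:
        for task, seconds in build.task_seconds.items():
            total_seconds[task] += seconds
            executions[task] += 1

    return {
        "builds_per_source": dict(Counter(b.source for b in builds)),
        "failure_rate": len(failures) / len(builds) if builds else 0.0,
        "top_failure_reasons": Counter(b.failure_reason for b in failures).most_common(3),
        "tasks_by_total_time": sorted(
            total_seconds.items(), key=lambda item: item[1], reverse=True
        )[:3],
        "most_executed_tasks": executions.most_common(3),
    }


# Made-up sample data, purely for illustration.
sample_builds = [
    BuildRecord("local", True, task_seconds={"compileJava": 40.0, "test": 120.0}),
    BuildRecord("local", False, "test failure", {"compileJava": 35.0, "test": 90.0}),
    BuildRecord("ci", True, task_seconds={"compileJava": 20.0, "test": 60.0}),
    BuildRecord("ci", False, "out of memory", {"compileJava": 25.0}),
]

for question, answer in summarize(sample_builds).items():
    print(question, answer)
```

The point of the sketch is that none of these answers is hard to compute. The hard part is having the data from each and every build in the first place.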

Here is what I consider the requirements for efficient build analytics:

  • Data must be collected from each and every build, local as well as CI. This is the only way to efficiently deal with flaky and unreliable builds.
  • The data must be easily explorable without the need to dig through a variety of raw data sources.
  • Once your analytics point to a pathology, the data must be deep enough to guide you to the root cause without requiring you to reproduce the problem (which often requires cooperation from developers).

With such a service in place, build engineers can continuously, proactively and efficiently check for any regressions and improvement opportunities regarding build performance and reliability. It also enables them to teach developers better ways to use the build if they see wrong or inefficient usage.

Without it, it is risky to roll out any new build features and improvements. Every change can introduce regressions across the organization or for parts of the organization. If you cannot monitor the effect of such a rollout and have to wait until unhappy developers knock on your door, you’d better be very conservative.

At an insurance company, we collected build data over several weeks. They had numerous small Spring Boot projects. What surprised us was that there were local builds that took 20 minutes. We looked at the CI build time, which was only a few seconds. We then realized that the long build times came from people working from home and using a network drive for their build directory.

At a large SaaS company, we found some surprisingly long-running CI builds. They took more than an hour to finish. When we looked at the build data that we collected, we saw that for those builds 45 minutes was spent on garbage collection and that the JVM running Gradle needed more memory. This had been going on for years.

Those are examples of extreme inefficiencies that went unnoticed for a long time. But smaller inefficiencies are also impactful. For a typical team of several hundred developers, just one minute of improved build time — just for local builds — means literally millions of dollars in reduced waiting time.
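A back-of-the-envelope calculation shows how a single minute adds up. The sketch below is only an illustration: the team size, the number of local builds per day, the number of working days and the cost per developer minute are all assumed values, not measurements from any of the teams mentioned here.

```python
def annual_waiting_cost(
    developers: int,
    local_builds_per_developer_per_day: int,
    minutes_saved_per_build: float,
    working_days_per_year: int,
    cost_per_developer_minute: float,
) -> float:
    """Yearly cost of the waiting time that a faster build would remove."""
    minutes_saved_per_year = (
        developers
        * local_builds_per_developer_per_day
        * minutes_saved_per_build
        * working_days_per_year
    )
    return minutes_saved_per_year * cost_per_developer_minute


# All inputs below are illustrative assumptions.
cost = annual_waiting_cost(
    developers=300,
    local_builds_per_developer_per_day=20,
    minutes_saved_per_build=1.0,
    working_days_per_year=230,
    cost_per_developer_minute=1.50,
)
print(f"${cost:,.0f}")  # prints "$2,070,000"
```

Plug in your own numbers. The result is very sensitive to the number of builds per day, which is exactly the metric most teams underestimate.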

More insight enables more accountability

With better data and insights you can now create accountability. If developers are complaining about the slow build, you can finally show them that it is badly written unit tests or the way the dependencies are organized that makes the build slow.

Conversely, if developers feel that their problems regarding build performance and build reliability are ignored, they can use this data to make a strong point that this problem costs the organization dearly and needs to be fixed.

Stronger measurement allows the build team to show impact

Let’s face it — the work of the build team is traditionally undervalued, and the value they add often goes unnoticed.

Imagine you’ve made the build 20% faster, or — despite a massive growth of the codebase — there was no regression in build time. Who will notice?

The flip side of this is resourcing — your team requires more people or hardware to handle more complexity — how do you quickly make your case?

Having data that shows the value the build team adds is key to making investment decisions in this area. It is also a driver of positive morale and mutual appreciation, and a source of collective team pride.

Debugging and identifying build issues must be efficient

Investigating and collaborating on build issues is often a painful and inefficient nightmare. A typical workflow looks like this:

  1. Developer reports issue to build engineer by attaching the log file and trying to provide the necessary context.
  2. Build engineer tries unsuccessfully to reproduce and asks developer for more information including a more verbose log file.
  3. Developer runs the build again to produce the requested build log and attaches the new build log to the issue.
  4. Build engineer again tries to reproduce the problem and runs the build multiple times to find a setup to reproduce. Again this is not successful.
  5. Build engineer asks the developer to create a dependency report, a build environment report, a copy of the properties file and whatever else is needed to investigate further.
  6. Developer provides the additional information.
  7. Build engineer tries again to reproduce the problem.
  8. Potentially several more debugging rounds like the ones above follow until the root cause is identified.

This is why debugging a build issue requires hours of developer and build engineer time. For flaky or intermittent issues, the situation is even worse. You are often working with only anecdotal evidence, and your witnesses are people who are not experts in builds. It is already a challenge just to find out what exactly is happening, let alone the root cause of the problem.

As discussed above, what is needed is a build infrastructure that automatically captures a durable record of every single build that has been executed. Then nothing needs to be reproduced. Instead, you file an issue with a link to the build data.

Again, the harder and more painful it is to solve a build problem, the more conservative the build engineer will have to be before rolling out any changes and improvements. This is a high opportunity cost and the beginning of a downward spiral.

There is also a relationship component to that. If developers have problems with the build and know it will cost them a lot of their time to get the issue fixed, they live with the situation until it becomes unbearable. When they finally escalate to the build team, the issue is already very emotional and the cooperation starts off with a negative sentiment.

Unreliable builds also create problems that appear to be problems with the code. Let’s say a developer thinks something is wrong with the latest changes. After hours of investigation, they figure out that it was a false alarm by the build system (for example, because a stale file was not removed). This creates tremendous frustration. And in future debugging situations, they will always doubt the signal of the build, adding an uncertainty to debugging. This creates deep resentments. A key to reliable builds is to make it easy and efficient to debug build issues.

Build engineers should not be required for debugging problems with the code itself

When we say the build is broken or the build failed, it can mean two things.

Either something is wrong with the build itself (e.g. an out of memory exception when running the build). We talked about those kinds of failures above.

Or, more frequently, the build has failed for the very reason we have a build in the first place: detecting problems with the code. While many times the reason for the failure is obvious to the developer, often enough it is not.

In such cases, resolving the issue requires the help of the build team, although they don’t know the code. The developers, on the other hand, know the code, yet it is often too hard for them to debug the build because:

  • They are not trained in reading the log file of the build tool
  • They often don’t know how to get additional information from the build
  • On CI, they only have the build log, but no other data like dependency reports or build environment information.

For example, the local build or the build of a certain branch succeeded. Now the latest CI build, which includes your commit, is failing. What might be the reason for that?

  • Has CI used different dependencies?
  • Are different versions of a source code generator used?
  • Is there a different environment setup?

For a developer, it is extremely difficult or impossible to get an answer to such questions without the help of the build team. And for the build team, it is a lot of work to figure out the answers.
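If both the local build and the CI build had left a record behind, answering these questions would come down to a simple comparison. The following is a minimal sketch under that assumption. The snapshot format (a flat mapping of names to values) and all keys and versions shown are hypothetical.

```python
from typing import Optional


def diff_builds(local: dict, ci: dict) -> dict:
    """Return every entry whose value differs between two build snapshots.

    Each snapshot is a flat mapping such as {"java.version": "17.0.9"}.
    The result maps each differing key to a (local value, CI value) pair;
    a missing entry is reported as None.
    """
    differences = {}
    for key in sorted(local.keys() | ci.keys()):
        local_value: Optional[str] = local.get(key)
        ci_value: Optional[str] = ci.get(key)
        if local_value != ci_value:
            differences[key] = (local_value, ci_value)
    return differences


# Hypothetical snapshots of a passing local build and a failing CI build.
local_build = {
    "dependency:com.example:lib": "1.2.0",
    "plugin:codegen": "3.1",
    "java.version": "17.0.9",
    "os.name": "Mac OS X",
}
ci_build = {
    "dependency:com.example:lib": "1.3.0",
    "plugin:codegen": "3.1",
    "java.version": "17.0.9",
    "os.name": "Linux",
}

for name, (local_value, ci_value) in diff_builds(local_build, ci_build).items():
    print(f"{name}: local={local_value} ci={ci_value}")
```

In this made-up example the comparison immediately points at a changed dependency version and a different operating system, which is the kind of lead a developer needs in order to continue the investigation without the build team.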

In fact, the majority of the build engineers’ time is often spent on support. And excellent build engineers don’t want to provide support all the time. They want to create internal products that have an impact on the organization.

Developers need technology that allows them to explore and get answers to such questions easily by themselves. This will boost their productivity every single work day, and enable all developers — including the build engineer — to focus on what they love, which is building products.

Conclusion

In many industries, there are engineering disciplines dedicated to the practice of making production efficient and effective, such as chemical engineering and industrial or process engineering. They have their own degrees and are much respected within their broader field.

Build engineering is of similar importance when it comes to the production of software. The software industry is at the beginning of this realization. It is a very exciting time to work in this area!

Recognition of this practice will create healthier organizational relationships, and better technology is required to make this successful.

Organizations get tremendous benefits when they invest in their build infrastructure. My next article will be about the costs and returns on investments in builds.

Hans Dockter is founder and CEO of Gradle, the company behind the Gradle open source build tool and commercial products that help development teams accelerate software delivery. You can follow him on Twitter at @hans_d.