Avoiding gigantic changesets when refactoring

This may seem obvious to many folks, but I think it’ll be very illuminating to those who haven’t had to do something like this before.

The starting point and a naive solution

I’m currently working on a cleanup in our codebase at ZipRecruiter, where we want to move dependencies shared between two parts of the codebase into a common area. The part I’m working on now is well-defined, but touches a very large number of modules throughout the codebase.

If I chose a naive way of doing the port, I’d do the following:

  1. Port the functions to be moved to a new module in the common area and ensure they’re all under test.
  2. Go through the rest of the codebase, find all references to the ported functions, and update the modules to use the new module instead of the old (an example of such an update follows this list).
  3. Remove the functions from the old module now that all the other code has been ported.
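
To make step 2 concrete, here is a hedged sketch of what the update looks like in a single dependent module; Some::Dependent::Module, Old::Utility, Common::Utility, and normalize_thing() are invented names for illustration, not the actual ZipRecruiter modules.

package Some::Dependent::Module;
use strict;
use warnings;

# Before the port, this module imported from the old utility module:
#   use Old::Utility qw(normalize_thing);
# After the port, only the import line changes:
use Common::Utility qw(normalize_thing);

sub do_work {
    my ($thing) = @_;
    return normalize_thing($thing);    # call sites are unchanged
}

1;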

If the functions in the old module aren’t used much, then step 2 is fine as is, but this is a utility module that’s used in a lot of modules. This makes step 2 a gigantic risk point because the changeset for a large number of modules essentially has to be done all at once, whether we do it as one big commit or a lot of small ones that still have to be applied at the same time.

Doing these all at once is going to be a code review and QA nightmare, so this isn’t going to fly.  Unfortunately, I started on the port and didn’t come to this understanding until I’d gotten pretty far into the changes. I needed to save the work I’d done, but reuse it in a way that was safe and not an unbearable load for my reviewers and QA.

Saving the work and reducing the risk

I had a branch with the porting “done” (new module added, old module removed, changes made in the dependent modules, as far as I knew). I had made the changes incrementally, one function at a time across the codebase, as opposed to “port every function in this module, then go to the next”, so I had put myself in the position where this was an all-or-nothing change.

My first idea to reduce the change size was this:

  1. Add the ported and tested functions.
  2. On a new branch, cherry-pick the changes per module to make per-module commits to port from the old module to the new.
  3. Add a commit at the end that removes the functions from the old module.

The ported modules can now be done one at a time on whatever schedule we want, so we’ve fixed the risk of “gigantic unreviewable and untestable commit”, but the last step still has the potential of breaking things we missed.

In the process of writing this strategy down, I came up with a far better option. We need to decrease the risk of the removal of the old-location module. A little thinking came up with a new strategy:

  1. On a new branch, port the commits that create the new module with everything under test.
  2. (New step) Stub out the functions in the old location to just call the functions in the new location and verify that the old module still passes its tests, adding any new ones needed. (A sketch of such a stub follows this list.)
  3. Cherry-pick the changes per module to make per-module commits to port from the old module to the new, and deploy these to master at whatever granularity seems appropriate.
  4. When we think this is done, add a warning to the functions in the old location to say “function_x() in old_location should be ported to new_location”.
  5. Port anything else that’s throwing a warning.
  6. Finally remove the stubbed functions from the old location.
  7. Monitor for errors calling the functions and fix anything we’ve missed.
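
Here is a hedged sketch of what steps 2 and 4 might look like for a single function; OldLocation, NewLocation, and function_x() are stand-in names rather than the real modules.

package OldLocation;
use strict;
use warnings;
use NewLocation ();    # the new common-area module (invented name)

# Step 2: the old function becomes a thin stub that delegates to the new
# location. The old module's existing tests still exercise this entry point.
sub function_x {
    # Step 4, added later: tell callers this function should be ported.
    warn "function_x() in OldLocation should be ported to NewLocation\n";
    return NewLocation::function_x(@_);
}

1;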

We now have a situation that’s decoupled: we can get the new module in place without leaving duplicate implementations behind. Code that needs to be separated can be ported to the new module on whatever schedule we like. Step 3 is no longer time-constrained or size-constrained; the porting work already done can be reused later, and it can land in increments of whatever size seems appropriate. Steps 4 and beyond can be delayed as long as we like.

I started working on getting the per-module changes separated out into their own commits, and this is where I hit the first problem. A pure git cherry-pick won’t work for this. The standard process is to cherry-pick the commits we want with the -n flag so they’re not automatically committed, git reset, and then git add -p to stage just the changes we want; but then we can’t cherry-pick the same commits again to build a set of changes for the next module. A quick Google search showed me that I could instead use git show SHA file.txt | git apply - to extract and apply the changes to a specific file from a single commit. Repeating this for every commit I’d been cherry-picking gave me the combined changes for that file across all the commits in a single changeset, which could then be saved as one commit.

Scripting the breakup of the commits

I had a large number of files to do this for, so obviously the right thing was to create a script to do this work:

#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;
use IO::Prompter;

$| = 1;

# Reapply a set of SHAs in the proper order, file by file.
# This lets us create a set of commits that have all the changes
# in the commit_shas, but break it up into sections. Each blank
# line denotes a section; all the files in each section are
# updated together.
# The script stops after each section has been applied and allows us to
# verify the update and commit it in another window.
# Use this if you've built a large commit and want to sort it out into
# more coherent commits.

my @files_to_redo = split "\n", <<EOS;
path/to/a/module
path/to/a/related/module

path/to/another/module

path/to/a/third
EOS

my @commit_shas   = qw(
SHA1
SHA2
...
SHAn
);

while (defined (my $file = shift @files_to_redo)) {
  if (!$file) {
    prompt "(press enter to continue with next set) ";
    next;
  }

  # Have a file, add the commits for it
  for my $commit (@commit_shas) {
    my $patch = `git show $commit $file`;
    next unless $patch;
    print "Adding commit $commit for $file\n";
    open my $apply, "| git apply -" or die "Can't apply patch: $!\n";
    print $apply $patch;
    close $apply;
  }
}
print "All files updated. Do the final commit!\n";

We use git log on the branch where we’re doing the work to figure out the set of commits (our former cherry-pick commits) that make up the full set of changes, and git diff against the commit just before them to get the list of files they changed.
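
As a hedged sketch (not part of the original workflow), a tiny helper like the following could print both lists for pasting into the script; BASE is a placeholder for the commit just before the porting work began.

#!/usr/bin/env perl
use strict;
use warnings;

# Print the commit SHAs (oldest first) and the files they changed, so they
# can be pasted into @commit_shas and @files_to_redo above.
my $base  = 'BASE';    # hypothetical: the commit just before the ported work
my @shas  = split "\n", `git log --reverse --format=%H $base..HEAD`;
my @files = split "\n", `git diff --name-only $base HEAD`;
print "$_\n" for @shas, '', @files;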

Plug those into the appropriate arrays above, rearrange the files in the @files_to_redo array into whatever groups seem appropriate, separated by blank lines, and run the script. It will combine the per-file changes across the old commits into new commits, stopping at each blank line to allow you to inspect the changes and commit them.

I chose that approach because I wanted to see what was and wasn’t under test, so that I could go back and add tests to these commits later. Since they’re just being collected right now, we don’t have to declare that the current state of a commit is its final, to-be-pushed-to-master form. We can note which commits need tests in the commit messages, and when we cherry-pick them onto master at a later date, we can add the tests and use git commit --amend to fold them in.

Finishing up the work

All right, so we’ve got our per-file commits, and we have a commit adding the new module we’ve refactored the functions to, plus its associated tests. On a new branch, cherry-pick that single commit that creates the new module. We’ve now got the functions in both the old and new locations on the new branch.

Now we create a set of changes that stub out the functions in the old location so that each one simply calls its counterpart in the new location and does nothing else. The functions in the new location already pass their tests, and the tests in the old location verify that the stubs work. I chose to do these one at a time because it’s simpler and safer that way, and each individual commit is small and easy to verify. These could be squashed later, but I’m leaving them as individual commits; the tool we have at ZR to land changes on master will apply the whole branch as a single merge commit, so leaving each step of the port as its own commit makes it easier to review.

Now that everything is stubbed and tested in the old location, this change can be landed as a regression-only change. No code outside of the stubbed module has been touched, and the tests for the stubbed module verify that we haven’t broken it (it may be necessary to backport some tests from the new module to the old if functions weren’t adequately covered, but that can go in the stubbing commit).

We now have a safe state from which to proceed. Code that will use only the new module and its functions can be refactored and separated from code that depends on the old module, and we can take whatever time we need to land the large number of saved changes we created via the script, porting the other modules that depend on the old one. Since they’re single commits that we git cherry-pick onto master, we can do any conflict resolution at the time we cherry-pick them.

Summary

In summary, this was a situation where I started down the wrong track, but because of the way git stores changes, it was possible to go back and reorder them into a safer and saner engineering process.

I would like to specifically pause here and say thank you to Rachel Thompson (@forrachloop on Twitter), who quite late in my career taught me how to think about big changes in a way that was kind to everyone else. Thanks again, Rachel!
