Characterization testing – refactoring legacy code with confidence

You probably know the feeling.

You’re staring at a bunch of unintelligible legacy code with enormous cyclomatic complexity, 5kLOC methods and no tests whatsoever, and now you need to make a change without completely breaking it. So a massive refactoring is needed to make sense of the mess before you can even think about making the change. But how do you do that safely, without losing sleep?

Enter characterization testing – also known as golden master or approval testing.

The idea is simple. Let p be a program that consumes input x and produces output y: y = p(x).

With the code in such a mess and the original developers long gone, we can’t figure out exactly how p should work, so we cannot write test assertions like “p(3) must equal 5” off the top of our heads. However, we can observe what the output y is for any given input x by simply running the program. We can then record a set of pairs (x, y) and write an automated test that checks that y == p(x) holds for each pair. Provided that enough of the input space (x) is covered, this test effectively checks whether the original behavior of p is preserved – whatever it is, including bugs that famously have become features! This is the key to safe refactoring.

Note that the preservation of non-functional characteristics like time/space complexity cannot be tested this way.
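The record-and-verify idea can be sketched in a few lines of TypeScript. This is only an illustration under stated assumptions: p here is a made-up stand-in for an opaque legacy routine, and record, mismatches, refactored and broken are hypothetical names introduced for the example.

```typescript
// Hypothetical stand-in for an opaque legacy routine; in practice p would be
// the code we are about to refactor.
const p = (x: number): number => (x % 3 === 0 ? x * 2 : x - 1);

type Pair = { x: number; y: number };

// Recording phase: run p once per chosen input and keep the observed output.
const record = (inputs: number[]): Pair[] =>
  inputs.map((x) => ({ x, y: p(x) }));

// Verification phase: return the pairs whose recorded output a candidate
// implementation no longer reproduces.
const mismatches = (candidate: (x: number) => number, pairs: Pair[]): Pair[] =>
  pairs.filter(({ x, y }) => candidate(x) !== y);

const golden = record([0, 1, 2, 3, 4, 5, 6]);

// A behavior-preserving rewrite of p (hypothetical).
const refactored = (x: number): number => {
  if (x % 3 === 0) return x + x;
  return x - 1;
};

// A rewrite that silently changes behavior for inputs not divisible by 3.
const broken = (x: number): number => (x % 3 === 0 ? x * 2 : x);
```

Calling mismatches(refactored, golden) yields an empty list, while mismatches(broken, golden) lists every pair whose behavior changed, which is exactly the signal we want while refactoring.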

But a couple of questions arise. What if p is nondeterministic, i.e. running p(x) repeatedly with the same x returns different results? And since there is an infinite number of inputs to test, how can we ensure that enough of the input space is covered?

Fortunately these issues can be sorted out.

Preparations – some determinism

The first step is to make sure that the code we’re about to characterize can be made deterministic: within the test, the same inputs should produce the same outputs every time. We can do this by making the parts of the code that could produce time- or state-dependent results replaceable with test doubles. Examples: external resource access, random number generators, system clock access, etc. All this needs to be achieved with minimal, safe changes (like extract method and rename) and without changing the existing interface.

Start by identifying the problematic dependencies in the code and making them injectable. There are many ways of doing this without changing the interface. One neat trick in object-oriented languages is to refer to a dependency indirectly through a method instead of through a direct reference, and then to derive a testable version of the class in the test, overriding these “seam” methods to return test doubles.

Let’s have a look at a (somewhat synthetic) example. There seem to be two dependencies that cause nondeterministic behavior: System.currentTimeMillis(), and Dependency, which after a quick look at the source seems to access an external resource:

public class ExampleClass {
    private Dependency dependency;
    public String method(String arg) {
        if(arg.isBlank()) {
            return dependency.aCall(System.currentTimeMillis());
        } else {
            return dependency.anotherCall(arg);
        }
    }
}

We’ll extract the access of these dependencies into methods:

public class ExampleClass {
   private Dependency dependency;
   protected Dependency dependency() { return dependency; }
   protected long currentTimeMillis() { return System.currentTimeMillis(); }
   public String method(String arg) {
       if(arg.isBlank()) {
           return dependency().aCall(currentTimeMillis());
       } else {
           return dependency().anotherCall(arg);
       }
   }
}

Now we can do this in the test:

public class ATest {
   class TestableExampleClass extends ExampleClass {
       @Override
       protected Dependency dependency() {
           return new FakeDependency(withSomeTestSetup);
       }
       @Override
       protected long currentTimeMillis() {
           return 42;
       }
   }
   @Test
   void test() {
        var sut = new TestableExampleClass();
        assertEquals("bar", sut.method("foo"));
   }
}

This approach is best suited to cases where the dependencies are numerous and used all around the class. If the dependency is used in only one method, a simpler alternative is to create a wrapper with the original interface and to rename or overload the original as an injectable version:

public class SomeClass {
    private Dependency dependency;
    public String method(String arg) {
        return methodImpl(arg, this.dependency);
    }
    public String methodImpl(String arg, Dependency dep) {
        return dep.anExampleCall(arg);
    }
}

Observing output and recording the results

Write a test that sets up the dependencies and calls the code with some wisely chosen input values – look at the code under test for hints about which values to try first. Values returned from dependencies count as inputs too. The test should assert the output against some “empty” expected value like zero or an empty string; which one does not matter for now.

The goal here is to run the test (expect it to fail), observe the actual result from the log and copy it into the expected value of the assertion. Note that exceptions are outputs too, so you might need to assert them using assertThrows or similar! At first, it may sound weird to use the actual value as the expected one, but this is how the existing behavior gets recorded into the test, in the form of input-output pairs, for later runs.
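One framework-independent way to treat exceptions as outputs is to fold both outcomes into a single recordable value. The following is a minimal sketch, not a prescribed technique: Outcome, observe and legacyParse are hypothetical names, and legacyParse merely stands in for legacy code that sometimes throws.

```typescript
// An outcome is either a returned value or the message of a thrown error,
// so both can be recorded and compared in the same way.
type Outcome<T> =
  | { kind: "returned"; value: T }
  | { kind: "threw"; message: string };

// Run the code under test and capture whatever happens as plain data.
const observe = <T>(run: () => T): Outcome<T> => {
  try {
    return { kind: "returned", value: run() };
  } catch (e) {
    return { kind: "threw", message: e instanceof Error ? e.message : String(e) };
  }
};

// Hypothetical legacy function that throws for one class of inputs.
const legacyParse = (s: string): number => {
  if (s === "") throw new Error("empty input");
  return s.length * 2;
};

const normalCase = observe(() => legacyParse("abc"));
const errorCase = observe(() => legacyParse(""));
```

With this shape, the first run of a test can assert against an empty placeholder, and the observed Outcome (returned or threw) is then pasted in as the expected value.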

Depending on the language, it might be useful to convert object-valued outputs to strings for easier comparison. Be careful, though. For example, in Java, toString is often (incorrectly) used in the application logic, and it may not represent the complete state of the object. Also note that changing a toString implementation may break code elsewhere!
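One way to sidestep these pitfalls is to keep the string conversion on the test side instead of touching the production code. As a sketch in TypeScript (canonical is a hypothetical helper, and it assumes the outputs are plain JSON-compatible data), the standard JSON.stringify replacer hook can be used to sort object keys, so that two structurally equal outputs always produce the same text:

```typescript
// Test-side serializer: converts an output to a stable, diff-friendly string.
// Object keys are sorted so that property insertion order does not matter;
// array order is preserved because it is usually part of the behavior.
const canonical = (value: unknown): string =>
  JSON.stringify(
    value,
    (_key, v) => {
      if (v !== null && typeof v === "object" && !Array.isArray(v)) {
        const source = v as Record<string, unknown>;
        const sorted: Record<string, unknown> = {};
        for (const k of Object.keys(source).sort()) {
          sorted[k] = source[k];
        }
        return sorted;
      }
      return v;
    },
    2
  );

const first = canonical({ b: 1, a: { d: [2, 1], c: 3 } });
const second = canonical({ a: { c: 3, d: [2, 1] }, b: 1 });
```

Here first and second are identical strings. Values that JSON cannot represent (functions, undefined properties, cyclic structures) would need extra handling that this sketch leaves out.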

Using code coverage to guide the exploration of input space

After copying the actual output to the expected value, run the test again (should now succeed) and check what the coverage is using your coverage tool.

If you find uncovered code (and most likely you will), write another test in the same way, modifying the inputs so that the uncovered code gets executed. Study the code to guide your choice of inputs, until the input space is explored well enough to cover as much of the code as possible.

You may find it impossible to cover every single branch; code that no input can reach is called dead code. That’s normal and even expected in legacy code that was not written test-first. Nevertheless, don’t remove it yet.

Generating test cases automatically

In some cases, it may be too tedious to cover all branches by hand. In this case it could be feasible to generate lots of test cases programmatically and save the results to a file along with the corresponding inputs. The test would then read the input-output pairs from the file, execute the code and compare against the expected value.
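A rough sketch of this generate-record-verify loop is shown below. Everything in it is illustrative: legacyFee is a made-up function standing in for the code under test, and generateCases and failures are hypothetical helpers. To keep the example self-contained, the recorded pairs are serialized to a JSON string; in a Node project that string would typically be written to a file with fs.writeFileSync on the first run and read back with fs.readFileSync in the test.

```typescript
// Hypothetical legacy function with several branches.
const legacyFee = (amount: number, premium: boolean): number => {
  if (amount <= 0) return 0;
  const base = amount > 100 ? 5 : 2;
  return premium ? base - 1 : base;
};

type Case = { input: [number, boolean]; output: number };

// Generate cases from the cartesian product of boundary-ish values,
// recording the observed output for each combination.
const generateCases = (): Case[] => {
  const amounts = [-1, 0, 1, 100, 101, 1000];
  const flags = [false, true];
  const cases: Case[] = [];
  for (const amount of amounts) {
    for (const premium of flags) {
      cases.push({ input: [amount, premium], output: legacyFee(amount, premium) });
    }
  }
  return cases;
};

// Recorded once, before refactoring starts. This string is what would be
// stored in the golden master file.
const goldenMaster: string = JSON.stringify(generateCases());

// Replay the recorded inputs against an implementation and collect the
// cases whose output differs from the recording.
const failures = (
  fn: (amount: number, premium: boolean) => number,
  master: string
): Case[] =>
  (JSON.parse(master) as Case[]).filter(
    (c) => fn(c.input[0], c.input[1]) !== c.output
  );
```

The test then simply asserts that failures(refactoredFee, goldenMaster) is empty, where refactoredFee would be the new implementation. Choosing the input values still requires a look at the code and at the coverage report; generation only removes the tedium of writing each case by hand.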


There are various tools for storing the outputs between test runs, like jApprove and “approvals” for JavaScript.


Let’s have a look at a real-world example:

import { DateString, getDateString } from "@report/types";
import {
  addDays,
  endOfMonth,
  getYear,
  isWeekend,
  startOfMonth,
} from "date-fns";
import { getHolidays } from "./getHolidays";
export const getMonthsDates = (date: Date) => {
  const year = getYear(date);
  const firstDate = startOfMonth(date);
  const lastDate = endOfMonth(date);
  const days: DateString[] = [];
  const weekdays: DateString[] = [];
  const holidays = getHolidays(year);
  let currentDate = firstDate;
  while(currentDate <= lastDate) {
    days.push(getDateString(currentDate));
    if (!isWeekend(currentDate) &&
        !holidays.some((h) => h.date === getDateString(currentDate))) {
      weekdays.push(getDateString(currentDate));
    }
    currentDate = addDays(currentDate, 1);
  }
  return { days, weekdays };
};

It isn’t too bad, but could definitely be simpler with some functional programming style.

Let’s find the dependencies. It seems that the getMonthsDates function has an external dependency on getHolidays that might make the results nondeterministic. So we’ll use a small wrapper to make the dependency injectable from tests while keeping the interface to other parts of the codebase unchanged:

export const getMonthsDates = (date: Date) => {
  return getMonthsDatesImpl(date, getHolidays);
};
export const getMonthsDatesImpl = (date: Date,
    getHolidays: (year: number) => { date: DateString }[]) => {
  const year = getYear(date);
  const firstDate = startOfMonth(date);
  // ...
};

We’ve replaced getMonthsDates with a small wrapper function that injects the getHolidays dependency, and moved the implementation to getMonthsDatesImpl, which can be given a mock version of the dependency in tests.

So let’s write the test! The inputs to the function are a Date object and the injected dependency: a function that returns the holidays for a given year.

describe('getMonthsDates', () => {
   it('should preserve its behavior', () => {
      const holidaysFn = (year: number) => [
        {date: '2022-01-06' as DateString}
      ];
      const result = getMonthsDatesImpl(new Date("2022-01-01"), holidaysFn);
      expect(result).toEqual({});  // {} will be replaced with actual value
   });
});

If we run it, the expectation fails. Instead of an empty object, the result is:

{
    days: [
      '2022-01-01', '2022-01-02', '2022-01-03',
      // ...omitted for brevity
    ],
    weekdays: [
      '2022-01-03', '2022-01-04',
      // ...omitted for brevity
      '2022-01-28', '2022-01-31'
    ]
}

Let’s copy this actual value into the assertion, replacing the empty object as the expected value. The test now passes. Checking the coverage, it seems that we got lucky – line and branch coverage are both 100%.

Now we can start making very small, incremental changes, running the test frequently, until we end up with a successfully refactored version:

export const getMonthsDatesImpl = (date: Date, getHolidays: (year: number) => {
  date: DateString;
}[]) => {
  const days = eachDayOfInterval({
    start: startOfMonth(date),
    end: endOfMonth(date)
  });
  const holidays = getHolidays(getYear(date));
  const isHoliday = (day: Date) => holidays.some(h => h.date === getDateString(day));
  const isWorkday = (day: Date) => !(isWeekend(day) || isHoliday(day));
  const weekdays = days.filter(isWorkday);
  return { days: days.map(getDateString), weekdays: weekdays.map(getDateString) };
};