How to be assertive about not testing your data science pipeline

rstats workflow

An introduction to assertive programming as an alternative to testing

Miles McBain https://milesmcbain.xyz
2023-04-23
Figure 1: A solidly built pipeline, or is it?

This is a short post introducing assertive programming as an alternative to traditional software testing that probably makes more sense for data science pipeline code.

Getting assertive

Assertive programming is a useful defensive programming technique where we try to identify problems with our data early, and throw an error rather than letting that problem propagate deeper into the program where it might cause all sorts of weird and wonderful bugs.

If you’ve ever written an #rstats function, there’s a reasonable chance you made some attempt to validate your inputs. Congratulations, you’re already doing assertive programming. For example:

my_function <- function(name) {
  stopifnot(length(name) == 1 && is.character(name))

  # ...do stuff
}

This type of example is common in R because all our basic data types are vectors, but a lot of the time we want to treat them conceptually as scalar values, and not have to worry about handling vectors of longer lengths. There are some nice helpers in {rlang} for performing this kind of check.
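For instance, here is a minimal sketch of the same check written with {rlang} helpers (the function and error message are just illustrative):

my_function <- function(name) {
  # is_scalar_character() covers both the length-1 and character checks
  if (!rlang::is_scalar_character(name)) {
    rlang::abort("`name` must be a single string")
  }

  # ...do stuff
}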

Lately, assertive programming has become of interest to me because I’ve realised that for data science pipeline code, it makes more sense to invest effort in assertive programming rather than traditional software testing. This has resolved a long-held awkward feeling for me about the thousands of lines of code per project that are written and tested interactively against datasets, but rarely have any more rigorous testing applied.

Prefer being lazy to absurd

The reason no tests ever got written on my pipeline code is that it felt too awkward. Pipeline code is not generic like the code that gets written into functions and packaged. It’s specific to the datasets at hand, and typically makes many implicit assumptions about their structure and contents. Coming up with variations of our data that we might need to handle in our specific situation can border on the absurd.

For example, say we had a function in our pipeline that does some kind of classification or binning of data:

library(dplyr)
add_record_classification <- function(dataset) {
  dataset |>
    mutate(size = case_when(
      body_mass_g >= 4000 ~ "big",
      body_mass_g >= 3000 ~ "medium",
      body_mass_g < 3000 ~ "small"
    ))
}

This example is so simple we wouldn’t bother testing it, but imagine the classification was a bit more involved and the code makes us a bit uneasy.

If we tried the traditional approach and fired up {testthat} to write a test for add_record_classification() the very first hurdle we’d face is to create some test data. And this is where things start to get awkward. We already have a dataset… so should we carve off some of that and use it as test data? That would be an oddly specific test. But then should we create a test dataset that is different from the dataset we know we plan to use for the sake of testing the function?
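To make that hurdle concrete, here’s a rough sketch of the kind of test we’d end up writing, with invented rows standing in for data we already have on hand:

library(testthat)

test_that("add_record_classification bins records by body mass", {
  # hand-invented rows that merely mimic the real dataset's structure
  test_data <- data.frame(body_mass_g = c(4500, 3200, 2800))

  result <- add_record_classification(test_data)

  expect_equal(result$size, c("big", "medium", "small"))
})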

If you’ve done a lot of testing you might say, “Ahah but what if body_mass_g is NA?”, and that would be a useful thing to think about. But it turns out our current dataset doesn’t have any NAs for now. Should we invent a test dataset that contains NAs we know we don’t have for the sake of it? That would force us to write new code to address a problem we don’t have.

I mean sure, the data might change, that’s a valid perspective. But writing robust NA handling for the sake of passing a test, for a case we know we won’t actually encounter right now, is very speculative. In my opinion, it’s better to take a lazy approach, and put in the NA handling code if and when that situation arises. Assertive programming is how we detect when that situation arises:

add_record_classification <- function(dataset) {

  if (anyNA(dataset$body_mass_g)) {
    rlang::abort("NAs are present in 'body_mass_g' column")
  }

  dataset |>
    mutate(size = case_when(
      body_mass_g >= 4000 ~ "big",
      body_mass_g >= 3000 ~ "medium",
      body_mass_g < 3000 ~ "mmall"
    ))
}

If you’re scoffing, remember this is pipeline code. It’s designed to take a specific known dataset and generate some specific desired output. There is not a universe of possibilities to handle here. If the code genuinely is generic, then it likely belongs in a package, where traditional testing makes sense.

You may agree with my approach but be uncomfortable with being so lazy because your pipelines are a long-running process. It would suck to get that error 30 minutes into a run - that’s for sure. But if that’s the case I’d argue you can do a lazier and better job of things by converting your pipeline to use {targets} and then that problem goes away. You can resume processing from close to where the error occurred.
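As a rough sketch (the file path and target names here are hypothetical), a _targets.R along these lines means a failed assertion only invalidates the target that raised it, and re-running tar_make() resumes from the cached upstream targets rather than from scratch:

library(targets)

tar_option_set(packages = "dplyr")

list(
  # hypothetical data source; substitute your own
  tar_target(raw_records, read.csv("data/records.csv")),

  # if the NA assertion aborts here, raw_records stays cached and the
  # next tar_make() starts again from this target only
  tar_target(classified_records, add_record_classification(raw_records))
)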

{testthat} as an assertive programming tool

There are lots of great tools in R for assertive programming. An interesting find for me is that {testthat} is incredibly useful in this regard, due to the detailed errors it gives when it finds conditions that aren’t met. For example, quite often after combining a bunch of datasets or sending them through a pipeline of manipulations, I want to assert that I have not inadvertently changed the number of rows in the output dataset, either by dropping rows or by introducing duplicates:

library(testthat)
make_my_rectangle <- function(dataset_a, dataset_b, dataset_c) {

  # ... combine dataset_a, dataset_b, and dataset_c into output_dataset

  expect_equal(nrow(output_dataset), nrow(dataset_a))
  expect_false(any(duplicated(output_dataset$id)))

  output_dataset
}

If the number of rows has changed, we see:

Error: nrow(output_dataset) not equal to nrow(dataset_a).
1/1 mismatches
[1] 200 - 344 == -144

{testthat} adds a nice amount of numeric detail to the error report.

These assertions become useful yardsticks when working backward from a bug in the pipeline to find the source. They save you from loading up the intermediate datasets from your {targets} cache to check this stuff manually, because the code they guard can be ruled out as a possible source at a glance.

Conclusion

I’m really enjoying {testthat} for my in-pipeline assertive programming at the moment. I can use the same kinds of patterns I use in my package code tests, which reduces my cognitive load as I context switch. And the fact that one package can span both the testing and assertive programming use cases possibly hints at a deeper relationship between the two. In my opinion, assertive programming can be employed as a kind of dual of traditional software testing, conveying equivalent benefits in the right context.

There’s one significant advantage assertive programming has over traditional testing for pipelines that is worth mentioning and remembering if you’re ever challenged about it: by design, software tests run outside of the ‘production’ pipeline code. The test suite is separate from the main codebase that gets the work done. A consequence of that is even though the tests exist, there is nothing that guarantees they all passed, or indeed even ran, before the pipeline code was executed. By moving the test conditions into the pipeline and expressing them as assertions, we guarantee all the conditions were met, otherwise, the pipeline would have failed to complete. This is a very useful guarantee from a reproducibility perspective.

Some other packages I’ve used, and recommend for assertive programming:

Tools suggested by #rstats community members on Mastodon:

In conclusion, I hope I’ve helped you feel better about not testing your pipeline code. You know what to do. Be assertive about it!
