Unit testing has long been a topic of debate within the software development community: what unit testing is, what the best practices are, and what role unit tests should play in the development process.
Although it is common practice to include a testing phase in the development process, the subject of unit tests specifically can split developers into two camps of staunch opponents.
One group will be strong advocates, praising the benefits and arguing that unit testing is truly necessary. The other will try to prove that writing unit tests is a waste of time, or at least not as efficient as it should be, given the effort it requires.
In this article, we will dive deep into the concept of unit tests, both in general and in the context of Data Science and Data Engineering projects (you can read about the difference between Data Engineering and Data Science here).
Once you read it, you will know:
- what unit testing is,
- why to use unit testing in Data Science and Data Engineering,
- when to use unit testing in Data Science and Data Engineering,
- what the best practices for unit testing are,
And finally: is unit testing worth it or is it a waste of time?
What is unit testing?
Generally, tests are functions or pieces of code created to examine the main body of the code and ensure that it works properly, meaning as expected.
A unit test is a special test type that you narrowly scope to examine one specific piece of the existing functionality of the code. As a result, it examines how one specific function behaves in one specific scenario.
- if you have a function that multiplies an integer by 5, you can write a unit test that gives the function the number 2 as an input and verify that it returns 10,
- if your function should return the number of vowels in a word, you can create a unit test that passes the function the word “hello” as an input and checks that it returns 2.
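In Python, these two checks could be sketched as follows (the function names are illustrative, and the tests use the plain `assert` style that pytest picks up):

```python
def multiply_by_five(n: int) -> int:
    """Multiply an integer by 5."""
    return n * 5


def count_vowels(word: str) -> int:
    """Return the number of vowels in a word."""
    return sum(1 for ch in word.lower() if ch in "aeiou")


def test_multiply_by_five():
    # Passing 2 as input should return 10.
    assert multiply_by_five(2) == 10


def test_count_vowels():
    # "hello" contains two vowels: "e" and "o".
    assert count_vowels("hello") == 2
```

Saved in a file such as `test_example.py`, both tests can be run with `pytest`.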
There are a few useful paradigms to better understand what should occur within a unit test.
What are unit testing paradigms?
Paradigms are different approaches to organizing and structuring test code. Each of them comes with its own strengths and weaknesses. The best approach is to choose the one that best satisfies the testing needs dictated by the specifics of the project.
Test-driven development (TDD)
TDD entails writing tests in your test suite before writing the production code. The idea is to:
- start with writing a failing test,
- write a minimum amount of code to make the test pass,
- refactor the code as necessary.
When you write your code, you just repeat this cycle for new features or behaviors that you implement.
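As a sketch, one TDD cycle for the vowel-counting feature from earlier might look like this (the names are illustrative):

```python
# Step 1 (red): write the test first -- at this point it fails,
# because count_vowels does not exist yet.
def test_count_vowels():
    assert count_vowels("hello") == 2


# Step 2 (green): write the minimum amount of code to make the test pass.
def count_vowels(word):
    return sum(1 for ch in word.lower() if ch in "aeiou")


# Step 3 (refactor): clean up the code as necessary, then repeat the
# cycle for the next behavior (e.g. empty strings, uppercase input).
```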
Behavior-driven development (BDD)
BDD is similar to TDD, but it takes a more holistic approach and focuses on system behavior as a whole rather than examining individual units of code. Such tests are usually written in a more human-readable syntax, using Given-When-Then statements to describe the behavior being tested.
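Dedicated BDD tools such as behave or pytest-bdd express this in Gherkin syntax; as a minimal sketch, the same Given-When-Then structure can be mirrored in plain Python comments:

```python
def count_vowels(word: str) -> int:
    return sum(1 for ch in word.lower() if ch in "aeiou")


def test_counting_vowels_in_a_greeting():
    # Given a word entered by the user
    word = "hello"
    # When the vowel counter processes it
    result = count_vowels(word)
    # Then the number of vowels is reported
    assert result == 2
```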
Test case design techniques
The test case design paradigm is focused on creating test cases based on various criteria, such as boundary values, equivalence classes, and decision tables. The main idea is to design tests that are likely to detect defects in the code and allow you to act upon them.
Property-based testing
Another paradigm is property-based testing. The idea is to:
- start with defining properties that the code should satisfy,
- generate random inputs to test whether those properties hold true.
As a result, property-based testing is especially useful for finding corner and edge cases that may not be discovered by traditional test case design techniques.
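Libraries such as Hypothesis automate the input generation; a minimal hand-rolled sketch of the idea, using only the standard library, could look like this:

```python
import random


def multiply_by_five(n: int) -> int:
    return n * 5


def test_multiplication_properties():
    # Properties that should hold for ANY integer input:
    # the result is divisible by 5, and dividing by 5 recovers the input.
    for _ in range(100):
        n = random.randint(-10**6, 10**6)  # randomly generated input
        result = multiply_by_five(n)
        assert result % 5 == 0
        assert result // 5 == n
```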
Mocking and dependency injection
Finally, the last paradigm example entails techniques for isolating units of code from their dependencies. Mocking involves creating fake objects or functions to stand in for real dependencies, while dependency injection encompasses passing dependencies in as arguments or setting them up in a test environment.
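As a sketch using Python's standard library `unittest.mock`, a hypothetical database client can be replaced with a fake that is injected as an argument:

```python
from unittest.mock import Mock


def load_row_count(db):
    """Business logic that depends on an injected database client."""
    return len(db.fetch_rows("events"))


def test_load_row_count_with_mock():
    # Dependency injection: the fake client is passed in as an argument,
    # so no real database connection is needed.
    fake_db = Mock()
    fake_db.fetch_rows.return_value = [{"id": 1}, {"id": 2}]

    assert load_row_count(fake_db) == 2
    # The mock also records how it was called.
    fake_db.fetch_rows.assert_called_once_with("events")
```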
As already mentioned, there is no one universal best approach to performing unit tests. Consequently, you need to take into consideration what you test, what features it has, and what outcomes you expect from testing, to be able to choose the most appropriate paradigm.
Nonetheless, regardless of the paradigm you choose, you can apply one common pattern, which is the Arrange-Act-Assert (AAA).
What is Arrange-Act-Assert pattern in unit testing?
This pattern is a common way of structuring unit tests. It breaks a unit test into three sections:
- Arrange creates the preconditions and inputs for the test, such as necessary object creation, initialization, or configuration,
- Act executes the code tested, usually by calling a method or a function,
- Assert verifies that the expected results of the test have been achieved by comparing the actual output or state of the system to the expected output or state.
Therefore, coming back to our examples and the function that multiplies integers by 5:
- first, you Arrange the environment by defining a variable with the value 2,
- second, you Act by running the function on the variable you defined,
- lastly, you Assert whether the output produced is 10 as expected.
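Expressed in Python, the three sections could look like this:

```python
def multiply_by_five(n: int) -> int:
    return n * 5


def test_multiply_by_five_aaa():
    # Arrange: set up the input.
    value = 2
    # Act: run the code under test.
    result = multiply_by_five(value)
    # Assert: compare the actual output to the expected output.
    assert result == 10
```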
Why use unit testing in Data Science and Data Engineering?
Knowing what unit tests entail, let’s move on to why they are worth it. In fact, there are a few reasons to incorporate such tests into your Data Science and Data Engineering codebase.
Firstly, unit tests help to verify that individual functions are working as expected.
This is particularly important for Data Engineers working on compound data pipelines and for Data Scientists working on machine learning models. If you test each piece of code in isolation, you can detect errors (for example, caused by data quality issues or a small error missed during development) and ensure that the overall system behaves correctly.
Another obvious benefit of writing unit tests is faster and easier debugging. They prove especially useful in pinpointing the reason behind a specific bug. Since every test checks how one specific function or module performs in one specific scenario, all you need to do is look at which unit test failed to determine where the error is and fix it.
As a result, unit tests are a great debugging tool. They allow you to isolate the source of errors, unexpected output, or bugs and increase the quality of your code.
Unit tests also make it possible to develop and test code more effectively and quickly. Instead of performing manual tests, which are complex and time-consuming, you can speed up the process by using small, isolated pieces of data.
Refactoring with more confidence
What is more, unit tests give you the confidence to refactor your code. They ensure that changes to one part of the codebase won’t break other parts of the system. By re-running unit tests after each change, you can guarantee that the code continues to behave correctly and that the existing functionalities in your data pipelines and models remain intact even if you modify the code.
Unit tests can also facilitate collaboration between team members by providing a common set of standards for how code should behave. As a result, other developers can look at test files to understand what to expect from certain functionality before diving into the details of the code.
Finally, if you write unit tests, they can serve as a form of documentation. They provide specific examples of how the code should be used and what inputs and outputs are expected. By scrutinizing the tests, other developers can gain a deeper understanding of the code.
When to use unit testing for Data Science and Data Engineering codebase?
Although unit tests can be used by Data Scientists and Data Engineers whenever possible, there are a few key situations when they are particularly important:
- building production code – enabling you to detect and eradicate bugs before the code is released to production,
- refactoring code – ensuring that changes won’t introduce an error or break existing functionalities,
- complex data pipelines in Data Engineering or machine learning models in Data Science projects – ensuring that each component behaves as expected,
- collaborating with others – allowing others in a team to get familiar with the codebase and get up to speed faster,
- jobs taking a long time to run – helping to detect bugs early and saving you from waiting a long time only to find the job failing due to a bug in your post-processing code.
In general, you can use unit tests as often as possible in Data Science and Data Engineering codebases. Even simple tests can be a great tool to verify that new functions or new features don’t produce unexpected output or bad data.
Therefore, even if your case is not listed, remember that it doesn’t mean that writing unit tests won’t be useful.
What are the best practices for unit testing for Data Science and Data Engineering?
To have a full package for unit tests in Data Science and Data Engineering, let’s review a few best practices. They will help you exploit unit tests to the fullest and decrease the risk that they turn out to be just a waste of time.
The common best practices entail:
- writing maintainable tests that are easy to read and update – descriptive test names, organizing tests into logical groups, avoiding unnecessary complexity,
- creating deterministic tests – you should expect the same results if you run the code repeatedly in the same environment (Data Science projects and codebases are full of stochasticity, so this is especially important for Data Scientists),
- writing fast tests – you should make sure that tests can be completed in a short amount of time, especially in the case of a complex data pipeline in large projects or machine learning models,
- considering common data quality issues like duplicated or missing values that should be handled with your code,
- testing in isolation – you should test only one feature per test – for example, if you test data quality, you should test missing values with one test and duplicate records with the other.
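As a sketch of the last two practices, a hypothetical `clean_records` helper can be covered by two isolated tests, one per data quality issue (plain Python is used here to avoid extra dependencies):

```python
def clean_records(records):
    """Drop records with missing values, then deduplicate, preserving order."""
    seen = set()
    cleaned = []
    for rec in records:
        if any(v is None for v in rec.values()):
            continue  # missing value: drop the record
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue  # duplicate: keep only the first occurrence
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def test_drops_records_with_missing_values():
    # One isolated test for the missing-value rule only.
    assert clean_records([{"id": 1, "val": None}]) == []


def test_drops_duplicate_records():
    # A separate isolated test for the deduplication rule only.
    rows = [{"id": 1}, {"id": 1}]
    assert clean_records(rows) == [{"id": 1}]
```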
Is unit testing worth it or is it a waste of time?
As you can see, unit testing seems to be quite a powerful concept. It can help you build bug-free, high-quality code for your Data Science solutions and Data Engineering projects with confidence.
The major complaint against it is that the effort required to write and maintain tests may outweigh the benefits offered. However, this is not always the case. It all depends on the specific circumstances of the project and the development team.
In general, the value of writing unit tests comes from the fact that it allows you to:
- catch errors early,
- ensure code quality,
- reduce debugging time,
- facilitate collaboration,
- support refactoring.
These benefits can save a lot of time and effort in the long run, which is particularly important for complex or large projects that are likely to change over time.
That said, there may be cases where the effort required to perform unit tests is greater than the benefits. This could be the case for small projects, or projects where the code is unlikely to evolve over time. In these cases, it may make more sense to focus on other forms of testing or data quality assurance.
All in all, the decision of whether to use unit tests should be based on a careful consideration of the costs and benefits, as well as the specific needs and goals of the Data Science project or Data Engineering pipeline and of the development team. While unit testing is a valuable practice, in certain cases it may not be the best approach.