A Data Science Technique for Cleaner Data Preparation

As data scientists, so much of our time is spent on cleaning and preparing our data. The analysis and model building may be the most fun and insightful part, but before we can do that, we have to ensure that we have quality data coming in.

In my experience, most statisticians and analysts do all of their cleaning, prep work, and analysis in a single script that runs from top to bottom. While this is certainly one way to work, it often leads to messy, unmaintainable code that is difficult to communicate to other members of the team. A key goal of any team of analysts and programmers is learning to communicate well. And if code is our primary language, we really ought to make sure that we are coding with the best data cleaning practices.

In my job as a data scientist, I have had the privilege of working with more senior developers who have passed on lots of great lessons to me. I’ve applied some of these practices to my life as an analyst, in particular, regarding how I clean and prepare data. Today I am going to pass on one of those lessons to you. Namely, we’ll see how building a class in Python can help to organize our data preparation.

First we’ll explore how most analysts would perform a data cleaning process. Then we’ll observe a better way.

Method 1 (BAD): Writing a Single Process in a Script for Data Preparation

The following example shows how the typical analyst might prep data. This method is faster to code in the short run, but will be slower and more cumbersome in the long run.

import pandas as pd
# Data Cleaning

# Create data frame
df = pd.DataFrame({
        'string_col': ['A', 'B', 'C'],
        'bool_col1' : [False, True, True],
        'int_col' : [1, 2, 3],
        'bool_col2' : [True, True, False]})

# Set boolean variables to process
vars_to_process = list(df.columns[df.dtypes == 'bool'])

# If there are variables to process, do some processing.
# Here, we just set the variables to 12345 for illustration purposes.
if(len(vars_to_process) > 0):
    df[vars_to_process] = 12345

# Set a sorting column
sorting_col = 'string_col'

# If there is a sorting column, prep it.
if sorting_col is not None:
    # To prep our sorting column, we lowercase values and then triple them
    df[sorting_col] = [x.lower() for x in df[sorting_col]]
    df[sorting_col] = [x*3 for x in df[sorting_col]]

Notice that the code above does work. And to most people that is good enough. It is certainly not a bad start. Most analysts and statisticians would see code like this and wouldn’t bat an eye.

However, notice a few issues with it:

1) The sorting column (sorting_col) is just kind of hanging out in the middle of the script. If one wanted to change that value in the future or use it again later in the script, they’d have to navigate through the code to find that global variable before they could reinitialize it. They’d also have to remind themselves of what it represented, which requires reading it again. This would be quite lengthy and annoying. And it would definitely make the program more vulnerable to human error. This is true even for the person who wrote the script, but especially for other members on the team who may be reading the code.

2) Also, note that no part of the process above is encased in a function. This means that if an error occurred, we’d have no good way of handling that error or debugging our process. In addition, it requires people to know that checking “if the length of vars_to_process is greater than 0” means checking whether there are any variables to process. It would be nicer to spell this out explicitly in the code itself rather than having to manually add a comment.

3) Finally, if we wanted to use the sorting column variable later in the analysis, we may forget that it belongs to the data frame or is a meaningful attribute of it. To that end, it would be nice to have a way to keep it conceptually “linked” to the data frame.
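
Point 2 can be illustrated on its own, before any class enters the picture. The following is only a minimal sketch: the helper names has_vars_to_process and prep_sorting_col, the explicit KeyError, and the no_such_col column are all assumptions made for illustration and are not part of the script above. It shows what naming a check and wrapping a step in a function buys us: the check states its intent, and a failure surfaces in one place where it can be caught.

```python
import pandas as pd


def has_vars_to_process(vars_to_process):
    """Named check (hypothetical helper): is there at least one variable?"""
    return len(vars_to_process) > 0


def prep_sorting_col(df, sorting_col):
    """Hypothetical helper: lowercase the sorting column's values, then triple them."""
    if sorting_col not in df.columns:
        # Fail early with a message that names the problem.
        raise KeyError(f"sorting column {sorting_col!r} not found in data frame")
    df[sorting_col] = [x.lower() * 3 for x in df[sorting_col]]
    return df


df = pd.DataFrame({'string_col': ['A', 'B', 'C'], 'int_col': [1, 2, 3]})

# A bad column name now fails inside one function, where we can catch it.
try:
    prep_sorting_col(df, 'no_such_col')
except KeyError as err:
    error_message = str(err)

# The valid call prepares the column as before.
df = prep_sorting_col(df, 'string_col')
```

The same idea, applied to every step and bundled with the data, is what the class in the next section does.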

In sum, though it may be faster to do our analysis like this, in the long-term these problems that we have identified could lead to lost time and work.

Now we’ll look at a technique that can make our code more maintainable, less prone to error, and easier to understand.

Method 2 (GOOD): Writing a Class for Data Preparation

The following example shows an improvement on our previous method. It shows the value of being a statistician who can think like a developer (i.e., this makes for better data scientists). With reuse, this method yields long-term returns on your time and productivity.

class DataPrepper():

    def __init__(self, df, vars_to_process=None, sorting_col=None):

        self.df = df
        # Default to None rather than [] so instances never share one list.
        self.vars_to_process = vars_to_process if vars_to_process is not None else []
        self.sorting_col = sorting_col

        if self.there_are_vars_to_process():
            self.process_vars()

        if self.there_is_a_sorting_col():
            self.process_sorting_col()

    def there_are_vars_to_process(self):
        return len(self.vars_to_process) > 0

    def process_vars(self):
        """To process a variable, set each value to 12345."""
        self.df[self.vars_to_process] = 12345

    def there_is_a_sorting_col(self):
        return self.sorting_col is not None

    def process_sorting_col(self):
        """To process a sorting column, first lowercase values and then triple them."""
        self.lowercase_vals()
        self.triple_vals()

    def lowercase_vals(self):
        lower_vals = [x.lower() for x in self.df[self.sorting_col]]
        self.df[self.sorting_col] = lower_vals

    def triple_vals(self):
        triple_vals = [x*3 for x in self.df[self.sorting_col]]
        self.df[self.sorting_col] = triple_vals

Notice we now have a class with variables and functions that are centered on data preparation. And note how it has solved each of our problems from before:

  1. It has a few instance variables, such as sorting_col, that we can use inside any functions in our process. In addition, per convention, they are located at the top of the class, meaning other programmers can easily access them if they need to.
  2. We also can see how these functions make our process much more readable and maintainable. Other members of the team can read this and generally understand the nature of the process. The code now reads almost like a story (e.g., ‘If there are variables to process, then process the variables.’). By the way, I picked up this tip from reading Robert Cecil Martin’s “Clean Code” (which I highly recommend). And now where applicable I always program with this level of readability.
  3. Finally, because the sorting column (sorting_col) is stored as an attribute of the prepared data, if we ever do need it in the future, we can access that attribute easily and know that it is an associated component of the data preparation process.

For each of these reasons, our code is now less prone to error, more maintainable, and easier to communicate to our team.

Now preparing the data is as easy as creating a DataPrepper object.

prepped = DataPrepper(df=df,
                      vars_to_process=['bool_col1', 'bool_col2'],
                      sorting_col='string_col')
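
To check what the object holds afterward, the sketch below repeats a condensed version of the class so that it runs on its own. The condensed __init__, the merged body of process_sorting_col, and the final sort_values call are my assumptions about how the pieces fit together, not a definitive implementation.

```python
import pandas as pd


class DataPrepper:
    """Condensed version of the class above, repeated so this block runs alone."""

    def __init__(self, df, vars_to_process=None, sorting_col=None):
        self.df = df
        self.vars_to_process = list(vars_to_process or [])
        self.sorting_col = sorting_col

        if self.there_are_vars_to_process():
            self.process_vars()
        if self.there_is_a_sorting_col():
            self.process_sorting_col()

    def there_are_vars_to_process(self):
        return len(self.vars_to_process) > 0

    def process_vars(self):
        # Placeholder processing, as in the article: set each value to 12345.
        self.df[self.vars_to_process] = 12345

    def there_is_a_sorting_col(self):
        return self.sorting_col is not None

    def process_sorting_col(self):
        # Lowercase each value, then triple it.
        self.df[self.sorting_col] = [x.lower() * 3 for x in self.df[self.sorting_col]]


df = pd.DataFrame({
    'string_col': ['A', 'B', 'C'],
    'bool_col1': [False, True, True],
    'int_col': [1, 2, 3],
    'bool_col2': [True, True, False]})

prepped = DataPrepper(df=df,
                      vars_to_process=['bool_col1', 'bool_col2'],
                      sorting_col='string_col')

# The prepared frame and its sorting column travel together on one object,
# so later steps do not need to remember a separate global variable.
sorted_df = prepped.df.sort_values(prepped.sorting_col, ascending=False)
print(sorted_df)
```

Note that the class stores the data frame it is given rather than a copy, so the original df is modified as well. Whether that is desirable depends on the workflow; passing df.copy() is one way to avoid it.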


Concluding notes

As analysts, there is so much more to our job than just crunching numbers and reporting results. Spending more time initially to develop better data preparation techniques will pay long-term dividends to you, your team, and ultimately your organization.

Note: while the above technique improved our process, there are of course additional enhancements that we can make. In future posts, we will learn about the value of using try/except blocks, using configuration files to specify our parameters, and some conventions for naming variables and functions. All in all, I will continue to spotlight great tips that will make you a better data scientist!

