Data Cleaning Best Practices
Start by getting in a data mindset and understanding how to bulletproof your data. Ask these questions:
- Who collected the data?
- Why did they collect it?
- How did they collect it?
- When did they start (and stop) collecting it?
- What happened to the data between the day it was born and the day you are harvesting it?
Get a data dictionary (or make your own)
“A data dictionary is a file that explains what each column header in your spreadsheet is, along with information on the source of the data. …
“The data dictionary can contain (but is not limited to):
- Field name: these should be clear. If you use headers like revenue1 and revenue2, chances are you won’t remember what they are. Save yourself time by making them concise and descriptive, and be sure to write them up in the data dictionary.
- [Original field name: you may find that you need or just want to rename the field to make it easier to work with. In that case you’ll want to have a place to retain the original field name] – rt
- Meaning – what the field name means
- Format – the data type of the field: text, integer, float, date, etc.
- Source – where the data comes from. This is especially helpful if you are combining multiple data sources.”
- For each field, create a list of the values you find in the field (or the range of values if it’s not practical to list every unique value). Make sure you know (meaning you have at least one reliable source other than yourself) what each value means. For example, does “UN” mean “unknown” or “unaffiliated”? If you’re working in a spreadsheet, pivot tables are a great way to see all the unique values in a field. In a database, you can use SELECT DISTINCT to find the unique values and GROUP BY to find the frequency of each value.
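If you are working in Python instead, the same idea can be sketched with the standard library. The field names and sample rows below are hypothetical stand-ins for a CSV you have loaded:

```python
from collections import Counter

def field_frequencies(rows, field):
    """Count how often each distinct value appears in one field."""
    return Counter(row[field] for row in rows)

# Hypothetical sample data standing in for rows read with csv.DictReader.
rows = [
    {"county": "Wake", "status": "UN"},
    {"county": "Wake", "status": "A"},
    {"county": "Durham", "status": "UN"},
]
freq = field_frequencies(rows, "status")
print(freq)  # Counter({'UN': 2, 'A': 1})
```

Listing every value this way is the moment you discover the “UN”-style codes you need to track down.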
If you get a data dictionary from the source, fact-check it. If you are making it yourself, go ahead and clean it up along the way. Some things you will want to do:
- Make sure there are not multiple tables in a single sheet. For example, if you have data about school budgets, you don’t want separate tables for each school, with the values for “School A” in cells A1:D10, values for “School B” in cells F1:I10, and values for “School C” in cells A11:D20.
- Make sure all fields are the correct format (numerical, text, and especially dates).
- Relocate any “metadata” (titles, notes, source lines) that is mixed into the data table itself.
- Remove all rows that are entirely blank.
- Remove all rows that are aggregations (totals, subtotals, etc.)
- Edit header names so that they begin with a letter, contain no spaces and are relatively brief. (If you change a header name, be sure to record the original name.)
- Make sure “state” and “city” fields (if present) are separate.
- Separate names into first, last, middle and suffix (such as “Jr.” or “III”). You may consider keeping a “full_name” field to more easily distinguish between similarly named people.
- Separate addresses into “street”, “city”, “state” and “zip code”. Just like name, however, consider keeping a “full_address” field. This advice comes with the caveat that you sometimes need apartment or office numbers, or don’t need states, or might need to separate out the “street” field into more atomic pieces.
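The name-splitting step above can be sketched in Python. This is a deliberately naive split that assumes a “First Middle Last Suffix” order and a short suffix list of my own choosing; real names are messier and need case-by-case review, which is one reason to keep the full_name field:

```python
def split_name(full_name):
    """Naively split 'First Middle Last Suffix' style names.

    Assumes the suffix, if any, comes last. Real-world names
    (single names, compound surnames) will need manual review.
    """
    suffixes = {"Jr.", "Sr.", "II", "III", "IV"}
    parts = full_name.split()
    suffix = parts.pop() if parts and parts[-1] in suffixes else ""
    first = parts[0] if parts else ""
    last = parts[-1] if len(parts) > 1 else ""
    middle = " ".join(parts[1:-1]) if len(parts) > 2 else ""
    return {"first": first, "middle": middle, "last": last,
            "suffix": suffix, "full_name": full_name}

print(split_name("George Kenneth Butterfield Jr."))
```

The same pattern (split into parts, keep the original) applies to addresses.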
Recipes for cleaning data
- “Empty” cells. Not all “empty” cells are created equal. Cells with a blank space (“ ”) in them are different from cells with no value. The first thing you might want to do is a find-and-replace to switch all “ ” cells to “”. Then, you can sort each column in ascending order. The “ ” space cells will appear at the top of the list and the “” cells will appear at the bottom. But, in order for this to work properly, you need to select the entire column and then use the sort button in the menu bar. If you try to use the filter dropdown in a header row, the sort will only affect cells above the first blank cell. If you are working in SQL, set all cells that contain an empty string to NULL values. In OpenRefine, you can set all blank cells to NULL values when you import your data.
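In Python, a minimal sketch of the same normalization, treating whitespace-only cells and truly empty cells alike (the helper name is mine):

```python
def normalize_blank(value):
    """Map whitespace-only or empty strings to None, like SQL NULL."""
    return None if value is None or value.strip() == "" else value

cells = ["NC", " ", "", "  ", "SC", None]
print([normalize_blank(c) for c in cells])
# ['NC', None, None, None, 'SC', None]
```

Once every flavor of “empty” is the same value, sorts, counts and joins behave predictably.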
- Highs and lows. Sort each numerical and date field so you know the maximum and minimum values for that field. (You might also enjoy installing csvkit and using the csvstat command on your file.)
- Count the frequency of values in each field using either pivot tables in a spreadsheet or GROUP BY in an SQL database. Is there the full range of values you would expect? This is easiest when you have a dataset with well-known, consistent and finite values. For example, are all 50 states present? Are all 100 North Carolina counties present? And are there between 32,000 and 42,000 zip codes in your national dataset? (It depends on which definition of “zip code” you are using…)
- At the same time you are looking at whether the collection of values is complete, also look at the frequency of each value in each field. Do they pass the sniff test? Does one day of the year account for more than 0.5% of the rows in your data? Are records from Rhode Island more frequent than records from California? In either case, you may have a great story but you first need to double-check to make sure you don’t have bad data.
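The sniff test above can be automated: compute each value’s share of the rows and eyeball anything that looks too big. A sketch with hypothetical dates standing in for a real field:

```python
from collections import Counter

def value_shares(values):
    """Share of rows accounted for by each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

# Hypothetical example: one day suspiciously dominates the data.
days = ["2024-03-01"] * 80 + ["2024-03-02"] * 20
shares = value_shares(days)
print(shares["2024-03-01"])  # 0.8
```

A single day carrying 80% of the rows is almost certainly a data-entry default, not a great story.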
The first rule of cleaning data should be “do no harm.” If you are transforming data, especially something like deduplicating names or standardizing industries, do not destroy your original data and don’t change anything unless you can justify the change with an independent source. Don’t assume that things that appear to be the same actually are the same. OpenRefine is a great tool for data cleaning because you can easily roll back changes and create new columns of clean information while preserving the original data.
- In many situations, you would like to avoid worrying about case sensitivity. For example, you probably want “Spring hope” to be the same thing as “SPRING HOPE” and “Spring Hope”. In this case, you’ll want to convert all your text fields to upper-case characters. You can do this with the UPPER function in either Excel or SQL. In OpenRefine, use value.toUppercase().
- State names: Make sure that all state abbreviations are consistent throughout the dataset. To a computer, “NC” and “N.C.” and “North Carolina” are all different. I recommend converting all states to two-letter postal abbreviations. (If you need AP style state abbreviations or full state names later during the presentation phase, create a lookup table in a database or another field in your spreadsheet that has the version you need.)
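A lookup table like the one suggested above is simple to sketch in Python. The variants listed here are hypothetical examples; in practice you build the table from the variants you actually find in your data:

```python
# Hypothetical lookup table mapping observed variants to postal codes.
STATE_VARIANTS = {
    "NC": "NC",
    "N.C.": "NC",
    "NORTH CAROLINA": "NC",
    "CA": "CA",
    "CALIF.": "CA",
    "CALIFORNIA": "CA",
}

def normalize_state(value):
    """Return the two-letter code, or the original value if unrecognized."""
    return STATE_VARIANTS.get(value.strip().upper(), value)

print(normalize_state("N.C."))        # NC
print(normalize_state("california"))  # CA
```

Returning the original value for unknown inputs (rather than guessing) keeps the surprises visible so you can add them to the table deliberately.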
- Zip codes: Does your data include five-digit zip codes, zip+4 zip codes or both? You can easily trim off the last four characters of a zip code, but you can’t add them unless you look up the street addresses somehow. Also, be sure your zip codes are not formatted as numeric values — 90210 is not the same thing as 90,210.
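Both zip-code pitfalls (trimming zip+4 and restoring leading zeros lost to numeric formatting) can be handled in one small helper. This is a sketch with a hypothetical helper name, and it assumes US five-digit zips:

```python
def normalize_zip(value):
    """Keep zips as text: trim zip+4, restore lost leading zeros."""
    z = str(value).strip().split("-")[0]  # drop the +4 portion, if any
    return z.zfill(5)                      # pad zips that were read as numbers

print(normalize_zip("90210-1234"))  # 90210
print(normalize_zip(2134))          # 02134  (a Boston zip stored as a number)
```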
- Numbers: Make sure numbers are not formatted as text strings (otherwise 100 will come before 11 when you try to sort them) and that there are no commas or decimal points being stored as characters in the field. (Usually the quick fix for this is opening the file in a spreadsheet and converting the field to a numeric type.)
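The text-versus-number sort problem is easy to demonstrate, and the quick fix can be sketched as a small converter. The helper name is mine, and it assumes US-style comma grouping:

```python
def to_number(value):
    """Strip formatting characters, then convert. Assumes US-style commas."""
    cleaned = str(value).replace(",", "").replace("$", "").strip()
    return float(cleaned) if "." in cleaned else int(cleaned)

print(sorted(["100", "11", "9"]))                   # ['100', '11', '9'] (text sort)
print(sorted(map(to_number, ["100", "11", "9"])))   # [9, 11, 100]
```

The first line shows exactly the failure described above: as strings, “100” sorts before “11”.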
- Do you have multiple values in a field that might mean the same thing? For example, in data about North Carolina schools you might find that “Springhope” and “Spring Hope” and “Spring Hope Elementary” and “Spring Hope Elementary School” and “Springhope School” are actually all the same place (or maybe not). With people (or anything named after people), you might find that “GK Butterfield” and “G.K. Butterfield” and “Butterfield” and “George Butterfield” and “George Kenneth Butterfield Jr.” all mean the same thing (or maybe not). Descriptive categories of crimes from local police agencies are also notoriously inconsistent. In cases like schools or crimes, ask an expert whether there is a canonical list of valid values, and whether that list comes with a unique identifier code you can use instead of descriptive names. Finally, if you have two values that appear to be the same, one of them may have a stray space at the beginning or end of the word. In that case the TRIM function (the same in both Excel and SQL) will remove those stray spaces. If the extra spaces are in the middle of the value, try a find-and-replace (or value.replace() in OpenRefine). In OpenRefine, value.trim() removes leading and trailing whitespace, and the “Collapse consecutive whitespace” transform handles repeated spaces in the middle of a string. Searching for misspellings is a great job for OpenRefine’s faceting and clustering tools.
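The trimming, whitespace-collapsing and case-folding recipes above combine into one normalization step. A Python sketch (the helper name is mine):

```python
import re

def tidy(value):
    """Trim the ends, collapse internal whitespace runs, upper-case."""
    return re.sub(r"\s+", " ", value.strip()).upper()

variants = ["Spring Hope ", "  SPRING  HOPE", "spring hope"]
print({tidy(v) for v in variants})  # {'SPRING HOPE'}
```

Note this only fixes mechanical differences; deciding whether “Spring Hope Elementary” and “Springhope School” are the same place still requires a human (or a canonical list).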
Other metadata before you get started
- If your data contains currency values, what year are the values from? Have they been adjusted for inflation or not?
Another great resource for “fact checking” a dataset is Chapter 5 of the “Verification Handbook for Investigative Reporting” by Giannina Segnini.