How we helped a journalist fact-check with data
Recently, Carolina Data Desk received a request from a journalism grad student looking to analyze a data set to fact-check a claim for her thesis. She was familiar with some data journalism techniques and knew that this was possible, but due to some computer issues was unable to perform the analysis herself.
This reporter was writing about the Durham County jail and wanted to verify this sentence: “Data show between June and July 2016, the majority of people in the jail were held on bail of less than $5,000.” She was familiar with this jail and was a bit skeptical of this claim, saying:
The conclusion above was surprising to me because I spend a lot of time looking at the jail list, and usually it seems that a majority of the people currently incarcerated have bonds higher than $5,000. I think the statement may be false or misleading, that whoever crunched this may not have dealt with the fact that many people with low bonds were in the jail for 1 day or less – not exactly members of the jail population – and many are being held on no bond (which registers as $0). My hypothesis is that whoever came to this conclusion was including people who were booked, but bailed out within 24 hours.
She wanted to check the percentage of people having bail of less than $5000, both for all bookings at the jail and for only people staying longer than a day. Fortunately, she already had the data necessary to investigate in the form of a csv containing data on the jail’s inmate population with bond information. She provided me with several clarifications about the data to aid me in my analysis. These were:
- Some bookings involved multiple charges which are included as multiple rows in the data. She was interested only in the sum of the bond amounts for each booking, as this sum must be paid before the individual is released.
- Many of these duplicate records include no bond amount.
- Records with “No Bond” or “NB” should not be included in the count of people having bail less than $5000, as no bond means they are not allowed to leave jail on bond, not that they have a bond of $0. However, these should still be included in the total jail population.
I carried out the analysis in R, working through the following steps. I:
- Loaded the tidyverse package. This package contains many functions useful for cleaning and organizing data. If you have never used it before, you will have to install it before loading it, using the command install.packages("tidyverse").
- Read the csv file into R.
- Added a column to the data containing the full name of each person in the data set. In the original data, names are separated into three columns for first, middle, and last names.
- Converted the bond amount column to numeric. R read in this column as a factor due to the format, but we need it to be numeric to perform mathematical operations like addition.
- Split the data set into two parts: people with bonds and those with no bond. This step actually contained three sub-steps, as I also summarized each booking into one row in this step. In those sub-steps, I:
  - Filtered the data set by bond type to include only those with bonds in the first case and only those with no bond in the second case.
  - Grouped the rows by full name and date confined. I chose these two fields because eventually I wanted one row per booking, which should be a unique combination of name and date confined. If I used name alone, someone who was booked, released, and booked again days or weeks later for a different offense would be counted only once.
  - Summarized the total bond amount, maximum days confined, and maximum days on charge for each booking. The days confined column should contain the same value for all rows from a single booking, but this is not the case for the days on charge column.
- Calculated the number of people with no bond, bonds of $5000 or less, and bonds greater than $5000, both for all records and for only those people who were confined in the jail for at least two days. To do this, I filtered based on the total bond amount and/or number of days confined as necessary and then counted the remaining rows.
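The steps above can be sketched roughly as follows. This is a minimal illustration rather than the script I actually ran: the toy data, every column name (first_name, bond_type, bond_amount, and so on), the "$2,000.00" bond format, and the helper functions summarise_bookings() and count_groups() are all assumptions made for the example. Only dplyr, which is part of the tidyverse, is loaded.

```r
library(dplyr)  # part of the tidyverse

# Toy stand-in for the jail CSV. All column names and values are hypothetical;
# with the real file this step would be a call such as read.csv("<path to csv>").
jail <- tibble(
  first_name     = c("Ann", "Ann", "Bob", "Cy", "Cy", "Dee", "Ann"),
  middle_name    = c("B", "B", "C", "D", "D", "E", "B"),
  last_name      = c("Lee", "Lee", "Ray", "Poe", "Poe", "Fox", "Lee"),
  date_confined  = c("2015-01-05", "2015-01-05", "2015-02-01", "2015-02-10",
                     "2015-02-10", "2015-03-03", "2015-04-20"),
  bond_type      = c("Secured", "Secured", "Secured", "No Bond", "NB",
                     "Secured", "Secured"),
  bond_amount    = c("$2,000.00", NA, "$500.00", NA, NA, "$10,000.00", "$7,500.00"),
  days_confined  = c(3, 3, 1, 40, 40, 12, 2),
  days_on_charge = c(3, 2, 1, 40, 35, 12, 2)
)

# Build a full name and turn the bond amount into a number.
# as.character() also covers the case where the column was read in as a factor.
jail <- jail %>%
  mutate(
    full_name   = paste(first_name, middle_name, last_name),
    bond_amount = as.numeric(gsub("[$,]", "", as.character(bond_amount)))
  )

no_bond_types <- c("No Bond", "NB")

# Collapse charge-level rows into one row per booking (name + date confined).
summarise_bookings <- function(rows) {
  rows %>%
    group_by(full_name, date_confined) %>%
    summarise(
      total_bond     = sum(bond_amount, na.rm = TRUE),  # missing amounts count as 0
      days_confined  = max(days_confined),
      days_on_charge = max(days_on_charge),
      .groups = "drop"
    )
}

# Split by bond type, then summarise each part.
with_bond <- jail %>% filter(!bond_type %in% no_bond_types) %>% summarise_bookings()
no_bond   <- jail %>% filter(bond_type %in% no_bond_types) %>% summarise_bookings()

# Count bookings in each group, optionally keeping only longer stays.
count_groups <- function(bonded, unbonded, min_days = 0) {
  bonded   <- filter(bonded, days_confined >= min_days)
  unbonded <- filter(unbonded, days_confined >= min_days)
  c(
    no_bond   = nrow(unbonded),
    low_bond  = nrow(filter(bonded, total_bond <= 5000)),
    high_bond = nrow(filter(bonded, total_bond > 5000))
  )
}

all_counts  <- count_groups(with_bond, no_bond)                # every booking
long_counts <- count_groups(with_bond, no_bond, min_days = 2)  # at least two days
```

Grouping on both name and date confined is what keeps the two separate bookings for the same (made-up) person apart in the toy data. Note that in this sketch a bonded booking whose amounts are all missing would sum to 0 and land in the low-bond group, so the real data is worth checking for that case.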
The Final Product
From my data analysis, I came up with the following numbers:
- There were 7921 total bookings in the data set (January to September 2015)
- Of those, 1333 had no bond, 1978 had bonds higher than $5000, and 4610 had bonds less than or equal to $5000
- Of the 7921 bookings, 3519 (about 44%) stayed in the jail for at least two days
- Of these longer-staying prisoners, 1242 had no bond, 1141 had bonds higher than $5000, and 1136 had bonds less than or equal to $5000
This aligns with the journalist’s suspicions regarding the original statement. Although a majority (58%) of all bookings carried bonds of $5000 or less, that group made up a much smaller share (only 32%) of the people who stayed in the jail for at least two days.
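For anyone who wants to check the percentages, they follow directly from the counts reported above. The short snippet below just redoes that arithmetic in R; the variable names are mine and are only there for illustration.

```r
# Counts reported in the findings above
total_bookings <- 7921
no_bond_all    <- 1333
high_all       <- 1978
low_all        <- 4610   # bonds of $5000 or less, all bookings

long_stay      <- 3519   # bookings confined for at least two days
no_bond_long   <- 1242
high_long      <- 1141
low_long       <- 1136   # bonds of $5000 or less, longer stays only

# The three groups should add up to the totals
stopifnot(no_bond_all + high_all + low_all == total_bookings)
stopifnot(no_bond_long + high_long + low_long == long_stay)

# Percentages, rounded to the nearest whole number
pct_low_all   <- round(100 * low_all / total_bookings)
pct_long_stay <- round(100 * long_stay / total_bookings)
pct_low_long  <- round(100 * low_long / long_stay)
```

Because no-bond bookings stay in the denominator, as the journalist asked, the low-bond share among longer stays is measured against all 3519 of those bookings rather than only the ones with a bond amount.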