Data Preparation

Published

April 20, 2025

Modified

May 22, 2025

Inconsistent reporting standards across sources are the first hurdle to using the data: data sets from different agencies vary in their columns, column names, date/time formats, coordinate reference systems (CRS), treatment of missing values, (lack of) treatment of duplicates, shape (long versus wide), file types, and so on. Sometimes the reporting standard changes multiple times over the years even within the same agency. Although the code written to handle these edge cases is interesting in its own right, this section focuses on explaining:

  1. what columns/variables are included for analysis and how different data sets can be harmonized,

  2. the treatment of missing values and the effort made to minimize data loss.

Unique incident identifiers

Very often the data comes with one or more columns that could identify an incident. One common type of ID is an ObjectId that exists within a database table; this type of ID is unique within a file but may collide when data is stored across multiple tables/files (such as yearly tables). It is not particularly informative either, and depending on how data is inserted into the database, two different ObjectIds may even point to the same incident, so it is avoided when possible. Another common type of ID is a Case ID or Incident ID that police departments (PDs) use for internal record keeping (and which is not standardized across agencies); this type of ID is informative, and duplicates can usually be traced to one of the following reasons:

  • Complete duplicates of the entire row - this most likely results from a data-handling oversight, meaning the same data was inserted twice (or more) without controlling for uniqueness. In this case the last occurrence is kept (although any one occurrence would do).

  • Duplicates of the same incident but with different populated columns - this typically happens when adding information to an existing case creates a new row, or when rows represent not incidents but some other unit, such as 911 calls (multiple calls pointing to the same incident) or victims (multiple victims in one crime). In this case only the last occurrence is kept so the crime is counted once.

  • Multiple crimes for the same incident - this typically happens when multiple offenses are perpetrated in the same incident: each crime gets its own row but shares the same case number. In this case all occurrences are kept and an alternative ID is used or generated, typically by concatenating the case number and the offense type so that each offense type is identified once per incident (see the sketch after this list).
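
To make these treatments concrete, the following is a minimal data.table sketch, assuming a hypothetical harmonized table dt with the CaseID and IncidentCategory columns introduced in the tables below; depending on the data set, only one of the treatments is applied.

library(data.table)

# `dt` is a hypothetical harmonized data.table with CaseID and
# IncidentCategory columns; apply whichever treatment fits the data set

# 1) complete duplicates of the entire row: keep any one occurrence
dt <- unique(dt)

# 2) duplicates of the same incident with different populated columns:
#    keep only the last occurrence per case so the crime is counted once
dt <- unique(dt, by = "CaseID", fromLast = TRUE)

# 3) multiple crimes in the same incident: keep all rows, but build an
#    alternative ID so each offense type is identified once per incident
dt[, AltID := paste(CaseID, IncidentCategory, sep = "_")]
dt <- unique(dt, by = "AltID")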

Reporting consistency matters for data quality

The treatment of duplicates varies across data sets to offset the different reporting practices adopted in the first place. It is assumed that, for a given agency, whichever approach it uses, its practice is consistent over time, or at least consistent within files that share the same columns. Sometimes agencies change their reporting standards and explicitly distinguish between “current” and “historical” data; this poses no problem, as the old and new data sets are handled as if they came from different agencies.

In each step of the cleaning process, the number of rows dropped is monitored and will be looked into if an abnormally large number of rows is lost; these steps are i) checking for duplicates, ii) mapping coordinates to census tracts, iii) filling incident dates, and iv) filling crime categories. In steps iii) and iv), the go-to response to missing data is to coalesce information from other columns, and rows are dropped only when absolutely necessary.

Another informative ID would be one at the FBI level, such as an NIBRS ID or UCR ID, as it is standardized and unique across agencies; however, many agencies do not participate in NIBRS, so an agency-level Case ID remains the most practical choice.

Column harmonization

For each data set, in addition to an ID column, various essential and nice-to-have columns are identified and converted to a common set of column names and data types. Essential columns must be present for the data set to be included in the project. Nice-to-have columns may be missing; they are nonetheless identified for possible future use. The following tables present a (non-exhaustive) list of harmonized columns, with their names, types, and descriptions.

Name Type Description
CaseID character Incident identifier

The essential date-time columns are either Date (for date) or POSIXct (for date-time) objects storing the incident date or report date; they cannot all be missing. The incident date is preferred to the report date, but when the former is missing, the latter is used.

Name Type Description
IncidentDate Date Date of offense, typically in the form of “ymd” or “mdy”
IncidentDatetime POSIXct Date-time of offense, typically in the form of “ymd_hms” or “mdy_hms”
ReportDate Date Date of report
ReportDatetime POSIXct Date-time of report
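
As a rough sketch of this fallback, the raw date columns can be parsed with lubridate and then coalesced; the table dt, the raw column names, and the date formats below are hypothetical and vary by agency.

library(data.table)
library(lubridate)

# parse the raw columns into proper Date objects (formats vary by agency)
dt[, IncidentDate := ymd(RawIncidentDate)]  # e.g. "2024-06-01"
dt[, ReportDate   := mdy(RawReportDate)]    # e.g. "06/03/2024"

# prefer the incident date; fall back to the report date when it is missing
dt[, IncidentDate := fcoalesce(IncidentDate, ReportDate)]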

The essential crime category columns are as follows. IncidentCategory is meant to capture the crime-category column that aligns best with the FBI Offense Definitions 2019, and IncidentDescription the most granular offense description when available; if only one crime-category column is available, it is taken automatically as IncidentCategory. Ideally the categories would be of factor type, but in practice they are kept as character type, allowing entries of any text.

Name Type Description
IncidentCategory character Broad crime category, such as “Larceny Theft”
IncidentCategory2 character Finer crime category OR alternative column
IncidentCategory3 character Similar to above
IncidentDescription character Finest crime category
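
A hypothetical sketch of aligning a raw category column with the FBI definitions is shown below; the raw labels are invented for illustration, and the actual mapping is built per agency.

library(data.table)

# the raw labels on the left are invented for illustration; the real mapping
# is constructed per agency against the FBI Offense Definitions 2019
dt[, IncidentCategory := fcase(
  RawCategory %chin% c("THEFT", "LARCENY"), "Larceny Theft",
  RawCategory %chin% c("BURGLARY - RESIDENTIAL", "BURGLARY - COMMERCIAL"), "Burglary",
  RawCategory %chin% c("MOTOR VEHICLE THEFT", "AUTO THEFT"), "Motor Vehicle Theft",
  default = as.character(RawCategory) # pass anything unmapped through as-is
)]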

The essential geometry columns could be coordinates, sfc_POINT, or some city-specific grids that can be mapped to the census tract level (which rules out large geographic areas such as zip code).

Name Type Description
X double or int Some sort of X-coordinate, such as longitude or easting
Y double or int Some sort of Y-coordinate, such as latitude or northing
Point sfc_POINT Simple feature (sf) POINT object containing XY
Geocode character or numeric City-specific geo-codes of crime incidents

Technically, a street address can also be converted into geographic coordinates (known as “address geocoding” or simply “geocoding”), but cities that report only street or block addresses are not included at the moment due to technical (financial) hurdles.

Name Type Description
Intersection character Street address, typically given as the nearest intersection or 100 block

Coordinates can come in many different systems; one of the most commonly known is the spheroid-based World Geodetic System (WGS 84). For US-based agencies, however, coordinates sometimes come in a projected, two-dimensional Cartesian system based on the North American Datum 1983 (NAD 83). For example, the coordinates of Millennium Park in Chicago can be represented as 41.8826°N, 87.6226°W (WGS 84) or 1177789 East, 1900624 North (NAD83 / Illinois East ftUS, EPSG:3435).
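
The conversion between the two representations can be reproduced with sf, as in the sketch below, which uses the Millennium Park example above.

library(sf)

# build a WGS 84 point, noting the (x, y) = (longitude, latitude) order,
# then project it into NAD83 / Illinois East ftUS (EPSG:3435)
pt <- st_sfc(st_point(c(-87.6226, 41.8826)), crs = 4326)
st_transform(pt, 3435)
# the transformed point should land near 1177789 East, 1900624 North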

Some nice-to-have columns could be,

Name Type Description
Level factor Whether the crime is a felony or a misdemeanor
Property value numeric If property crime, value of damage or loss
Resolution factor Resolution of the incident, such as whether arrest is made
Premise type character Incident location, such as store, residence, or street

Victim information (age, gender, race, etc.) could also help unpack whether certain demographics are disproportionately affected by crime. In general, reporting of these nice-to-have columns is much more inconsistent across agencies; however, if an agency reports to NIBRS, it is more likely that this information can be obtained and linked to the geocoded data.

Census tract mapping

The census tract is chosen as the geographic unit of analysis to strike a balance between practicality and economic meaningfulness. US Census tracts are delineated every 10 years (at the beginning of a decade: the 2010s, the 2020s, and so on) as zones with a population between 1,200 and 8,000, with an optimum size of 4,000, designed to be homogeneous with respect to population, economic, and housing characteristics. Census tracts are one link in the chain of “nested” geographic units: they partition counties (which in turn partition states, and so on), so census-tract-level data, while providing additional insight compared to larger areas, can still easily be grouped up to the county level or higher when desired.

The Census block, not the Census tract, is the smallest nested geographic unit, but blocks are often too small to be practical; their usefulness mostly comes in when a geographic crosswalk is needed.

The GEOID of a Census tract is made up of 11 digits that identify the Federal Information Processing Series (FIPS) codes of the state, the county, and the Census tract itself, pasted together. An example from the Census Bureau illustrates the idea: consider Census tract 48201223100, which breaks down into:

Area Digits Example Area Example GEOID
State 2 Texas 48
County 2+3=5 Harris County, TX 48201
Census Tract 2+3+6=11 Census Tract 2231 in Harris County, TX 48201223100
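
Assembling the GEOID is then just a matter of pasting the FIPS codes together, keeping them as character so that leading zeros are preserved.

# the FIPS codes from the example above
state  <- "48"      # Texas (2 digits)
county <- "201"     # Harris County (3 digits)
tract  <- "223100"  # Census Tract 2231 (6 digits, two implied decimal places)
paste0(state, county, tract)
# "48201223100"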

Given the shapefiles of Census tracts (geospatial vectors describing the boundaries of Census tracts in a given coordinate reference system), the function st_within can determine whether a set of coordinates (an sfc_POINT) falls into any of the Census tracts. In the past, shapefiles could be queried directly from the Census API in R using tidycensus; however, as of April 2025, Census has stopped responding to queries for geometry information. NHGIS serves as a good backup, providing boundaries derived from the Census Bureau’s TIGER/Line files.

The following code fetches the Census tract GEOID given a crime data set with X- and Y-coordinates, data.interest, and a data set of boundaries, data.shape. Rows removed due to missing coordinates and rows removed because they do not match any Census tract are called out explicitly and are looked into if warranted. On some very rare occasions a POINT may fall into multiple Census tracts; those rows are also removed and called out. All data should be data.table, not data.frame, except of course the shape data (an sf object).

library(sf)          # st_as_sf, st_transform, st_within, st_crs
library(dplyr)       # filter, mutate, left_join, select and the pipe
library(data.table)  # .SD / .SDcols syntax on the data.table inputs
library(parallel)    # makeCluster, clusterExport, parLapplyLB

get.geometry <- function(data.interest, coords.name, data.shape, parallel = TRUE, crs = "EPSG:4326") {
  nrow.c <- nrow(data.interest)
  data.interest <- data.interest %>%
    filter(!(!!rlang::sym(coords.name[1]) %in% c("", "NA"))) %>%
    # don't refactor this using any(), you tried; omit if latitude or longitude is NA
    filter(!is.na(!!rlang::sym(coords.name[1]))) %>%
    filter(!is.na(!!rlang::sym(coords.name[2]))) %>%
    # omit if latitude or longitude is (floating point) zero
    filter(!abs(!!rlang::sym(coords.name[1])) < 1e-6) %>%
    filter(!abs(!!rlang::sym(coords.name[2])) < 1e-6)
  nrow.d <- nrow(data.interest)
  
  if (nrow.d < nrow.c) {cat("Dropped", nrow.c - nrow.d, "rows out of", nrow.c, "with no coordinates \n")}
  
  # get unique subset for mapping
  data.interest <- data.interest %>%
    mutate(tempID = as.factor(paste(!!rlang::sym(coords.name[1]), !!rlang::sym(coords.name[2]))))
  unique.coords <- data.interest %>%
    dplyr::select(tempID, !!rlang::sym(coords.name[1]), !!rlang::sym(coords.name[2])) %>%
    unique(by = "tempID")
  
  unique.coords$Geometry <- st_as_sf(
    as.data.frame(unique.coords %>% dplyr::select(all_of(coords.name))),
    coords = coords.name, crs = st_crs(crs)
  ) %>% # assuming coords come in WGS84, i.e., EPSG:4326
    st_transform(crs = st_crs(data.shape)) # convert to crs of shape file
  
  if (parallel) {
    cl <- makeCluster(detectCores(logical = FALSE) - 1, type = "PSOCK")
    clusterExport(cl, varlist = c("data.shape"), envir = environment())
    # make sure a list is passed in 2nd argument, can be unique.coords[, "Geometry"]
    # but not unique.coords[, `Geometry`] or unique.coords[, Geometry]
    unique.coords$block <- parLapplyLB(cl, list(unique.coords$Geometry), st_within, data.shape)
    stopCluster(cl)
    # gc() # un-comment garbage collection gc() if you are tight on ram
  } else{ unique.coords <- unique.coords %>% mutate(block = st_within(Geometry, data.shape)) }
  
  # do filtering here, before joining
  in.none = sum(unique.coords$block %>% lengths == 0)
  in.multiple = sum(unique.coords$block %>% lengths > 1)
  if (in.none > 0) { cat("Removed", in.none, "coordinates with unmatched geometry", "\n") }
  if (in.multiple > 0) { cat("Removed", in.multiple, "coordinates with multiple matched blocks", "\n") }
  
  data.interest <- data.interest %>%
    left_join(unique.coords[, .SD, .SDcols = !c(coords.name)], by = "tempID") %>%
    # call outs are for *unique coordinates*, number of rows removed may be higher
    filter(block %>% lengths > 0) %>% # remove (empty) in Sparse geometry binary predicate (sgbp) list
    filter(block %>% lengths < 2) %>% # also remove points within multiple shapes
    mutate(GEOID = data.shape$GEOID[as.numeric(unlist(block))]) %>%
    dplyr::select(-tempID)
  
  return(data.interest)
}

Running st_within on distinct coordinates

You may very reasonably assume that the probability of two incidents falling on exactly the same coordinates is zero; in practice, however, most police departments locate crimes at either the nearest intersection or the 100 block address and then geocode accordingly. This greatly reduces the set of possible coordinates, and since the number of intersections in a city does not grow with the number of incidents, identifying unique coordinates before carrying out st_within speeds up the code considerably. The get.geometry function above implements this strategy; it also automatically removes coordinates that fail to match any polygon or, on very rare occasions for Census tracts, match multiple polygons.

Shapefile

If you are reading local files as your data.shape, you should have a set of .shp, .shx, and .dbf files, and you would read the .shp file using sf::st_read.
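
Putting the pieces together, a call could look like the sketch below; the file names and coordinate column names are placeholders, and the boundary data is assumed to carry a GEOID column.

library(sf)
library(data.table)

# read the boundaries (the .shx and .dbf files must sit next to the .shp)
data.shape    <- st_read("nhgis_tracts_2020.shp")
# read the harmonized crime data as a data.table
data.interest <- fread("incidents_harmonized.csv")

data.interest <- get.geometry(
  data.interest = data.interest,
  coords.name   = c("X", "Y"),  # X- then Y-coordinate column names
  data.shape    = data.shape,
  parallel      = TRUE,
  crs           = "EPSG:4326"   # CRS the raw coordinates are assumed to come in
)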

Copyright © April 2025. Hung Kit Chiu. Please do not circulate outside of UCSB Econ.