```r
get.geometry <- function(data.interest, coords.name, data.shape, parallel = TRUE, crs = "EPSG:4326") {
nrow.c <- nrow(data.interest)
data.interest <- data.interest %>%
filter(!(!!rlang::sym(coords.name[1]) %in% c("", "NA"))) %>%
# don't refactor this using any(), you tried; omit if latitude or longitude is NA
filter(!is.na(!!rlang::sym(coords.name[1]))) %>%
filter(!is.na(!!rlang::sym(coords.name[2]))) %>%
# omit if latitude or longitude is (floating point) zero
filter(!abs(!!rlang::sym(coords.name[1])) < 1e-6) %>%
filter(!abs(!!rlang::sym(coords.name[2])) < 1e-6)
nrow.d <- nrow(data.interest)
if (nrow.d < nrow.c) { cat("Dropped", nrow.c - nrow.d, "rows out of", nrow.c, "with missing or zero coordinates \n") }
# get unique subset for mapping
data.interest <- data.interest %>%
mutate(tempID = as.factor(paste(!!rlang::sym(coords.name[1]), !!rlang::sym(coords.name[2]))))
unique.coords <- data.interest %>%
dplyr::select(tempID, !!rlang::sym(coords.name[1]), !!rlang::sym(coords.name[2])) %>%
unique(by = "tempID")
unique.coords$Geometry <- st_as_sf(
as.data.frame(unique.coords %>% dplyr::select(all_of(coords.name))),
coords = coords.name, crs = st_crs(crs)
) %>% # assuming coords come in WGS 84, i.e., EPSG:4326
st_transform(crs = st_crs(data.shape)) %>% # convert to crs of shape file
st_geometry() # keep only the sfc_POINT column, not a full sf data frame
if (parallel) {
# split the points into one chunk per worker; wrapping the whole geometry
# column in list() would send all the work to a single worker
n.workers <- max(1L, detectCores(logical = FALSE) - 1L)
cl <- makeCluster(n.workers, type = "PSOCK")
clusterExport(cl, varlist = c("data.shape"), envir = environment())
clusterEvalQ(cl, library(sf)) # make sure sf is loaded on the workers
chunks <- lapply(splitIndices(nrow(unique.coords), n.workers),
function(i) unique.coords$Geometry[i])
unique.coords$block <- do.call(c, parLapplyLB(cl, chunks, st_within, data.shape))
stopCluster(cl)
# gc() # un-comment garbage collection gc() if you are tight on RAM
} else { unique.coords <- unique.coords %>% mutate(block = st_within(Geometry, data.shape)) }
# do filtering here, before joining
in.none = sum(unique.coords$block %>% lengths == 0)
in.multiple = sum(unique.coords$block %>% lengths > 1)
if (in.none > 0) { cat("Dropped", in.none, "coordinates with unmatched geometry", "\n") }
if (in.multiple > 0) { cat("Dropped", in.multiple, "coordinates with multiple matched blocks", "\n") }
data.interest <- data.interest %>%
left_join(unique.coords[, .SD, .SDcols = !c(coords.name)], by = "tempID") %>%
# call outs are for *unique coordinates*, number of rows removed may be higher
filter(block %>% lengths > 0) %>% # remove (empty) in Sparse geometry binary predicate (sgbp) list
filter(block %>% lengths < 2) %>% # also remove points within multiple shapes
mutate(GEOID = data.shape$GEOID[as.numeric(unlist(block))]) %>%
dplyr::select(-tempID)
return(data.interest)
}
```

Data Preparation
Inconsistent reporting standards between different sources constitute the first hurdle to using the data; the data sets from different agencies vary in terms of columns, names of columns, formats of date/times, coordinate reference systems (CRS), treatment of missing values, (lack of) treatment of duplicates, shapes (long versus wide), file types, etc. Sometimes, even for the same agency, the reporting standard changes multiple times over the years. Although the code written to deal with all these edge cases is quite involved, this section will mainly focus on explaining:
what columns/variables are included for analysis and how different data sets can be harmonized,
the treatment of missing values and what efforts have been made to minimize data loss.
1 Unique incident identifiers
Very often the data comes with one or multiple columns that could identify an incident. One common type of ID is an ObjectId that exists within a database table; this type of ID is unique within a file but may be duplicated when data is stored across multiple tables/files (such as yearly tables). It is not particularly informative either, and depending on how data is inserted into the database, two different IDs may even point to the same incident, so it is usually avoided when possible. Another common type of ID is a Case ID or Incident ID that police departments (PDs) use for internal record keeping (and which is not standardized across different agencies); this type of ID is informative, and duplicates can usually be traced to one of the following reasons:
Complete duplicates of the entire row - this most likely happens as a result of a data handling oversight, meaning the same data was inserted twice (or more) without controlling for uniqueness. In this case the last occurrence is kept (though any one occurrence could be kept).
Duplicates of the same incident but with different populated columns - this typically happens when adding additional information to an existing case creates a new row, or when rows are based not on incidents but on some other subject, such as 911 calls (multiple calls pointing to the same incident) or victims (multiple victims in one crime). In this case only the last occurrence is kept so the crime is counted only once.
Multiple crimes for the same incident - this typically happens when multiple offenses are perpetrated in the same incident; each crime has its own row but shares the same case number. In this case all occurrences are kept and an alternative ID is used or generated, typically by concatenating the case number and offense type, so that each offense type is identified once per incident; a minimal sketch is given below.
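A minimal data.table sketch of the three treatments, using hypothetical columns CaseID and Offense (names and values are illustrative, not from any actual data set):

```r
library(data.table)
dt <- data.table(CaseID  = c("24-001", "24-001", "24-002", "24-002"),
                 Offense = c("Larceny Theft", "Larceny Theft", "Assault", "Robbery"))
# complete / partially populated duplicates: keep the last occurrence per case number
unique(dt, by = "CaseID", fromLast = TRUE)
# multiple offenses per incident: identify each offense type once per incident
dt[, AltID := paste(CaseID, Offense, sep = "_")]
unique(dt, by = "AltID", fromLast = TRUE)
```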
The treatment of duplicates varies across data sets to offset the different reporting practices adopted in the first place, and it is assumed that for any given agency, no matter which approach they use, their practice is consistent over time, or at least consistent within files that share the same columns. Sometimes agencies change their reporting standards and explicitly distinguish between “current” and “historical” data; in this case there is no problem, as old and new data sets are handled as if they come from different agencies.
In each step of the cleaning process, the number of rows dropped is monitored and will be looked into if an abnormally large number of rows is lost; these steps are i) checking for duplicates, ii) mapping coordinates to census tracts, iii) filling incident dates, and iv) filling crime categories. In steps iii) and iv), the go-to response to missing data is to coalesce information from other columns, and rows are only dropped when absolutely necessary.
Another informative ID would be one at the FBI level, such as an NIBRS ID or UCR ID, as it is standardized and unique across agencies; however, many agencies do not participate in NIBRS, and thus an agency-level Case ID remains the most practical.
2 Column harmonization
For each data set, in addition to an ID column, various essential and nice-to-have columns are identified and converted to a common set of column names and data types. Essential columns must be present for the data set to be included in the project. Nice-to-have columns can be missing; nonetheless they are identified for possible future use. The following tables present a (non-exhaustive) list of harmonized columns, with name, type, and description.
| Name | Type | Description |
|---|---|---|
| CaseID | character | Incident identifier |
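Harmonization is then mostly a matter of renaming (and re-typing) columns; a minimal data.table sketch, with hypothetical raw column names from one department:

```r
library(data.table)
# map one department's raw column names to the harmonized names (illustrative)
setnames(dt,
         old = c("case_number", "occurred_on", "primary_type"),
         new = c("CaseID", "IncidentDate", "IncidentCategory"),
         skip_absent = TRUE) # tolerate columns that a given department lacks
dt[, CaseID := as.character(CaseID)] # enforce the harmonized type
```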
The essential date-time columns are either Date (for date) or POSIXct (for date-time) objects storing the incident date or report date; they cannot all be missing. The incident date is preferred to the report date, but when the former is missing, the latter is used.
| Name | Type | Description |
|---|---|---|
| IncidentDate | Date | Date of offense, typically in the form of “ymd” or “mdy” |
| IncidentDatetime | POSIXct | Date-time of offense, typically in the form of “ymd_hms” or “mdy_hms” |
| ReportDate | Date | Date of report |
| ReportDatetime | POSIXct | Date-time of report |
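A sketch of the fallback rule, assuming a data.table dt holding the harmonized columns (BestDate is a hypothetical name):

```r
library(data.table)
# prefer the incident date; fall back to the report date when it is missing
# (fifelse, unlike base ifelse, preserves the Date class)
dt[, BestDate := fifelse(is.na(IncidentDate), ReportDate, IncidentDate)]
```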
The essential crime category columns are as follows: IncidentCategory is supposed to capture the crime category column that aligns best with the FBI Offense Definitions 2019, and IncidentDescription is supposed to capture the most granular offense description when available; however, if only one crime category column is available, it is automatically taken as IncidentCategory. Ideally the categories would be of factor type, but in practice they are essentially of character type, allowing entries of any text.
| Name | Type | Description |
|---|---|---|
| IncidentCategory | character | Broad crime category, such as “Larceny Theft” |
| IncidentCategory2 | character | Finer crime category OR alternative column |
| IncidentCategory3 | character | Similar to above |
| IncidentDescription | character | Finest crime category |
The essential geometry columns could be coordinates, an sfc_POINT, or some city-specific grid that can be mapped to the census tract level (which rules out large geographic areas such as zip codes).
| Name | Type | Description |
|---|---|---|
| X | double or int | Some sort of X-coordinate, such as longitude or easting |
| Y | double or int | Some sort of Y-coordinate, such as latitude or northing |
| Point | sfc_POINT | Simple feature (sf) POINT object containing XY |
| Geocode | character or numeric | City-specific geo-codes of crime incidents |
Technically a street address can also be converted into geographic coordinates (known as “address geocoding” or simply “geocoding”), but cities with only street or block addresses are not included at the moment due to technical (financial) hurdles.
| Name | Type | Description |
|---|---|---|
| Intersection | character | Street address, typically given as the nearest intersection or 100 block |
Coordinates can come in many different systems; the most commonly known is the spheroid World Geodetic System (WGS 84). For US-based agencies, however, coordinates sometimes come in a projected, two-dimensional Cartesian system based on the North American Datum 1983 (NAD 83). For example, the coordinates of Millennium Park in Chicago can be represented as 41.8826°N, 87.6226°W (WGS 84) or 1177789 East, 1900624 North (NAD83 / Illinois East ftUS, EPSG:3435).
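The conversion between the two representations can be checked with sf; a quick sketch (output rounded, and the exact digits depend on the PROJ version installed):

```r
library(sf)
# Millennium Park, Chicago: WGS 84 longitude/latitude
pt <- st_sfc(st_point(c(-87.6226, 41.8826)), crs = "EPSG:4326")
# project to NAD83 / Illinois East (ftUS)
st_transform(pt, crs = "EPSG:3435")
# ~ POINT (1177789 1900624), in US survey feet
```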
Some nice-to-have columns could be:
| Name | Type | Description |
|---|---|---|
| Level | factor | Whether the crime is a felony or a misdemeanor |
| Property value | numeric | If property crime, value of damage or loss |
| Resolution | factor | Resolution of the incident, such as whether arrest is made |
| Premise type | character | Incident location, such as store, residence, or street |
Victim information (age, gender, race, etc.) could also be helpful in unpacking whether certain demographics are disproportionately affected by crime. In general, the reporting of these nice-to-have columns is much more inconsistent across agencies; however, if an agency reports to NIBRS, it is more likely that this information can be obtained and linked to the geocoded data.
3 Mapping crime incidents to Census tracts
The Census tract is chosen as the geographic unit of analysis to strike a balance between practicality and economic meaningfulness. US Census tracts are delineated every 10 years (at the beginning of each decade: 2010, 2020, and so on) to be zones with a population between 1,200 and 8,000, with an optimum size of 4,000, that are designed to be homogeneous with respect to population, economic, and housing characteristics. Census tracts are one link in the chain of “nested” geographic units, forming a partition of Counties (which in turn partition States, and so on), meaning Census tract level data, while providing additional insights compared to larger areas, can still be easily grouped to the County or higher levels when desired.
A Census block, not a Census tract, is the smallest nested geographic unit, but blocks are often too small to be practical; their usefulness mostly comes in when a geographic crosswalk is needed.
The GEOID of a Census tract is made up of 11 digits that identify the Federal Information Processing Series (FIPS) codes of the State, the County, and the Census tract itself, pasted together. An example from the Census Bureau illustrates the idea; the census tract 48201223100 can be broken down into:
| Area | Digits | Example Area | Example GEOID |
|---|---|---|---|
| State | 2 | Texas | 48 |
| County | 2+3=5 | Harris County, TX | 48201 |
| Census Tract | 2+3+6=11 | Census Tract 2231 in Harris County, TX | 48201223100 |
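As a quick sanity check, the GEOID can be assembled from its zero-padded FIPS parts:

```r
# state (2 digits) + county (3 digits) + tract (6 digits) = 11-digit GEOID
sprintf("%02d%03d%06d", 48, 201, 223100)
# [1] "48201223100"
```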
Given the shapefiles of Census tracts (geospatial vectors describing the boundaries of Census tracts on a given coordinate reference system), the function st_within can determine whether a set of coordinates (an sfc_POINT) falls into any of the Census tracts. In the past, shapefiles could be queried directly from the Census API in R using tidycensus; however, as of April 2025, Census has stopped responding to queries of geometry information. NHGIS comes in as a good backup, providing boundaries that are derived from the Census Bureau’s TIGER/Line files.
The code at the beginning of this section is used to fetch the Census tract GEOID given a crime data set with X- and Y-coordinates, data.interest, and a data set of boundaries, data.shape. Rows that are removed due to missing coordinates, and rows that are removed due to not matching any Census tract, are called out explicitly and will be looked into if warranted. On some very rare occasions, a POINT could fall into multiple Census tracts; those rows are also removed and called out. All data should be data.table, not data.frame, except the shape data, obviously.
st_within on distinct coordinates
You may very reasonably assume that the probability of two incidents falling on exactly the same coordinates is zero; in practice, however, most police departments locate crimes to either the nearest intersection or the 100 block address and then geocode accordingly. This greatly reduces the set of possible coordinates, and since the number of intersections in a city does not grow with the number of incidents, identifying unique coordinates before carrying out st_within considerably speeds up the code. The get.geometry function above implements this strategy; it also automatically removes coordinates that fail to match any polygon or, on very rare occasions for Census tracts, match multiple polygons.
If you are reading local files as your data.shape, you should have a set of .shp, .shx, and .dbf files, and you would read the .shp file using sf::st_read.
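A hypothetical call of get.geometry, assuming coordinate columns named Longitude and Latitude and a local NHGIS boundary file (the file name is illustrative):

```r
library(sf); library(dplyr); library(data.table); library(parallel)
data.shape <- st_read("nhgis_tracts_2020.shp") # .shx and .dbf must sit alongside
data.interest <- get.geometry(data.interest,
                              coords.name = c("Longitude", "Latitude"),
                              data.shape = data.shape,
                              parallel = TRUE, crs = "EPSG:4326")
```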
4 Mapping Census tracts to city boundaries
Because cities are not in the chain of nested Census geographic units, one needs to determine whether a Census tract actually “lies” within a city, i.e., is policed by the corresponding police department. The task of this section is to include Census tracts that fall inside the city boundary and exclude those that are only partially (or not at all) policed by the city PD. It is common for a Census tract to receive a tiny number of cases without intersecting the city boundary at all; this happens when the PD goes out of its jurisdiction, and such Census tracts should be excluded from the data unless they fall under the jurisdiction of another city that is also covered by the data.
Given the proximity between the Boston police department and the Cambridge police department (both included in the data), it is possible that either of them will report cases that occur in the other’s jurisdiction, and since crime counting happens at the Census tract level, in this case the Census tract should be kept and the count should simply be the total number of cases reported by both departments. On the other hand, there could be independent or unincorporated places inside a city, essentially creating holes in the city boundary (such as Santa Monica in Los Angeles); despite the city PD having no jurisdiction there, the Census tracts there might pick up a few counts of crime from time to time, and in this case these Census tracts should be excluded from the data.
With this objective in mind, the following steps should be carried out after the mapped data is grouped by Census tract, as there are Census tracts that appear in multiple cities’ crime data; specifically, dt[, lapply(.SD, sum, na.rm = FALSE), by = c("IncidentYear", "GEOID")] will do the job, given that all columns other than the time and tract ID are crime counts (here NA values should be kept to preserve the knowledge of which crimes are not reported by a given PD). All PDs or cities included in this data publish either their police districts or their city boundaries, and the shape data for each of them has been acquired; each Census tract is then compared against a reasonable set of city boundaries using a two-step process:
1. Determine if the Census tract is completely inside (sf::st_within) or does not touch at all (sf::st_intersects returns FALSE) the boundary of the police department jurisdiction; keep the former and exclude the latter. For details of the operations, see either the sf package documentation or the PostGIS documentation.
2. For tracts that are in between, calculate the proportion of the tract’s area that lies inside the target police jurisdiction boundary (by chaining sf::st_intersection and sf::st_area), then keep those with more than half of their area inside; a sketch of both steps follows this list. This is by no means a perfect distinction, but edge cases (the proportion of intersecting area being very close to 0.5) are rather uncommon in the first place, as the boundaries of Census tracts and cities often coincide.
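A minimal sketch of the two-step screen, assuming tracts and city are sf objects already in the same CRS (object names are illustrative):

```r
library(sf)
city.union <- st_union(st_geometry(city)) # one polygon for the jurisdiction
# step 1: completely inside vs. no overlap at all
inside  <- lengths(st_within(tracts, city.union)) > 0
touch   <- lengths(st_intersects(tracts, city.union)) > 0
partial <- touch & !inside
# step 2: for partially overlapping tracts, the share of area inside the boundary
share <- rep(0, nrow(tracts))
ix <- which(partial)
share[ix] <- as.numeric(st_area(st_intersection(tracts[ix, ], city.union))) /
  as.numeric(st_area(tracts[ix, ]))
tracts.in.city <- tracts[inside | share > 0.5, ]
```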
When the target city of interest is a stand-alone city, the matching can happen as is; however, for cities that are close to, next to, or even inside one another (such as Phoenix-Chandler-Gilbert-Tempe in Arizona and Henderson-Las Vegas in Nevada), one should not try to match Census tracts to the city boundaries sequentially; instead, one should match the Census tracts against the union of that set of city polygons, as it is possible that a Census tract is shared between them. Technically, one could also compare each and every Census tract against one big unified polygon of all cities in the data; however, this would be rather inefficient, as spatial operations like st_within or st_intersection are computationally expensive.
The most accurate way of acquiring a city PD’s jurisdiction boundary is to download the police districts shapefile published by the PD itself, one by one; when that is not available, the city council districts or the city boundary can be a good alternative. For bulk downloads, one could also try tigris::places for the boundaries of all Census designated places, which are usually equivalent to those of cities; a sketch follows.
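For example, a hedged sketch using tigris (the filter on NAME is illustrative, and tigris downloads TIGER/Line files, so internet access is required):

```r
library(tigris)
library(sf)
# all Census designated places in Georgia, returned as an sf object
ga.places <- places(state = "GA", year = 2020)
atlanta <- ga.places[ga.places$NAME == "Atlanta", ]
```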
As a demonstration, Figure 1 shows a sequence of four layers: 2010 Census tracts (limited to within 2 km of Atlanta’s boundary), 2020 crime incidents from the Atlanta PD, the Atlanta city boundary, and the Census tracts determined to be inside Atlanta using the procedure outlined in this section.
Some observations to note:
- There are incidents that happen outside of Atlanta PD’s jurisdiction (see the top-right corner); the tracts that these incidents fall into are excluded.
- Around the fringes of Atlanta, most tracts coincide quite well with the city boundary; those with over 50% overlap are included, and vice versa.
- Tracts that are deemed inside the city boundary but have never recorded any crime over the entire panel of data are excluded. For example, the tract that covers Hartsfield-Jackson Airport’s runway (towards the bottom) is excluded because it has never recorded any incidents over the entire history of the data (not just in 2020), while the tract labelled 118 towards the right is included because there are recorded incidents in years other than 2020 (not shown here).
Copyright © April 2025. Hung Kit Chiu.