Census Data

Published: May 5, 2025
Modified: May 11, 2025

Features of Census data to note
  • The most popular surveys are the 1-Year and 5-Year American Community Survey (ACS) and the Decennial Census. This project mainly utilizes ACS5, which is available at the Census tract level from 2009 onward through the Census API. Earlier surveys (ACS1, starting from 2005) are only publicly available for geographies with larger populations, typically the county level.

  • The Census Bureau redraws boundaries every decade to reflect changes in population; thus Census tracts are delineated differently every 10 years, and data from different decades have to be “crosswalked”. This page provides the code needed to query and crosswalk Census data. For details on geo-crosswalks, also see the guidelines by NHGIS. NHGIS provides the geo-crosswalk files for mapping data from the 2000s and 2020s to the 2010s vintage, which will be the baseline delineation of this project.

  • In light of the need for geo-crosswalks, only count variables should be queried. For example, per capita income B19301_001 equals aggregate income B19313_001 divided by total population B01003_001, but the former cannot be crosswalked, and neither can any other ratio or percentage variable. For these variables, the corresponding numerators and denominators should be queried, crosswalked when needed, and then divided to recover the desired ratio (see the short sketch after this list). By the same token, median variables should also be avoided.
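As a minimal sketch (assuming dplyr is loaded and the counts have already been queried and, if needed, crosswalked into a wide data frame acs.data, a hypothetical name), the ratio is recovered only at the very end:

# recover per capita income from counts; never crosswalk B19301_001 directly
acs.data <- acs.data %>%
  mutate(income.per.capita = B19313_001 / B01003_001)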

tidycensus API

Census data can be queried directly in R using the package tidycensus (see its documentation). To see the complete list of variables that can be queried through the Census API, use load_variables,

census.vars <- load_variables(2010, "acs5")
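The returned table has columns such as name, label, and concept, so candidate variables can be found with ordinary string matching; a small sketch (the search term is only an example, and dplyr is assumed to be loaded):

# search the variable dictionary for rent-related concepts
census.vars %>% filter(grepl("rent", concept, ignore.case = TRUE))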

Some variables that could be of interest to demographic and economic research include:

  • B01003_001: Total population

  • B19313_001: Aggregate income

  • B19301_001: Per capita income

  • B25065_001: Aggregate gross rent

  • B25063_001: Total (the number of renter households, under the concept “Gross Rent”)

Note that ratio variables such as “per capita income” B19301_001 are shown for demonstration only and will not be used. The population of a block group/tract, B01003_001 (almost the same variable as B03001_001), enables the calculation of crime rates and per capita figures, such as dividing “Aggregate income” B19313_001 by it. However, when calculating averages, there is typically a specific denominator for a given “concept”, and it is not necessarily the total population. For example, to get average rent, one should divide “Aggregate gross rent” B25065_001 by “Total” B25063_001, which is the total number of renters and the appropriate denominator for the concept “Gross Rent”, despite the label suggesting otherwise.
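For instance, a minimal sketch of recovering average gross rent from the two aggregates (assuming dplyr is loaded and both variables sit in a wide data frame rent.data, a hypothetical name):

# average gross rent = aggregate gross rent / number of renter households
rent.data <- rent.data %>%
  mutate(avg.gross.rent = B25065_001 / B25063_001)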

“Controls in Difference-in-Differences Don’t Just Work”

It should be noted that in a Difference-in-Differences setting, controls “don’t just work” and should not be treated the same way as one would in a linear regression setting. A detailed exposition by Nick Huntington-Klein illustrates the idea clearly.

You’re not really in the business of trying to account for differences between treated and control groups … we rely on the parallel trends assumption: if treatment had not occurred, the gap between treated and control groups would remain constant over time. And that’s what controls are for. Not to help justify a conditional independence assumption, but to justify a parallel trends assumption.

Thus in this project, Census variables will be used cautiously for the right purpose. They could inform the selection of matching subsets, or they could be included as controls if they do indeed recover parallel trends.

The function get_acs can be used to query the ACS5 variables,

SF20 <- get_acs(geography = "tract", variables = "B01003_001", state = "CA", 
                county = "San Francisco", year = 2020, geometry = TRUE)

The function can be wrapped inside a map_dfr (or the parallel version future_map_dfr) to get multiple variables from multiple years in one go,

# requires: tidycensus, furrr (future_map_dfr), dplyr
get.census <- function(state.county, geography, years, variables, geometry = FALSE,
                       survey = "acs5", acs = TRUE, years.id = "year"){
  variables <- unname(variables)
  if (acs){
    temp <- future_map_dfr(
      years,
      ~ get_acs(
        geography = geography,
        variables = variables,
        state = state.county[[1]],
        county = state.county[[2]],
        year = .x,
        survey = "acs5",
        geometry = geometry,
        cache_table = TRUE
      ),
      .id = eval(years.id) # add id var (name of list item) when combining
    ) %>%
      arrange(variable, GEOID) %>% 
      mutate(year = as.numeric(year) + min(years) - 1) # convert the .id list index back to calendar year
  } else { # 2000, 2010, or 2020 only
    temp <- get_decennial(
      geography = geography,
      variables = variables,
      state = state.county[[1]],
      county = state.county[[2]],
      year = years,
      geometry = geometry
    )
  }
  return(temp)
}
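A single call might look like the following (a sketch; without a future plan it simply runs sequentially):

# population of San Francisco tracts, 2018-2019
sf.pop <- get.census(list("CA", "San Francisco"), geography = "tract",
                     years = 2018:2019, variables = c("B01003_001"))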

This function can further be wrapped inside an lapply to loop through a list of states and counties; the multisession plan can also be set here. The wrapper returns cleaned data in wide format,

# requires: future, parallelly, tictoc, data.table, dplyr
get.census.list <- function(s.c.list, geography, years, variables, geometry = FALSE){
  variables <- unname(variables)
  plan(multisession, workers = parallelly::availableCores())
  tic()
  data.list <- lapply(s.c.list, function(x){
    print(x)
    temp <- get.census(x, geography, years, variables, geometry = geometry) %>%
      select(year, GEOID, variable, estimate)
    temp <- dcast(setDT(temp), year+GEOID~variable, value.var = "estimate")
  })
  toc()
  return(rbindlist(data.list))
}

Geographic crosswalk

As instructed by NHGIS, any variable to be crosswalked should start from the most granular (lowest) level available: if a variable is available at the block (group) level, that level should be used instead of the tract level to maximize accuracy.

Important: Start from the Lowest Level You Can

To transform summary data from one census’s geographic units to another’s, it’s important to start from the lowest possible level.

This section provides a fully worked demo of querying 2020 block group count variables and crosswalking them to 2010 tracts, together with all the necessary code. In simplest terms, a geo-crosswalk is a weighted average that proportionately accounts for differences in geographical units. The differences can arise from redrawn boundaries (e.g., 2010 Census tracts vs. 2020 Census tracts) or from crossing levels (e.g., 2010 block groups vs. 2010 counties). For example, a panel covering 2015 to 2025 would span two vintages of Census boundaries: ACS5 estimates for 2015-2019 are reported in 2010-vintage tracts, while estimates from 2020 onward are reported in 2020-vintage tracts, so the two sets of estimates refer to differently drawn neighborhoods. The geo-crosswalk harmonizes the data into one fixed geographical unit of analysis.
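As a toy illustration of the weighted average (made-up numbers, not real Census figures): if a 2020 block group with 1,000 residents overlaps two 2010 tracts with interpolation weights 0.7 and 0.3, the crosswalk assigns 700 residents to the first tract and 300 to the second, and each 2010 tract's total is the sum of such contributions over all overlapping source geographies.

# toy example: one source block group split across two target tracts
weights <- c(tract.A = 0.7, tract.B = 0.3) # interpolation weights (made up)
source.pop <- 1000                         # count in the source block group
source.pop * weights                       # contributions: tract.A = 700, tract.B = 300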

[Figure: Census tract level population in San Francisco comparison]

Suppose a project requires all analysis to be done at the 2010 tract level, and two variables available at the block group level have to be queried for New York City and Chicago and crosswalked when necessary (no crosswalk is needed for the 2010-2019 estimates). The variables can be obtained using get.census.list shown above (note the nested structure of the state.county list). Suppose only data from 2018 to 2021 are needed: the earlier two years can be queried at the tract level directly, while the latter two should be queried at the block group level (the lowest level available) and then crosswalked,

# get population and aggregate income
vars <- c("B01003_001", "B19313_001")
# construct list of state-counties
state.county <- list(list("NY", list("New York", "Queens", "Kings", "Bronx", "Richmond")),
                     list("IL", list("Cook", "DuPage")))

# query variables separately, with intention to crosswalk
data.10s <- get.census.list(state.county, "tract", 2018:2019, vars)
data.20s.bg <- get.census.list(state.county, "block group", 2020:2021, vars)

With the 2020s data at the block group level, a “2020 Block Groups → 2010 Census Tracts” crosswalk file from NHGIS should be used. The crosswalk file contains these essential elements:

  • bg2020ge: GEOIDs of 2020 block groups; these are the “source” or “starting” geographies

  • tr2010ge: GEOIDs of 2010 census tracts; these are the “target” or “ending” geographies

  • Interpolation weights: the weights used to calculate the “weighted average”. The crosswalk file offers several weights, including:

    • wt_pop: expected proportion of the source geography’s population that falls in the target geography

    • wt_fam: expected proportion of the source geography’s families that fall in the target geography

The choice of weight depends on the nature of the variables; ideally, the chosen weight reflects the spatial distribution of the variables of interest.

bg2020gj         bg2020ge      tr2010gj        tr2010ge     wt_pop     wt_adult   wt_fam     wt_hh      wt_hu
G26009902253003  260992253003  G2600990225300  26099225300  1.0000000  1.0000000  1.0000000  1.0000000  1.0000000
G54008100008043  540810008043  G5400810000804  54081000804  1.0000000  1.0000000  1.0000000  1.0000000  1.0000000
G41003109603041  410319603041  G4100170000700  41017000700  0.0012320  0.0011179  0.0031898  0.0023923  0.0022272
G22007100017462  220710017462  G2200710001746  22071001746  1.0000000  1.0000000  1.0000000  1.0000000  1.0000000
G34002100010004  340210010004  G3400210000900  34021000900  0.0035222  0.0035222  0.0035222  0.0035222  0.0035222
G21023509201003  212359201003  G2102350920100  21235920100  1.0000000  1.0000000  1.0000000  1.0000000  1.0000000
G05011509513023  051159513023  G0501150951300  05115951300  1.0000000  1.0000000  1.0000000  1.0000000  1.0000000
G04001308106003  040138106003  G0400130810600  04013810600  1.0000000  1.0000000  1.0000000  1.0000000  1.0000000
G37010700114001  371070114001  G3701030920300  37103920300  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
G47014900411042  471490411042  G4701490041101  47149041101  1.0000000  1.0000000  1.0000000  1.0000000  1.0000000

Leading zeros

When reading crosswalk files, be sure to keep leading zeros of GEOIDs (which are really character strings, not integers). For fread, you may use

formals(fread)$keepLeadingZeros <- getOption("datatable.keepLeadingZeros", TRUE)

where the formals function changes the default argument values of fread; you might want to reset the value back to FALSE after reading the GEOIDs.
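For example (the file name is hypothetical, and the object cw.bg2020.tr2010 is reused below):

# keep leading zeros while reading the crosswalk, then reset the default
formals(fread)$keepLeadingZeros <- getOption("datatable.keepLeadingZeros", TRUE)
cw.bg2020.tr2010 <- fread("nhgis_bg2020_tr2010.csv") # hypothetical file name
formals(fread)$keepLeadingZeros <- FALSE             # reset as suggested above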

The function that carries out the weighted average calculation is provided in full below, and it takes the following arguments:

  • crosswalk.file: a standard crosswalk file with source GEOID, target GEOID and interpolation weights

  • col.start: column name of the source column in the crosswalk file

  • col.end: column name of the target column in the crosswalk file

  • col.weight: column name of the interpolation weight column

  • data.walk: data that needs to be crosswalked (such as the data.20s.bg queried above)

  • cols.walk: a single column/variable name, or a vector of names, to be crosswalked

  • col.year: column name of the year column, in case data.walk is in long format and contains multiple years

# requires: data.table, dplyr, rlang
census.crosswalk <- function(crosswalk.file, col.start, col.end, col.weight, 
                             data.walk, cols.walk, col.year = NULL){
  crosswalk.file <- as.data.table(crosswalk.file)
  data.walk <- as.data.table(data.walk)
  cols.walk <- unname(cols.walk) # just in case a named vector is passed
  
  # check data integrity
  if (any(!cols.walk %in% names(data.walk))){stop("Missing crosswalk columns!")}
  if (!("GEOID" %in% colnames(data.walk))){stop("GEOID not found in data.walk")}
  if (!all(unique(data.walk$GEOID) %in% crosswalk.file[, get(col.start)])){
    stop("GEOID does not exist in crosswalk.file, check if crosswalk file is correct")}
  if (!any(unique(data.walk$GEOID) %in% crosswalk.file[, get(col.end)])){
    cat("Note: Zero GEOID in col.end (expected for crosswalks across levels) \n")}
  if (!is.null(col.year)){
    years.mod = sort(unique(data.walk[, get(col.year)])) %% 10
    if (!all(sort(years.mod) == years.mod)){cat("Warning: Years not from same decade \n")}
  }
  
  
  # get desired group_by keys; year could be NULL
  cols.group <- c(eval(col.year), eval(col.end))
  
  temp <- crosswalk.file %>% 
    # filter only relevant GEOID
    filter(!!rlang::sym(col.start) %in% data.walk$GEOID) %>%
    # select starting GEOID, ending GEOID, and weight columns
    dplyr::select(!!rlang::sym(col.start), !!rlang::sym(col.end), !!rlang::sym(col.weight)) %>%
    # right join data.walk and look up estimates of corresponding col.start
    right_join(data.walk, by = join_by(!!rlang::sym(col.start) == GEOID), 
               keep = TRUE, relationship = "many-to-many") %>%
    # mutate/crosswalked variables; basically calculating a weighted average
    mutate_at(cols.walk, ~.x * get(col.weight)) %>%
    # aggregate back to a single row per GEOID per year
    group_by(across(any_of(cols.group))) %>%
    dplyr::summarise_at(cols.walk, ~round(sum(.x, na.rm = TRUE))) %>% 
    # rename GEOID column to "GEOID"
    rename_with(~ "GEOID", !!rlang::sym(col.end))
  
  return(temp)
}
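Before walking through the internals, here is how the block group data queried earlier might be crosswalked with this function (a sketch; cw.bg2020.tr2010 is the NHGIS crosswalk file read as in the note above):

# crosswalk 2020/2021 block group counts to 2010 tracts using population weights
data.20s <- census.crosswalk(cw.bg2020.tr2010,
                             col.start = "bg2020ge", col.end = "tr2010ge",
                             col.weight = "wt_pop",
                             data.walk = data.20s.bg, cols.walk = vars,
                             col.year = "year")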

The upper part of the function performs a couple of validity checks to ensure that the proper information is supplied; the line

# get desired group_by keys; year could be NULL
cols.group <- c(eval(col.year), eval(col.end))

identifies the keys used in group_by (see group_by(across(any_of(cols.group))) below). The lower chunk of code performs a right_join of crosswalk.file and data.walk, keeping all rows in data.walk (the “right” data table); the two tables are joined by matching the source GEOID in the crosswalk file to the GEOID in data.walk, as these are exactly the GEOIDs that need to be “turned into” the target GEOIDs. In the join, relationship is set to "many-to-many" precisely so that each source GEOID is matched to all of its target GEOIDs, as one source GEOID may have weight 0.5 on target GEOID 1, weight 0.3 on target GEOID 2, weight 0.2 on target GEOID 3, and so on. The data at this step will appear as follows:

# assuming validity checks are all passed
temp <- crosswalk.file %>%
    # filter only relevant GEOID
    filter(!!rlang::sym(col.start) %in% data.walk$GEOID) %>%
    # select starting GEOID, ending GEOID, and weight columns
    dplyr::select(!!rlang::sym(col.start), !!rlang::sym(col.end), !!rlang::sym(col.weight)) %>%
    # right join data.walk and look up estimates of corresponding col.start
    right_join(data.walk, by = join_by(!!rlang::sym(col.start) == GEOID), 
               keep = TRUE, relationship = "many-to-many") %>%
    # mutate/crosswalked variables; basically calculating a weighted average
    mutate_at(cols.walk, ~.x * get(col.weight))

bg2020ge      tr2010ge     wt_pop     year  GEOID           B01003_001    B19313_001
170438418023  17043841403  0.0099870  2020  170438418023    14.6709490     650914.75
170438418023  17043841403  0.0099870  2021  170438418023    15.4199763     707614.12
170438418023  17043841802  0.9900130  2020  170438418023  1454.3290510   64525085.25
170438418023  17043841802  0.9900130  2021  170438418023  1528.5800237   70145685.88
170438419011  17043841901  0.9991683  2020  170438419011  1219.9844781   71993771.98
170438419011  17043841901  0.9991683  2021  170438419011  1343.8813456   82107553.21
170438419011  17043842000  0.0008317  2020  170438419011     1.0155219      59928.02
170438419011  17043842000  0.0008317  2021  170438419011     1.1186544      68346.79
170438419012  17043841901  1.0000000  2020  170438419012  1478.0000000   94718100.00
170438419012  17043841901  1.0000000  2021  170438419012  1509.0000000  109408500.00
170438419021  17043841902  0.9982555  2020  170438419021  1457.4529844   45351744.20
170438419021  17043841902  0.9982555  2021  170438419021  1437.4878751   50558544.90
170438419021  17043842200  0.0002732  2020  170438419021     0.3988285      12410.40
170438419021  17043842200  0.0002732  2021  170438419021     0.3933651      13835.22
170438419021  17043842300  0.0014714  2020  170438419021     2.1481869      66845.40
170438419021  17043842300  0.0014714  2021  170438419021     2.1187597      74519.87
170438419022  17043841902  1.0000000  2020  170438419022  2764.0000000   50470200.00
170438419022  17043841902  1.0000000  2021  170438419022  2765.0000000   51279100.00
170438420001  17043842000  0.9981638  2020  170438420001  1209.7745481   80541439.26
170438420001  17043842000  0.9981638  2021  170438420001  1096.9820366   88056115.57

# mutate/crosswalked variables; basically calculating a weighted average
mutate_at(cols.walk, ~.x * get(col.weight))

This line mutates all columns passed to cols.walk by multiplying them by the interpolation weight column (wt_pop here); this is the first part of the weighted average calculation, and the subsequent group_by and summarise with sum take care of the “summation” part. The resulting data set represents the 2020 and 2021 population (B01003_001) and aggregate income (B19313_001), but in 2010 Census tracts.

year  GEOID        B01003_001  B19313_001
2020  17031010100        4644   185453100
2020  17031010201        7488   172679400
2020  17031010202        3154    70535900
2020  17031010300        6135   212985000
2020  17031010400        5193   124331800
2020  17031010501        4039   128187900
2020  17031010502        2960    80647000
2020  17031010503        1973    38529700
2020  17031010600        6363   247171600
2020  17031010701        3562   111193500
2020  17031010702        5209   117978100
2020  17031020100        3742   123427200
2020  17031020200        6980   259245900
2020  17031020301        5447   163901700
2020  17031020302        5116   222301500
2020  17031020400        4011   170245300
2020  17031020500        6744   115295500
2020  17031020601        6112   172181800
2020  17031020602        5239   151466200
2020  17031020701        1764    76773200

Using the same code, crosswalks from the 2000s vintage, across different levels of nested geographies, or with data from multiple states/counties or variables can easily be done by changing only the relevant items, such as supplying the appropriate crosswalk file.
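With both periods now expressed in 2010 tracts, the two queries can be stacked into one panel; a minimal sketch, assuming the crosswalked object data.20s from the call above and naming the result census.data to match the next section:

# combine 2018-2019 (already in 2010 tracts) with the crosswalked 2020-2021 estimates
census.data <- rbindlist(list(data.10s, data.20s), use.names = TRUE)
setorder(census.data, GEOID, year)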

Do NOT crosswalk ratio or median variables

Again, the idea of crosswalking is to distribute a variable from one geography to another based on interpolation weights, so it does not make sense to crosswalk variables that are not counts of some sort. You can test this by trying to crosswalk “per capita income” B19301_001, and you will see that the results are unreasonable. The same applies to median variables.

Variable operations

Once the data is queried (and stored in an object called census.data), any desired ratio variables or other transformations can be computed, and variables can be given easy-to-understand common.names. To avoid coding mistakes/oversights and to streamline the workflow, it is strongly advised that the calculations be done dynamically (i.e., without hard-coding the operations or writing one line of code for each calculation). Taking the queried Census data from above, the code below finishes the task dynamically; although it is inevitable that the numerator and denominator of each ratio be recorded somewhere, the procedure should minimize the chance of error.

  1. Suppose the variables taken from load_variables are stored in a .csv file; a new column common.name can be created to store user-defined names that will replace the Census variable names in a later step.

  2. For each new variable to be created, store the numerator and denominator needed in the column label (or any other column deemed fit), leaving the name column empty. The format should be [numerator][separator][denominator], where ... is used here (because ÷ looks like three dots). To accommodate other operations, including ones involving more than two variables, you can use a distinct separator for each type of operation and adapt the code accordingly. Also note that each ratio should be defined very close to its “ingredients” to prevent mistakes when copy-pasting.

  3. Create new variables by looping through any rows with a common.name but without a name. The code can be parallelized, but given that this is a small-scale column-wise operation, significant performance gains are unlikely.
# include this line if you have not already set proper na.strings
formals(fread)$na.strings <- c("", "NA", "N/A", "(null)", "NULL", "null")

census.vars <- fread("census.variables.csv")

for (x in census.vars[is.na(census.vars$name),]$common.name){
  row <- census.vars[census.vars$common.name == x, ]
  
  # variables separated by ... are to be divided
  # load the "zeallot" package to use the %<-% operator
  c(numer, denom) %<-% strsplit(row$label, split = "\\.\\.\\.")[[1]]
  census.data <- census.data %>%
    mutate(!!rlang::sym(x) := get(numer) / get(denom))
}

# clean up NaN's produced by zero dividing by zero
census.data <- census.data %>% mutate_all(~ifelse(is.nan(.), NA, .))
na.strings

When reading the csv file, to ensure that an empty cell will be interpreted as NA, make sure to set the na.strings (or equivalent) of your reading function to include "". You can also read one or more files by matching a pattern in the file names and searching recursively under the working directory with list.files and recursive = TRUE.
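For example (the pattern is only illustrative):

# read every matching csv under the working directory, searching subfolders too
files <- list.files(pattern = "census\\.variables.*\\.csv$", recursive = TRUE)
census.vars <- rbindlist(lapply(files, fread), use.names = TRUE, fill = TRUE)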

  4. Set the original variables to their common.names. The following code assumes that every variable has a common.name, so any variable with a non-NA name can be matched to one. If this is not the case, simply loop through the variables that need to be renamed instead.
census.data <- census.data %>% 
  setnames(census.vars[!is.na(census.vars$name),]$name,
           census.vars[!is.na(census.vars$name),]$common.name)

Steps 3 and 4 can be interchanged, but doing so adds a layer of dependency, as the label column would need to be updated every time a common.name is changed, while name is essentially never going to change.

Copyright © April 2025. Hung Kit Chiu. Please do not circulate outside of UCSB Econ.