Census Data

Published: May 5, 2025
Modified: May 11, 2025

Features of Census data to note
  • The most popular surveys are the 1-Year and 5-Year American Community Survey (ACS) and the Decennial Census. This project mainly utilizes ACS5, which is available at the Census tract level from 2009 onward through the Census API. Earlier surveys (ACS1, starting from 2005) are only publicly available for geographies with larger populations, typically the county level.

  • The Census Bureau redraws boundaries every decade to reflect changes in population; thus Census tracts are delineated differently every 10 years, and data from different decades have to be “crosswalked”. This page provides the code needed to query and crosswalk Census data. For details on geo-crosswalks, also see the guidelines by NHGIS. NHGIS provides the geo-crosswalk files for mapping data from the 2000s and 2020s to the 2010s vintage, which will be the baseline delineation of this project.

  • In light of the need for geo-crosswalks, only count variables should be queried. For example, per capita income B19301_001 equals aggregate income B19313_001 divided by total population B01003_001, but the former cannot be crosswalked, and neither can any other ratio or percentage variable. For these variables, the corresponding numerators and denominators should be queried, crosswalked when needed, and then divided to recover the desired ratio (see the short sketch after this list). By the same token, median variables should also be avoided.
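As a minimal sketch (assuming dplyr is loaded and the counts have already been queried and, if needed, crosswalked into a wide data frame acs.data, a hypothetical name), the ratio is recovered only at the very end:

# recover per capita income from counts; never crosswalk B19301_001 directly
acs.data <- acs.data %>%
  mutate(income.per.capita = B19313_001 / B01003_001)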

tidycensus API

Census data can be queried directly in R using the package tidycensus (see its documentation). To see the complete list of variables that can be queried through the Census API, use load_variables,

census.vars <- load_variables(2010, "acs5")
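The returned table has columns such as name, label, and concept, so candidate variables can be found with ordinary string matching; a small sketch (the search term is only an example, and dplyr is assumed to be loaded):

# search the variable dictionary for rent-related concepts
census.vars %>% filter(grepl("rent", concept, ignore.case = TRUE))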

Some variables that could be of interest to demographic and economic research include:

  • B01003_001: Total population

  • B19313_001: Aggregate income

  • B19301_001: Per capita income

  • B25065_001: Aggregate gross rent

  • B25063_001: Total (the number of renter households, under the concept “Gross Rent”)

Note that ratio variables such as “per capita income” B19301_001 are shown for demonstration only and will not be used. The population of a block group/tract, B01003_001 (almost the same variable as B03001_001), enables the calculation of crime rates and per capita figures, such as dividing “Aggregate income” B19313_001 by it. However, when calculating averages, there is typically a specific denominator for a given “concept”, and it is not necessarily the total population. For example, to get average rent, one should divide “Aggregate gross rent” B25065_001 by “Total” B25063_001, which is the total number of renters and the appropriate denominator for the concept “Gross Rent”, despite the label suggesting otherwise.
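For instance, a minimal sketch of recovering average gross rent from the two aggregates (assuming dplyr is loaded and both variables sit in a wide data frame rent.data, a hypothetical name):

# average gross rent = aggregate gross rent / number of renter households
rent.data <- rent.data %>%
  mutate(avg.gross.rent = B25065_001 / B25063_001)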

“Controls in Difference-in-Differences Don’t Just Work”

It should be noted that in a Difference-in-Differences setting, controls “don’t just work” and should not be treated the same way as one would in a linear regression setting. A detailed exposition by Nick Huntington-Klein illustrates the idea clearly.

You’re not really in the business of trying to account for differences between treated and control groups … we rely on the parallel trends assumption: if treatment had not occurred, the gap between treated and control groups would remain constant over time. And that’s what controls are for. Not to help justify a conditional independence assumption, but to justify a parallel trends assumption.

Thus in this project, Census variables will be used cautiously for the right purpose. They could inform the selection of matching subsets, or they could be included as controls if they do indeed recover parallel trends.

The function get_acs can be used to query the ACS5 variables,

SF20 <- get_acs(geography = "tract", variables = "B01003_001", state = "CA", 
                county = "San Francisco", year = 2020, geometry = TRUE)

The function can be wrapped inside a map_dfr (or the parallel version future_map_dfr) to get multiple variables from multiple years in one go,

# requires: tidycensus, furrr (future_map_dfr), dplyr
get.census <- function(state.county, geography, years, variables, geometry = FALSE,
                       survey = "acs5", acs = TRUE, years.id = "year"){
  variables <- unname(variables)
  if (acs){
    temp <- future_map_dfr(
      years,
      ~ get_acs(
        geography = geography,
        variables = variables,
        state = state.county[[1]],
        county = state.county[[2]],
        year = .x,
        survey = "acs5",
        geometry = geometry,
        cache_table = TRUE
      ),
      .id = eval(years.id) # add id var (name of list item) when combining
    ) %>%
      arrange(variable, GEOID) %>% 
      mutate(year = as.numeric(year) + min(years) - 1) # convert the .id list index back to calendar year
  } else { # 2000, 2010, or 2020 only
    temp <- get_decennial(
      geography = geography,
      variables = variables,
      state = state.county[[1]],
      county = state.county[[2]],
      year = years,
      geometry = geometry
    )
  }
  return(temp)
}
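A single call might look like the following (a sketch; without a future plan it simply runs sequentially):

# population of San Francisco tracts, 2018-2019
sf.pop <- get.census(list("CA", "San Francisco"), geography = "tract",
                     years = 2018:2019, variables = c("B01003_001"))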

This function can further be wrapped inside an lapply to loop through a list of states and counties; the multisession plan can also be set here. The wrapper returns cleaned data in wide format,

# requires: future, parallelly, tictoc, data.table, dplyr
get.census.list <- function(s.c.list, geography, years, variables, geometry = FALSE){
  variables <- unname(variables)
  plan(multisession, workers = parallelly::availableCores())
  tic()
  data.list <- lapply(s.c.list, function(x){
    print(x)
    temp <- get.census(x, geography, years, variables, geometry = geometry) %>%
      select(year, GEOID, variable, estimate)
    temp <- dcast(setDT(temp), year+GEOID~variable, value.var = "estimate")
  })
  toc()
  return(rbindlist(data.list))
}

Geographic crosswalk

As instructed by NHGIS, any variable to be crosswalked should start from the most granular (lowest) level available: if a variable is available at the block (group) level, that level should be used instead of the tract level to maximize accuracy.

Important: Start from the Lowest Level You Can

To transform summary data from one census’s geographic units to another’s, it’s important to start from the lowest possible level.

This section provides a fully worked demo of querying 2020 block group count variables and crosswalking them to 2010 tracts, together with all the necessary code. In simplest terms, a geo-crosswalk is a weighted average that proportionately accounts for differences in geographical units. The differences can arise from redrawn boundaries (e.g., 2010 Census tracts vs. 2020 Census tracts) or from crossing levels (e.g., 2010 block groups vs. 2010 counties). For example, a panel covering 2015 to 2025 would span two vintages of Census boundaries: ACS5 estimates for 2015-2019 are reported in 2010-vintage tracts, while estimates from 2020 onward are reported in 2020-vintage tracts, so the two sets of estimates refer to differently drawn neighborhoods. The geo-crosswalk harmonizes the data into one fixed geographical unit of analysis.
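As a toy illustration of the weighted average (made-up numbers, not real Census figures): if a 2020 block group with 1,000 residents overlaps two 2010 tracts with interpolation weights 0.7 and 0.3, the crosswalk assigns 700 residents to the first tract and 300 to the second, and each 2010 tract's total is the sum of such contributions over all overlapping source geographies.

# toy example: one source block group split across two target tracts
weights <- c(tract.A = 0.7, tract.B = 0.3) # interpolation weights (made up)
source.pop <- 1000                         # count in the source block group
source.pop * weights                       # contributions: tract.A = 700, tract.B = 300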

[Figure: Census tract level population in San Francisco comparison]

Suppose a project requires all analysis to be done at the 2010 tract level, and two variables available at the block group level have to be queried for New York City and Chicago and crosswalked when necessary (no crosswalk is needed for the 2010-2019 estimates). The variables can be obtained using get.census.list shown above (note the nested structure of the state.county list). Suppose only data from 2018 to 2021 are needed: the earlier two years can be queried at the tract level directly, while the latter two should be queried at the block group level (the lowest level available) and then crosswalked,

# get population and aggregate income
vars <- c("B01003_001", "B19313_001")
# construct list of state-counties
state.county <- list(list("NY", list("New York", "Queens", "Kings", "Bronx", "Richmond")),
                     list("IL", list("Cook", "DuPage")))

# query variables separately, with intention to crosswalk
data.10s <- get.census.list(state.county, "tract", 2018:2019, vars)
data.20s.bg <- get.census.list(state.county, "block group", 2020:2021, vars)

With the 2020s data at the block group level, a “2020 Block Groups → 2010 Census Tracts” crosswalk file from NHGIS should be used. The crosswalk file contains these essential elements:

  • bg2020ge: GEOIDs of 2020 block groups; these are the “source” or “starting” geographies

  • tr2010ge: GEOIDs of 2010 census tracts; these are the “target” or “ending” geographies

  • Interpolation weights: the weights used to calculate the “weighted average”. The crosswalk file offers several weights, including:

    • wt_pop: expected proportion of the source geography’s population that falls in the target geography

    • wt_fam: expected proportion of the source geography’s families that fall in the target geography

The choice of weight depends on the nature of the variables; ideally, the chosen weight reflects the spatial distribution of the variables of interest.

bg2020gj         bg2020ge      tr2010gj        tr2010ge     wt_pop     wt_adult   wt_fam     wt_hh      wt_hu
G26009902253003  260992253003  G2600990225300  26099225300  1.0000000  1.0000000  1.0000000  1.0000000  1.0000000
G54008100008043  540810008043  G5400810000804  54081000804  1.0000000  1.0000000  1.0000000  1.0000000  1.0000000
G41003109603041  410319603041  G4100170000700  41017000700  0.0012320  0.0011179  0.0031898  0.0023923  0.0022272
G22007100017462  220710017462  G2200710001746  22071001746  1.0000000  1.0000000  1.0000000  1.0000000  1.0000000
G34002100010004  340210010004  G3400210000900  34021000900  0.0035222  0.0035222  0.0035222  0.0035222  0.0035222
G21023509201003  212359201003  G2102350920100  21235920100  1.0000000  1.0000000  1.0000000  1.0000000  1.0000000
G05011509513023  051159513023  G0501150951300  05115951300  1.0000000  1.0000000  1.0000000  1.0000000  1.0000000
G04001308106003  040138106003  G0400130810600  04013810600  1.0000000  1.0000000  1.0000000  1.0000000  1.0000000
G37010700114001  371070114001  G3701030920300  37103920300  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
G47014900411042  471490411042  G4701490041101  47149041101  1.0000000  1.0000000  1.0000000  1.0000000  1.0000000

Leading zeros

When reading crosswalk files, be sure to keep leading zeros of GEOIDs (which are really character strings, not integers). For fread, you may use

formals(fread)$keepLeadingZeros <- getOption("datatable.keepLeadingZeros", TRUE)

where the formals function changes the default argument values of fread; you might want to reset the value back to FALSE after reading the GEOIDs.
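For example (the file name is hypothetical, and the object cw.bg2020.tr2010 is reused below):

# keep leading zeros while reading the crosswalk, then reset the default
formals(fread)$keepLeadingZeros <- getOption("datatable.keepLeadingZeros", TRUE)
cw.bg2020.tr2010 <- fread("nhgis_bg2020_tr2010.csv") # hypothetical file name
formals(fread)$keepLeadingZeros <- FALSE             # reset as suggested above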

The function that carries out the weighted average calculation is provided in full below, and it takes the following arguments:

  • crosswalk.file: a standard crosswalk file with source GEOID, target GEOID and interpolation weights

  • col.start: column name of the source column in the crosswalk file

  • col.end: column name of the target column in the crosswalk file

  • col.weight: column name of the interpolation weight column

  • data.walk: data that needs to be crosswalked (such as the data.20s.bg queried above)

  • cols.walk: a single column/variable name, or a vector of names, to be crosswalked

  • col.year: column name of the year column, in case data.walk is in long format and contains multiple years

# requires: data.table, dplyr, rlang
census.crosswalk <- function(crosswalk.file, col.start, col.end, col.weight, 
                             data.walk, cols.walk, col.year = NULL){
  crosswalk.file <- as.data.table(crosswalk.file)
  data.walk <- as.data.table(data.walk)
  cols.walk <- unname(cols.walk) # just in case a named vector is passed
  
  # check data integrity
  if (any(!cols.walk %in% names(data.walk))){stop("Missing crosswalk columns!")}
  if (!("GEOID" %in% colnames(data.walk))){stop("GEOID not found in data.walk")}
  if (!all(unique(data.walk$GEOID) %in% crosswalk.file[, get(col.start)])){
    stop("GEOID does not exist in crosswalk.file, check if crosswalk file is correct")}
  if (!any(unique(data.walk$GEOID) %in% crosswalk.file[, get(col.end)])){
    cat("Note: Zero GEOID in col.end (expected for crosswalks across levels) \n")}
  if (!is.null(col.year)){
    years.mod = sort(unique(data.walk[, get(col.year)])) %% 10
    if (!all(sort(years.mod) == years.mod)){cat("Warning: Years not from same decade \n")}
  }
  
  
  # get desired group_by keys; year could be NULL
  cols.group <- c(eval(col.year), eval(col.end))
  
  temp <- crosswalk.file %>% 
    # filter only relevant GEOID
    filter(!!rlang::sym(col.start) %in% data.walk$GEOID) %>%
    # select starting GEOID, ending GEOID, and weight columns
    dplyr::select(!!rlang::sym(col.start), !!rlang::sym(col.end), !!rlang::sym(col.weight)) %>%
    # right join data.walk and look up estimates of corresponding col.start
    right_join(data.walk, by = join_by(!!rlang::sym(col.start) == GEOID), 
               keep = TRUE, relationship = "many-to-many") %>%
    # mutate/crosswalked variables; basically calculating a weighted average
    mutate_at(cols.walk, ~.x * get(col.weight)) %>%
    # aggregate back to a single row per GEOID per year
    group_by(across(any_of(cols.group))) %>%
    dplyr::summarise_at(cols.walk, ~round(sum(.x, na.rm = TRUE))) %>% 
    # rename GEOID column to "GEOID"
    rename_with(~ "GEOID", !!rlang::sym(col.end))
  
  return(temp)
}
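Before walking through the internals, here is how the block group data queried earlier might be crosswalked with this function (a sketch; cw.bg2020.tr2010 is the NHGIS crosswalk file read as in the note above):

# crosswalk 2020/2021 block group counts to 2010 tracts using population weights
data.20s <- census.crosswalk(cw.bg2020.tr2010,
                             col.start = "bg2020ge", col.end = "tr2010ge",
                             col.weight = "wt_pop",
                             data.walk = data.20s.bg, cols.walk = vars,
                             col.year = "year")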

The upper part of the function performs a couple of validity checks to ensure that the proper information is supplied; the line

# get desired group_by keys; year could be NULL
cols.group <- c(eval(col.year), eval(col.end))

identifies the keys used in group_by (see group_by(across(any_of(cols.group))) below). The lower chunk of code performs a right_join of crosswalk.file and data.walk, keeping all rows in data.walk (the “right” data table); the two tables are joined by matching the source GEOID in the crosswalk file to the GEOID in data.walk, as these are exactly the GEOIDs that need to be “turned into” the target GEOIDs. In the join, relationship is set to "many-to-many" precisely so that each source GEOID is matched to all of its target GEOIDs, as one source GEOID may have weight 0.5 on target GEOID 1, weight 0.3 on target GEOID 2, weight 0.2 on target GEOID 3, and so on. The data at this step will appear as follows:

# assuming validity checks are all passed
temp <- crosswalk.file %>%
    # filter only relevant GEOID
    filter(!!rlang::sym(col.start) %in% data.walk$GEOID) %>%
    # select starting GEOID, ending GEOID, and weight columns
    dplyr::select(!!rlang::sym(col.start), !!rlang::sym(col.end), !!rlang::sym(col.weight)) %>%
    # right join data.walk and look up estimates of corresponding col.start
    right_join(data.walk, by = join_by(!!rlang::sym(col.start) == GEOID), 
               keep = TRUE, relationship = "many-to-many") %>%
    # mutate/crosswalked variables; basically calculating a weighted average
    mutate_at(cols.walk, ~.x * get(col.weight))

bg2020ge      tr2010ge     wt_pop     year  GEOID           B01003_001    B19313_001
170438418023  17043841403  0.0099870  2020  170438418023    14.6709490     650914.75
170438418023  17043841403  0.0099870  2021  170438418023    15.4199763     707614.12
170438418023  17043841802  0.9900130  2020  170438418023  1454.3290510   64525085.25
170438418023  17043841802  0.9900130  2021  170438418023  1528.5800237   70145685.88
170438419011  17043841901  0.9991683  2020  170438419011  1219.9844781   71993771.98
170438419011  17043841901  0.9991683  2021  170438419011  1343.8813456   82107553.21
170438419011  17043842000  0.0008317  2020  170438419011     1.0155219      59928.02
170438419011  17043842000  0.0008317  2021  170438419011     1.1186544      68346.79
170438419012  17043841901  1.0000000  2020  170438419012  1478.0000000   94718100.00
170438419012  17043841901  1.0000000  2021  170438419012  1509.0000000  109408500.00
170438419021  17043841902  0.9982555  2020  170438419021  1457.4529844   45351744.20
170438419021  17043841902  0.9982555  2021  170438419021  1437.4878751   50558544.90
170438419021  17043842200  0.0002732  2020  170438419021     0.3988285      12410.40
170438419021  17043842200  0.0002732  2021  170438419021     0.3933651      13835.22
170438419021  17043842300  0.0014714  2020  170438419021     2.1481869      66845.40
170438419021  17043842300  0.0014714  2021  170438419021     2.1187597      74519.87
170438419022  17043841902  1.0000000  2020  170438419022  2764.0000000   50470200.00
170438419022  17043841902  1.0000000  2021  170438419022  2765.0000000   51279100.00
170438420001  17043842000  0.9981638  2020  170438420001  1209.7745481   80541439.26
170438420001  17043842000  0.9981638  2021  170438420001  1096.9820366   88056115.57

# mutate/crosswalked variables; basically calculating a weighted average
mutate_at(cols.walk, ~.x * get(col.weight))

This line mutates all columns passed to cols.walk by multiplying them by the interpolation weight column (wt_pop here); this is the first part of the weighted average calculation, and the subsequent group_by and summarise with sum take care of the “summation” part. The resulting data set represents the 2020 and 2021 population (B01003_001) and aggregate income (B19313_001), but in 2010 Census tracts.

year  GEOID        B01003_001  B19313_001
2020  17031010100        4644   185453100
2020  17031010201        7488   172679400
2020  17031010202        3154    70535900
2020  17031010300        6135   212985000
2020  17031010400        5193   124331800
2020  17031010501        4039   128187900
2020  17031010502        2960    80647000
2020  17031010503        1973    38529700
2020  17031010600        6363   247171600
2020  17031010701        3562   111193500
2020  17031010702        5209   117978100
2020  17031020100        3742   123427200
2020  17031020200        6980   259245900
2020  17031020301        5447   163901700
2020  17031020302        5116   222301500
2020  17031020400        4011   170245300
2020  17031020500        6744   115295500
2020  17031020601        6112   172181800
2020  17031020602        5239   151466200
2020  17031020701        1764    76773200

Using the same code, crosswalks from the 2000s vintage, across different levels of nested geographies, or with data from multiple states/counties or variables can easily be done by changing only the relevant items, such as supplying the appropriate crosswalk file.
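With both periods now expressed in 2010 tracts, the two queries can be stacked into one panel; a minimal sketch, assuming the crosswalked object data.20s from the call above and naming the result census.data to match the next section:

# combine 2018-2019 (already in 2010 tracts) with the crosswalked 2020-2021 estimates
census.data <- rbindlist(list(data.10s, data.20s), use.names = TRUE)
setorder(census.data, GEOID, year)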

Do NOT crosswalk ratio or median variables

Again, the idea of crosswalking is to distribute a variable from one geography to another based on interpolation weights, so it does not make sense to crosswalk variables that are not counts of some sort. You can test this by trying to crosswalk “per capita income” B19301_001, and you will see that the results are unreasonable. The same applies to median variables.

Variable operations

Once the data is queried (and stored in an object called census.data), any desired ratio variables or other transformations can be computed, and variables can be given easy-to-understand common.names. To avoid coding mistakes/oversights and to streamline the workflow, it is strongly advised that the calculations be done dynamically (i.e., without hard-coding the operations or writing one line of code for each calculation). Taking the queried Census data from above, the code below finishes the task dynamically; although it is inevitable that the numerator and denominator of each ratio be recorded somewhere, the procedure should minimize the chance of error.

  1. Suppose the variables taken from load_variables are stored in a .csv file; a new column common.name can be created to store user-defined names that will replace the Census variable names in a later step.

  2. For each new variable to be created, store the numerator and denominator needed in the column label (or any other column deemed fit), leaving the name column empty. The format should be [numerator][separator][denominator], where ... is used here (because ÷ looks like three dots). To accommodate other operations, including ones involving more than two variables, you can use a distinct separator for each type of operation and adapt the code accordingly. Also note that each ratio should be defined very close to its “ingredients” to prevent mistakes when copy-pasting.

  3. Create new variables by looping through any rows with a common.name but without a name. The code can be parallelized, but given that this is a small-scale column-wise operation, significant performance gains are unlikely.
# include this line if you have not already set proper na.strings
formals(fread)$na.strings <- c("", "NA", "N/A", "(null)", "NULL", "null")

census.vars <- fread("census.variables.csv")

for (x in census.vars[is.na(census.vars$name),]$common.name){
  row <- census.vars[census.vars$common.name == x, ]
  
  # variables separated by ... are to be divided
  # load the "zeallot" package to use the %<-% operator
  c(numer, denom) %<-% strsplit(row$label, split = "\\.\\.\\.")[[1]]
  census.data <- census.data %>%
    mutate(!!rlang::sym(x) := get(numer) / get(denom))
}

# clean up NaN's produced by zero dividing by zero
census.data <- census.data %>% mutate_all(~ifelse(is.nan(.), NA, .))
na.strings

When reading the csv file, to ensure that an empty cell will be interpreted as NA, make sure to set the na.strings (or equivalent) of your reading function to include "". You can also read one or more files by matching a pattern in the file names and searching recursively under the working directory with list.files and recursive = TRUE.
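For example (the pattern is only illustrative):

# read every matching csv under the working directory, searching subfolders too
files <- list.files(pattern = "census\\.variables.*\\.csv$", recursive = TRUE)
census.vars <- rbindlist(lapply(files, fread), use.names = TRUE, fill = TRUE)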

  4. Set the original variables to their common.names. The following code assumes that every variable has a common.name, so any variable with a non-NA name can be matched to one. If this is not the case, simply loop through the variables that need to be renamed instead.
census.data <- census.data %>% 
  setnames(census.vars[!is.na(census.vars$name),]$name,
           census.vars[!is.na(census.vars$name),]$common.name)

Steps 3 and 4 can be interchanged, but doing so adds a layer of dependency, as the label column would need to be updated every time a common.name is changed, while name is essentially never going to change.

Copyright © April 2025. Hung Kit Chiu. Please do not circulate outside of UCSB Econ.