# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidycensus)
library(tidyverse)
library(knitr)
# Set your Census API key
# Choose your state for analysis - assign it to a variable called my_state
my_state <- "CA"
Lab 1: Census Data Quality for Policy Decisions
Evaluating Data Reliability for Algorithmic Decision-Making
Assignment Overview
Scenario
You are a data analyst for the California Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.
Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.
Learning Objectives
- Apply dplyr functions to real census data for policy analysis
- Evaluate data quality using margins of error
- Connect technical analysis to algorithmic decision-making
- Identify potential equity implications of data reliability issues
- Create professional documentation for policy stakeholders
Submission Instructions
Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/labs/lab_1/
Make sure to update your _quarto.yml navigation to include this assignment under a “Labs” menu.
Part 1: Portfolio Integration
Create this assignment in your portfolio repository under a labs/lab_1/ folder structure. Update your navigation menu to include:
- text: Assignments
  menu:
    - href: labs/lab_1/your_file_name.qmd
      text: "Lab 1: Census Data Exploration"
If the text contains a special character such as a colon, wrap it in double quotes so that Quarto reads it as text.
Setup
State Selection: I have chosen California for this analysis because it has a large and diverse population and well-documented demographic patterns that make it useful for census-based analysis.
Part 2: County-Level Resource Assessment
2.1 Data Retrieval
Your Task: Use get_acs() to retrieve county-level data for your chosen state.
Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide
Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.
# Write your get_acs() code here
county_data <- get_acs(
geography = "county",
variables = c(
median_income = "B19013_001",
total_population = "B01003_001"
),
state = my_state,
year = 2022,
survey = "acs5",
output = "wide"
)
# Clean the county names to remove state name and "County"
county_data <- county_data %>%
mutate(
NAME = str_remove(NAME, " County"),
NAME = str_remove(NAME, ", California")
)
# Hint: use mutate() with str_remove()
# Display the first few rows
head(county_data)
# A tibble: 6 × 6
GEOID NAME median_incomeE median_incomeM total_populationE total_populationM
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 06001 Alame… 122488 1231 1663823 NA
2 06003 Alpine 101125 17442 1515 206
3 06005 Amador 74853 6048 40577 NA
4 06007 Butte 66085 2261 213605 NA
5 06009 Calav… 77526 3875 45674 NA
6 06011 Colusa 69619 5745 21811 NA
2.2 Data Quality Assessment
Your Task: Calculate margin of error percentages and create reliability categories.
Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)
Hint: Use mutate() with case_when() for the categories.
# Calculate MOE percentage and reliability categories using mutate()
income_reliability <- county_data %>%
mutate(
income_moe_pct = (median_incomeM / median_incomeE) * 100,
income_reliability = case_when(
income_moe_pct < 5 ~ "High Confidence",
income_moe_pct >= 5 & income_moe_pct <= 10 ~ "Moderate Confidence",
income_moe_pct > 10 ~ "Low Confidence"
),
unreliable_income = income_moe_pct > 10
)
# Create a summary showing count of counties in each reliability category
income_reliability_summary <- income_reliability %>%
count(income_reliability) %>%
mutate(
percent = (n / sum(n)) * 100
)
income_reliability_summary
# A tibble: 3 × 3
income_reliability n percent
<chr> <int> <dbl>
1 High Confidence 42 72.4
2 Low Confidence 5 8.62
3 Moderate Confidence 11 19.0
# Hint: use count() and mutate() to add percentages
2.3 High Uncertainty Counties
Your Task: Identify the 5 counties with the highest MOE percentages.
Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()
Hint: Use arrange(), slice(), and select() functions.
# Create table of top 5 counties by MOE percentage
top5_high_uncertainty <- income_reliability %>%
arrange(desc(income_moe_pct)) %>%
slice(1:5) %>%
select(
county = NAME,
median_income = median_incomeE,
margin_of_error = median_incomeM,
moe_percent = income_moe_pct,
reliability = income_reliability
) %>%
mutate(
median_income = scales::dollar(median_income, accuracy = 1),
margin_of_error = scales::dollar(margin_of_error, accuracy = 1),
moe_percent = round(moe_percent, 2)
)
# Format as table with kable() - include appropriate column names and caption
top5_high_uncertainty %>%
kable(
col.names = c("County", "Median Household Income", "Margin of Error", "MOE (%)", "Reliability Category"),
caption = "Top 5 California Counties with the Highest Median Income MOE Percentages (ACS 2022 5-year)"
)
| County | Median Household Income | Margin of Error | MOE (%) | Reliability Category |
|---|---|---|---|---|
| Mono | $82,038 | $15,388 | 18.76 | Low Confidence |
| Alpine | $101,125 | $17,442 | 17.25 | Low Confidence |
| Sierra | $61,108 | $9,237 | 15.12 | Low Confidence |
| Trinity | $47,317 | $5,890 | 12.45 | Low Confidence |
| Plumas | $67,885 | $7,772 | 11.45 | Low Confidence |
Data Quality Commentary:
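Counties with high income MOE percentages are the most likely to be misrepresented, because their estimates carry the most sampling uncertainty. All five counties in the table fall in the Low Confidence category, with margins of error between roughly 11% and 19% of the estimate. Such counties are often rural, with small populations (Alpine has about 1,500 residents) or highly variable income distributions, so the ACS sample is small and estimation error is large. As a result, an algorithm that relies on these point estimates may misclassify a county's economic conditions.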
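One way to make this uncertainty comparable across counties is the coefficient of variation (CV). The sketch below assumes the published ACS margins of error are at the 90 percent confidence level, so that the standard error is the MOE divided by 1.645. The helper name `moe_to_cv` is hypothetical; the inputs are the Mono and Alameda rows from the tables above.

```r
# Hypothetical helper: convert an ACS estimate and MOE into a CV (%).
# Assumes the MOE is published at the 90% confidence level (z = 1.645).
moe_to_cv <- function(estimate, moe, z = 1.645) {
  standard_error <- moe / z
  (standard_error / estimate) * 100
}

# Median household income estimates and MOEs taken from the county tables
counties <- data.frame(
  county   = c("Mono", "Alameda"),
  estimate = c(82038, 122488),
  moe      = c(15388, 1231)
)

counties$cv_pct <- round(moe_to_cv(counties$estimate, counties$moe), 2)
print(counties)
```

The CV is a rescaled version of the MOE percentage used in this lab, so it ranks counties the same way; it is shown only because many data users state reliability cutoffs in CV terms.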
Part 3: Neighborhood-Level Analysis
3.1 Focus Area Selection
Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.
Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.
# Use filter() to select 2-3 counties from your income_reliability data
# Store the selected counties in a variable called selected_counties
selected_counties <- income_reliability %>%
  filter(income_reliability %in% c("High Confidence", "Moderate Confidence", "Low Confidence")) %>%
  group_by(income_reliability) %>%
  slice_max(income_moe_pct, n = 1) %>% # highest-MOE county within each category
  ungroup() %>%
  select(
    county = NAME,
    median_income = median_incomeE,
    moe_percent = income_moe_pct,
    reliability_category = income_reliability
  )
# Display the selected counties with their key characteristics
selected_counties
# A tibble: 3 × 4
county median_income moe_percent reliability_category
<chr> <dbl> <dbl> <chr>
1 Calaveras 77526 5.00 High Confidence
2 Mono 82038 18.8 Low Confidence
3 Modoc 54962 9.80 Moderate Confidence
# Show: county name, median income, MOE percentage, reliability category
Comment on the output: The selected counties differ markedly in data reliability. Calaveras (High Confidence) has an income MOE just under 5% of the estimate, Modoc (Moderate Confidence) 9.8%, and Mono (Low Confidence) 18.8%. This contrast shows how much data quality can vary across regions, which matters when an income-based algorithm is applied uniformly to all counties.
3.2 Tract-Level Demographics
Your Task: Get demographic data for census tracts in your selected counties.
Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.
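The county-code challenge can be sketched as follows. This is a minimal illustration, assuming the county GEOIDs are five-character strings made of a two-digit state code followed by a three-digit county code, as in the county output earlier; the vector name `county_geoids` is hypothetical.

```r
# County GEOIDs for the three selected counties: 2-digit state + 3-digit county
county_geoids <- c(Calaveras = "06009", Mono = "06051", Modoc = "06049")

# Characters 3 to 5 hold the county FIPS code that get_acs(county = ...) accepts
county_codes <- substr(county_geoids, 3, 5)
print(county_codes)
```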
# Define your race/ethnicity variables with descriptive names
race_vars <- c(
  total_population = "B03002_001",
  white_alone = "B03002_003",
  black_alone = "B03002_004",
  hispanic_latino = "B03002_012"
)
# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
tract_data <- get_acs(
  geography = "tract",
  variables = race_vars,
  state = my_state,
  county = c("009", "051", "049"), # County FIPS codes for Calaveras, Mono, and Modoc
  year = 2022,
  survey = "acs5",
  output = "wide"
)
# Calculate percentage of each group using mutate()
tract_data <- tract_data %>%
mutate(
pct_white = (white_aloneE / total_populationE) * 100,
pct_black = (black_aloneE / total_populationE) * 100,
pct_hispanic = (hispanic_latinoE / total_populationE) * 100
)
# Create percentages for white, Black, and Hispanic populations
tract_data <- tract_data %>%
mutate(
county_fips = str_sub(GEOID, 3, 5),
tract_code = str_sub(GEOID, 6, 11),
county_name = str_extract(NAME, "(?<=; )[^;]+(?=County)")
)
# Add readable tract and county name columns using str_extract() or similar
tract_data %>%
select(county_name, tract_code, total_populationE, pct_white, pct_black, pct_hispanic) %>%
head()
# A tibble: 6 × 6
county_name tract_code total_populationE pct_white pct_black pct_hispanic
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 "Calaveras " 000121 4326 83.1 0 8.30
2 "Calaveras " 000122 3297 86.7 1.15 6.28
3 "Calaveras " 000123 2940 67.9 4.05 23.8
4 "Calaveras " 000124 2152 77.5 0 19.3
5 "Calaveras " 000220 6271 67.2 0.0159 21.4
6 "Calaveras " 000221 6691 82.5 0 12.9
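The quoted values in the county_name column carry a trailing space, because the lookahead in the pattern stops at "County" rather than at the space before it. A minimal sketch of one way to avoid this is shown below; the second tract name is an illustrative string in the same format as the ACS NAME column, not copied from the output.

```r
library(stringr)

tract_names <- c(
  "Census Tract 2.01; Mono County; California",
  "Census Tract 1.21; Calaveras County; California" # illustrative name
)

# The lookahead includes the space before "County", so no trailing space is captured
county_name <- str_extract(tract_names, "(?<=; )[^;]+(?= County)")
print(county_name)
```

Wrapping the original extraction in str_trim() would be an equally reasonable alternative.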
3.3 Demographic Analysis
Your Task: Analyze the demographic patterns in your selected areas.
# Find the tract with the highest percentage of Hispanic/Latino residents
top_hispanic_tract <- tract_data %>%
arrange(desc(pct_hispanic)) %>%
slice(1) %>%
select(NAME, GEOID, county_fips, tract_code, total_populationE, pct_white, pct_black, pct_hispanic)
# Hint: use arrange() and slice() to get the top tract
top_hispanic_tract %>%
kable(
col.names = c("Tract Name", "GEOID", "County FIPS", "Tract Code", "Total Pop", "% White", "% Black", "% Hispanic/Latino"),
caption = "Tract with the Highest Percentage of Hispanic/Latino Residents (Selected Counties)"
)
| Tract Name | GEOID | County FIPS | Tract Code | Total Pop | % White | % Black | % Hispanic/Latino |
|---|---|---|---|---|---|---|---|
| Census Tract 2.01; Mono County; California | 06051000201 | 051 | 000201 | 3213 | 63.99004 | 0 | 33.70682 |
# Calculate average demographics by county using group_by() and summarize()
county_demographics_summary <- tract_data %>%
group_by(county_fips) %>%
summarize(
n_tracts = n(),
avg_pct_white = mean(pct_white, na.rm = TRUE),
avg_pct_black = mean(pct_black, na.rm = TRUE),
avg_pct_hispanic = mean(pct_hispanic, na.rm = TRUE)
) %>%
arrange(desc(avg_pct_hispanic)) %>%
mutate(
avg_pct_white = round(avg_pct_white, 2),
avg_pct_black = round(avg_pct_black, 2),
avg_pct_hispanic = round(avg_pct_hispanic, 2)
)
# Show: number of tracts, average percentage for each racial/ethnic group
county_demographics_summary %>%
kable(
col.names = c(
"County FIPS",
"Number of Tracts",
"Avg % White",
"Avg % Black",
"Avg % Hispanic/Latino"
),
caption = "Average Tract Demographics by County (Selected Counties)"
)
| County FIPS | Number of Tracts | Avg % White | Avg % Black | Avg % Hispanic/Latino |
|---|---|---|---|---|
| 051 | 4 | 64.14 | 0.23 | 27.82 |
| 049 | 4 | 76.57 | 1.43 | 15.06 |
| 009 | 14 | 80.98 | 0.93 | 11.60 |
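The county averages above are unweighted means of tract percentages, so a small tract counts as much as a large one. The sketch below contrasts that with a population-weighted mean, using only the first three Calaveras tracts from the section 3.2 output as sample input; the results are therefore illustrative and are not county totals.

```r
# First three Calaveras tracts from the section 3.2 output
tracts <- data.frame(
  total_population = c(4326, 3297, 2940),
  pct_hispanic     = c(8.30, 6.28, 23.8)
)

# Unweighted: each tract contributes equally
unweighted <- mean(tracts$pct_hispanic)

# Weighted: each tract contributes in proportion to its population
weighted <- weighted.mean(tracts$pct_hispanic, tracts$total_population)

print(round(c(unweighted = unweighted, weighted = weighted), 2))
```

Which version is appropriate depends on the question: the unweighted mean describes a typical tract, while the weighted mean approximates the share among residents of the combined tracts.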
# Create a nicely formatted table of your results using kable()
Part 4: Comprehensive Data Quality Evaluation
4.1 MOE Analysis for Demographic Variables
Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.
Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics
# Calculate MOE percentages for white, Black, and Hispanic variables
tract_data <- tract_data %>%
mutate(
white_moe_pct = (white_aloneM / white_aloneE) * 100,
black_moe_pct = (black_aloneM / black_aloneE) * 100,
hispanic_moe_pct = (hispanic_latinoM / hispanic_latinoE) * 100
)
# Hint: use the same formula as before (margin/estimate * 100)
# Create a flag for tracts with high MOE on any demographic variable
tract_data <- tract_data %>%
mutate(
high_demo_moe = ifelse(
white_moe_pct > 15 | black_moe_pct > 15 | hispanic_moe_pct > 15,
TRUE,
FALSE
)
)
# Use logical operators (| for OR) in an ifelse() statement
# Create summary statistics showing how many tracts have data quality issues
demo_moe_summary <- tract_data %>%
group_by(county_name) %>%
summarize(
total_tracts = n(),
tracts_high_moe = sum(high_demo_moe, na.rm = TRUE),
percent_high_moe = (tracts_high_moe / total_tracts) * 100
)
demo_moe_summary
# A tibble: 3 × 4
county_name total_tracts tracts_high_moe percent_high_moe
<chr> <int> <int> <dbl>
1 "Calaveras " 14 14 100
2 "Modoc " 4 4 100
3 "Mono " 4 4 100
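One possible contributor to the 100% flag rate is that several tracts have an estimate of zero for a group (the 3.2 output shows tracts with 0% Black residents). Dividing a margin of error by a zero estimate yields Inf, which always exceeds the 15% cutoff. The sketch below uses hypothetical MOE values to show the effect and one way to guard against it; it is not the method used in this lab.

```r
library(dplyr)

# Hypothetical tract rows: the estimate and MOE values are illustrative, not ACS output
tracts <- data.frame(
  tract_code   = c("000121", "000122"),
  black_aloneE = c(0, 38),
  black_aloneM = c(12, 30)
)

tracts <- tracts %>%
  mutate(
    # Naive ratio: a zero estimate gives Inf, which always exceeds the cutoff
    naive_moe_pct = (black_aloneM / black_aloneE) * 100,
    # Guarded ratio: leave the percentage undefined when the estimate is zero
    guarded_moe_pct = if_else(
      black_aloneE > 0,
      (black_aloneM / black_aloneE) * 100,
      NA_real_
    )
  )

print(tracts)
```

Whether a zero estimate should count as unreliable, be excluded, or be reported separately is an analytic choice worth stating alongside the summary.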
4.2 Pattern Analysis
Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.
# Group tracts by whether they have high MOE issues
# Every tract was flagged in the previous step, so tracts are grouped here by how many
# of the three demographic variables have an MOE above 15%
tract_data1 <- tract_data %>%
  mutate(
    high_white = white_moe_pct > 15,
    high_black = black_moe_pct > 15,
    high_hispanic = hispanic_moe_pct > 15,
    high_count = high_white + high_black + high_hispanic,
    severity = case_when(
      high_count == 3 ~ "High",
      high_count == 2 ~ "Moderate",
      TRUE ~ "Low"
    )
  )
# Calculate average characteristics for each group:
# - population size, demographic percentages
# Use group_by() and summarize() to create this comparison
pattern_tbl <- tract_data1 %>%
  group_by(severity) %>%
  summarise(
    n_tract = n(),
    pop = mean(total_populationE, na.rm = TRUE),
    avg_white = weighted.mean(pct_white, total_populationE, na.rm = TRUE),
    avg_black = weighted.mean(pct_black, total_populationE, na.rm = TRUE),
    avg_hispanic = weighted.mean(pct_hispanic, total_populationE, na.rm = TRUE)
  )
# Create a professional table showing the patterns
kable(
  pattern_tbl,
  align = rep("l", 6),
  col.names = c("Severity", "Number of Tracts", "Average Population", "Avg % White", "Avg % Black", "Avg % Hispanic/Latino"),
  caption = "Patterns among tracts by severity of MOE issues"
)
| Severity | Number of Tracts | Average Population | Avg % White | Avg % Black | Avg % Hispanic/Latino |
|---|---|---|---|---|---|
| High | 19 | 2989.684 | 74.18844 | 0.813323 | 17.55158 |
| Moderate | 3 | 3580.000 | 80.63315 | 1.648045 | 10.55866 |
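Pattern Analysis: Every tract in the three selected counties exceeds the 15% MOE threshold on at least two of the three demographic variables, so no tract falls in the Low severity group. The 19 High severity tracts (all three variables above 15%) have a smaller average population (about 2,990) than the 3 Moderate severity tracts (about 3,580), along with a lower White share (74.2% versus 80.6%) and a higher Hispanic/Latino share (17.6% versus 10.6%). This suggests that demographic data quality issues are not randomly distributed, although a comparison group of only three tracts makes the pattern tentative. The likely drivers are small survey samples and small group counts within tracts, both of which increase uncertainty in ACS estimates.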
Part 5: Policy Recommendations
5.1 Analysis Integration and Professional Summary
Your Task: Write an executive summary that integrates findings from all four analyses.
Executive Summary Requirements:
1. Overall Pattern Identification: Counties and census tracts with smaller populations and uneven demographic distributions tend to exhibit higher margins of error, particularly for income and race/ethnicity variables. At the tract level, demographic MOE issues were widespread across the selected counties, showing that uncertainty is concentrated in specific types of communities.
2. Equity Assessment: Communities most vulnerable to algorithmic bias are those with less reliable demographic and income estimates, especially smaller-population tracts and areas with minority populations represented by relatively small counts. Because algorithmic decision-making systems often rely on these variables to allocate resources, unreliable estimates increase the risk of misclassification.
3. Root Cause Analysis: The primary drivers of data quality issues are structural features of survey-based data collection, including limited sample sizes, population sparsity, and high variability within small geographic units.
4. Strategic Recommendations: To mitigate these risks, departments should incorporate data quality metrics, such as margins of error, directly into analytic workflows rather than relying solely on point estimates. Decision-making systems should also apply caution thresholds in high-uncertainty areas.
Executive Summary:
Counties and census tracts with smaller populations and uneven demographic distributions, including larger shares of residents of color, tend to exhibit higher margins of error, particularly for income and race and ethnicity variables. At the tract level, demographic MOE issues were widespread across the selected counties, and the most severe cases were concentrated in specific types of communities rather than randomly distributed. As a result, the communities most vulnerable to algorithmic bias are those with less reliable demographic and income estimates, especially smaller-population tracts and areas where minority populations are represented by relatively small counts. Because algorithmic decision-making systems often rely on these variables to allocate resources or classify need, high levels of uncertainty increase the risk of misclassification and inequitable outcomes. These data quality challenges are primarily driven by structural features of survey-based data collection, including limited sample sizes, population sparsity, and high variability within small geographic units. To mitigate these risks, departments should incorporate data quality metrics, such as margins of error, directly into analytic workflows rather than relying solely on point estimates, and apply caution thresholds when making decisions in high-uncertainty areas.
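A caution threshold of the kind recommended above could be sketched as follows. The $70,000 cutoff and the decision labels are hypothetical and chosen only for illustration; the estimates and margins of error are the Alameda, Mono, and Trinity rows from the county tables. The idea is that a county is sent to manual review whenever its estimate plus or minus the MOE straddles the cutoff.

```r
# County rows from the tables above; the $70,000 cutoff is a hypothetical example
counties <- data.frame(
  county   = c("Alameda", "Mono", "Trinity"),
  estimate = c(122488, 82038, 47317),
  moe      = c(1231, 15388, 5890)
)
cutoff <- 70000

# Interval implied by the published margin of error
counties$lower <- counties$estimate - counties$moe
counties$upper <- counties$estimate + counties$moe

# If the interval straddles the cutoff, the data alone do not settle the classification
counties$decision <- ifelse(
  counties$upper < cutoff, "below cutoff",
  ifelse(counties$lower > cutoff, "above cutoff", "manual review")
)

print(counties[, c("county", "lower", "upper", "decision")])
```

This rule depends on both the MOE and how close the estimate sits to the cutoff, so a Low Confidence county far from the cutoff can still be classified automatically.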
5.2 Specific Recommendations
Your Task: Create a decision framework for algorithm implementation.
# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
algorithm_recommendations <- income_reliability %>%
select(
county = NAME,
median_income = median_incomeE,
moe_percent = income_moe_pct,
reliability_category = income_reliability
) %>%
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"
# - Low Confidence: "Requires manual review or additional data"
mutate(
algorithm_recommendation = case_when(
reliability_category == "High Confidence" ~ "Safe for algorithmic decisions",
reliability_category == "Moderate Confidence" ~ "Use with caution – monitor outcomes",
reliability_category == "Low Confidence" ~ "Requires manual review or additional data"
),
median_income = round(median_income, 0),
moe_percent = round(moe_percent, 2)
)
algorithm_recommendations %>%
kable(
col.names = c(
"County",
"Median Income",
"MOE (%)",
"Reliability Category",
"Algorithm Recommendation"
),
caption = "Decision Framework for Algorithm Implementation Based on Data Reliability"
)
| County | Median Income | MOE (%) | Reliability Category | Algorithm Recommendation |
|---|---|---|---|---|
| Alameda | 122488 | 1.00 | High Confidence | Safe for algorithmic decisions |
| Alpine | 101125 | 17.25 | Low Confidence | Requires manual review or additional data |
| Amador | 74853 | 8.08 | Moderate Confidence | Use with caution – monitor outcomes |
| Butte | 66085 | 3.42 | High Confidence | Safe for algorithmic decisions |
| Calaveras | 77526 | 5.00 | High Confidence | Safe for algorithmic decisions |
| Colusa | 69619 | 8.25 | Moderate Confidence | Use with caution – monitor outcomes |
| Contra Costa | 120020 | 1.25 | High Confidence | Safe for algorithmic decisions |
| Del Norte | 61149 | 7.16 | Moderate Confidence | Use with caution – monitor outcomes |
| El Dorado | 99246 | 3.36 | High Confidence | Safe for algorithmic decisions |
| Fresno | 67756 | 1.43 | High Confidence | Safe for algorithmic decisions |
| Glenn | 64033 | 6.19 | Moderate Confidence | Use with caution – monitor outcomes |
| Humboldt | 57881 | 3.68 | High Confidence | Safe for algorithmic decisions |
| Imperial | 53847 | 4.11 | High Confidence | Safe for algorithmic decisions |
| Inyo | 63417 | 8.60 | Moderate Confidence | Use with caution – monitor outcomes |
| Kern | 63883 | 2.07 | High Confidence | Safe for algorithmic decisions |
| Kings | 68540 | 3.29 | High Confidence | Safe for algorithmic decisions |
| Lake | 56259 | 4.34 | High Confidence | Safe for algorithmic decisions |
| Lassen | 59515 | 5.97 | Moderate Confidence | Use with caution – monitor outcomes |
| Los Angeles | 83411 | 0.53 | High Confidence | Safe for algorithmic decisions |
| Madera | 73543 | 3.87 | High Confidence | Safe for algorithmic decisions |
| Marin | 142019 | 2.89 | High Confidence | Safe for algorithmic decisions |
| Mariposa | 60021 | 8.82 | Moderate Confidence | Use with caution – monitor outcomes |
| Mendocino | 61335 | 3.58 | High Confidence | Safe for algorithmic decisions |
| Merced | 64772 | 3.31 | High Confidence | Safe for algorithmic decisions |
| Modoc | 54962 | 9.80 | Moderate Confidence | Use with caution – monitor outcomes |
| Mono | 82038 | 18.76 | Low Confidence | Requires manual review or additional data |
| Monterey | 91043 | 2.09 | High Confidence | Safe for algorithmic decisions |
| Napa | 105809 | 2.82 | High Confidence | Safe for algorithmic decisions |
| Nevada | 79395 | 4.82 | High Confidence | Safe for algorithmic decisions |
| Orange | 109361 | 0.81 | High Confidence | Safe for algorithmic decisions |
| Placer | 109375 | 1.70 | High Confidence | Safe for algorithmic decisions |
| Plumas | 67885 | 11.45 | Low Confidence | Requires manual review or additional data |
| Riverside | 84505 | 1.26 | High Confidence | Safe for algorithmic decisions |
| Sacramento | 84010 | 0.97 | High Confidence | Safe for algorithmic decisions |
| San Benito | 104451 | 5.23 | Moderate Confidence | Use with caution – monitor outcomes |
| San Bernardino | 77423 | 1.04 | High Confidence | Safe for algorithmic decisions |
| San Diego | 96974 | 1.02 | High Confidence | Safe for algorithmic decisions |
| San Francisco | 136689 | 1.43 | High Confidence | Safe for algorithmic decisions |
| San Joaquin | 82837 | 1.75 | High Confidence | Safe for algorithmic decisions |
| San Luis Obispo | 90158 | 2.56 | High Confidence | Safe for algorithmic decisions |
| San Mateo | 149907 | 1.75 | High Confidence | Safe for algorithmic decisions |
| Santa Barbara | 92332 | 2.05 | High Confidence | Safe for algorithmic decisions |
| Santa Clara | 153792 | 1.00 | High Confidence | Safe for algorithmic decisions |
| Santa Cruz | 104409 | 3.04 | High Confidence | Safe for algorithmic decisions |
| Shasta | 68347 | 3.63 | High Confidence | Safe for algorithmic decisions |
| Sierra | 61108 | 15.12 | Low Confidence | Requires manual review or additional data |
| Siskiyou | 53898 | 4.90 | High Confidence | Safe for algorithmic decisions |
| Solano | 97037 | 1.78 | High Confidence | Safe for algorithmic decisions |
| Sonoma | 99266 | 2.00 | High Confidence | Safe for algorithmic decisions |
| Stanislaus | 74872 | 1.83 | High Confidence | Safe for algorithmic decisions |
| Sutter | 72654 | 4.71 | High Confidence | Safe for algorithmic decisions |
| Tehama | 59029 | 6.95 | Moderate Confidence | Use with caution – monitor outcomes |
| Trinity | 47317 | 12.45 | Low Confidence | Requires manual review or additional data |
| Tulare | 64474 | 2.31 | High Confidence | Safe for algorithmic decisions |
| Tuolumne | 70432 | 6.66 | Moderate Confidence | Use with caution – monitor outcomes |
| Ventura | 102141 | 1.50 | High Confidence | Safe for algorithmic decisions |
| Yolo | 85097 | 2.74 | High Confidence | Safe for algorithmic decisions |
| Yuba | 66693 | 4.19 | High Confidence | Safe for algorithmic decisions |
# Format as a professional table with kable()
Key Recommendations:
Your Task: Use your analysis results to provide specific guidance to the department.
Counties suitable for immediate algorithmic implementation: Counties with high-confidence income estimates (MOE < 5%) are appropriate for immediate algorithmic use because the underlying median income values are relatively stable and less likely to shift due to sampling noise. Based on the results, these counties include: Alameda, Butte, Calaveras, Contra Costa, El Dorado, Fresno, Humboldt, Imperial, Kern, Kings, Lake, Los Angeles, Madera, Marin, Mendocino, Merced, Monterey, Napa, Nevada, Orange, Placer, Riverside, Sacramento, San Bernardino, San Diego, San Francisco, San Joaquin, San Luis Obispo, San Mateo, Santa Barbara, Santa Clara, Santa Cruz, Shasta, Siskiyou, Solano, Sonoma, Stanislaus, Sutter, Tulare, Ventura, Yolo, and Yuba. In these places, automated prioritization or eligibility scoring based on income is less likely to systematically misclassify residents due to measurement error.
Counties requiring additional oversight: Counties with moderate-confidence estimates (MOE 5–10%) can still use algorithmic tools, but only with monitoring because uncertainty is high enough that rankings and thresholds may be sensitive to small changes in the estimate. These counties include: Amador, Colusa, Del Norte, Glenn, Inyo, Lassen, Mariposa, Modoc, San Benito, Tehama, and Tuolumne. For these counties, the department should track outcomes for false positives or false negatives, and periodically re-run the analysis when updated ACS releases become available.
Counties needing alternative approaches: Counties with low-confidence estimates (MOE > 10%) should not rely on automated income-based decisions without added safeguards, because the uncertainty is large enough to meaningfully distort classifications. These counties include Alpine, Mono, Plumas, Sierra, and Trinity. In these areas, the department should require manual review for any high-stakes decision, or aggregate data to a larger geography to reduce uncertainty.
Questions for Further Investigation
- Does high MOE cluster spatially, and does that pattern remain stable across different ACS years?
- Do certain demographic contexts consistently show higher uncertainty at the tract level?
- How should algorithms adjust their thresholds in response?
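The suggestion to aggregate to a larger geography, made for the low-confidence counties, rests on how margins of error combine. For count variables (not medians such as household income), a common approximation for the MOE of a sum is the square root of the sum of squared MOEs. The sketch below uses hypothetical tract counts and MOEs to show how the relative MOE shrinks after aggregation.

```r
# Hypothetical tract-level counts and MOEs for one group (not ACS output)
estimates <- c(400, 350, 500)
moes      <- c(120, 110, 140)

# Approximate MOE of a sum: square root of the sum of squared MOEs
combined_estimate <- sum(estimates)
combined_moe      <- sqrt(sum(moes^2))

# Relative MOE for each tract and for the combined area
tract_moe_pct    <- round(moes / estimates * 100, 1)
combined_moe_pct <- round(combined_moe / combined_estimate * 100, 1)

print(tract_moe_pct)
print(combined_moe_pct)
```

The tidycensus package provides moe_sum() for this calculation; the base R version is shown only to make the arithmetic explicit. The cost of aggregation is lost geographic detail, which should be weighed against the gain in reliability.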
Technical Notes
Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on January 31, 2026
Reproducibility:
- All analysis conducted in RStudio version 2025.09.0+387
- Census API key required for replication
- Complete code and documentation available at: cocoz1yn.github.io/PPA_cz/
Methodology Notes: This analysis uses 2022 American Community Survey 5-year estimates to ensure consistent coverage at both the county and census tract levels. County selection for tract-level analysis was based on data reliability categories derived from income margin-of-error percentages, with counties chosen to represent different confidence levels. All margin-of-error percentages were calculated using the ratio of the ACS margin of error to the corresponding estimate, and results were classified using consistent thresholds across geographies. Data processing and summarizing were conducted using reproducible R workflows, including standardized variable naming, wide-format ACS outputs, and explicit handling of margins of error.
Limitations: Several limitations should be considered when interpreting the results. First, ACS estimates are based on survey samples rather than complete population counts, which introduces inherent uncertainty that is amplified at smaller geographic scales such as census tracts. Second, the use of a single ACS release limits the ability to assess temporal trends or changes in data reliability over time. Third, tract-level analyses were restricted to a subset of counties, which may limit the generalization of observed patterns across the entire state. Finally, margin of error thresholds used to classify data reliability represent analytic choices and may influence how counties and tracts are categorized.
Submission Checklist
Before submitting your portfolio link on Canvas:
Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/labs/lab_1/your_file_name.html