---
title: "Bikeshare Space-Time Prediction"
subtitle: "CPLN 5080 - Spring 2026"
author: "Coco Zhou"
date: April 14, 2026
format:
  html:
    code-fold: show
    code-tools: true
    toc: true
    toc-depth: 3
    toc-location: left
    theme: cosmo
    embed-resources: true
editor: visual
execute:
  warning: false
  message: false
---
# Introduction
This report compares Q1 and Q3 ridership patterns, with particular attention to Q3 as the higher-demand season. I chose these quarters to examine how model performance changes across seasonal conditions, contrasting a lower-demand winter period with a busier summer period. The comparison is useful because bike-share demand is not constant throughout the year: summer ridership tends to be higher and more variable, while winter demand is generally lower and more stable. By analyzing both quarters, the report can assess whether the models perform consistently across seasons and whether some periods are harder to predict than others.
# SET UP
```{r setup}
#| message: false
#| warning: false
library(tidyverse)
library(lubridate)
library(sf)
library(tigris)
library(tidycensus)
library(riem) # For Philadelphia weather from ASOS stations
library(viridis) # color scales for visualization
library(knitr)
library(kableExtra)
library(patchwork)
library(here)
options(scipen = 999) # Get rid of scientific notation to make it look tidier
```
```{r themes}
plotTheme <- theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(size = 8),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title = element_text(size = 11, face = "bold"),
panel.grid.major = element_line(colour = "#D0D0D0", linewidth = 0.2),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
legend.position = "right"
)
mapTheme <- theme_void() +
theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(size = 8),
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, "cm"),
legend.key.height = unit(1, "cm"),
legend.key.width = unit(0.2, "cm")
)
palette5 <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c")
```
```{r set-census-api-key}
census_api_key(Sys.getenv("CENSUS_API_KEY"))
#Use the following code if you're not able to access the key in your environment
#census_api_key("yourkeyhere", install=TRUE, overwrite=TRUE)
```
# IMPORT AND PREP DATA Q3
## Load Indego Trip Data (Q3 2025)
```{r load_indego}
# Read Q3 2025 data
indego <- read_csv("data/indego-trips-2025-q3.csv")
# Quick look at the data
glimpse(indego)
```
## Examine the Data Structure
```{r explore_data}
# How many trips?
cat("Total trips in Q3 2025:", nrow(indego), "\n")
# Date range
cat("Date range:",
format(min(mdy_hm(indego$start_time))), "to",
format(max(mdy_hm(indego$start_time))), "\n") # format() so cat() prints dates, not numeric seconds
# How many unique stations?
cat("Unique start stations:", length(unique(indego$start_station)), "\n")
# Trip types
table(indego$trip_route_category)
# Passholder types
table(indego$passholder_type)
# Bike types
table(indego$bike_type)
```
## Create Time Bins
```{r create-time-bins}
indego <- indego %>%
mutate(
# Parse datetime
start_datetime = mdy_hm(start_time),
end_datetime = mdy_hm(end_time),
# Create hourly bins
interval60 = floor_date(start_datetime, unit = "hour"),
# Extract time features
week = week(interval60),
month = month(interval60, label = TRUE),
dotw = wday(interval60, label = TRUE),
hour = hour(interval60),
date = as.Date(interval60),
# Create useful indicators
weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
)
# Look at temporal features
head(indego %>% select(start_datetime, interval60, week, dotw, hour, weekend))
```
# EDA OF INDEGO DATA
## Trips Over Time
```{r trips_over_time}
#| message: false
#| warning: false
# Daily trip counts
daily_trips <- indego %>%
group_by(date) %>%
summarize(trips = n())
ggplot(daily_trips, aes(x = date, y = trips)) +
geom_line(color = "#3182bd", linewidth = 1) +
geom_smooth(se = FALSE, color = "red", linetype = "dashed") +
labs(
title = "Indego Daily Ridership - Q3 2025",
subtitle = "Summer demand patterns in Philadelphia",
x = "Date",
y = "Daily Trips",
caption = "Source: Indego bike share"
) +
plotTheme
```
During summer 2025, Indego bike-share ridership in Philadelphia stayed relatively high and trended upward. Although daily trips fluctuated considerably, with regular peaks and drops suggesting a weekly pattern, the red dashed trend line indicates that ridership generally rose from just above 4,000 trips in early July to around 5,600 by late September. Overall, the graph suggests growing demand for bike-share over the summer despite short-term variation.
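The weekly rhythm read off the chart can be made explicit; a quick sketch using the `indego` object loaded above:

```{r}
# Average daily trips by day of week, to confirm the weekly peaks and
# drops visible in the daily ridership plot above
indego %>%
  group_by(date) %>%
  summarize(trips = n(), .groups = "drop") %>%
  mutate(dotw = wday(date, label = TRUE)) %>%
  group_by(dotw) %>%
  summarize(avg_trips = mean(trips), .groups = "drop")
```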
```{r}
fly_eagles_fly <- daily_trips %>% filter(date == "2025-09-01")
typical_boring_monday <- indego %>%
filter(dotw == "Mon", date != "2025-09-01") %>%
group_by(date) %>%
summarize(trips = n()) %>%
summarize(avg_monday_trips = mean(trips))
print(fly_eagles_fly)
print(typical_boring_monday)
```
On September 1, 2025, Indego recorded 4,664 trips, compared with an average of 4,962 trips for other Mondays, so ridership was about 298 trips lower than usual, or roughly 6% below the typical Monday level. This suggests that bike-share use was slightly reduced on that day, which makes sense because September 1, 2025 was Labor Day, a national holiday. Even though holidays can increase recreational riding for some people, they often reduce regular commuting trips, which likely pulled total ridership below the usual Monday average.
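The gap quoted above can be computed directly from the two objects created in the previous chunk:

```{r}
# Shortfall on Labor Day relative to a typical Monday, in trips and percent
gap <- typical_boring_monday$avg_monday_trips - fly_eagles_fly$trips
cat("Labor Day shortfall:", round(gap), "trips,",
    round(100 * gap / typical_boring_monday$avg_monday_trips, 1),
    "% below the typical Monday\n")
```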
## Hourly Patterns
```{r hourly_patterns}
# Average trips by hour and day type
hourly_patterns <- indego %>%
group_by(hour, weekend) %>%
summarize(avg_trips = n() / n_distinct(date)) %>%
mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))
ggplot(hourly_patterns, aes(x = hour, y = avg_trips, color = day_type)) +
geom_line(linewidth = 1.2) +
scale_color_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
labs(
title = "Average Hourly Ridership Patterns",
subtitle = "Clear commute patterns on weekdays",
x = "Hour of Day",
y = "Average Trips per Hour",
color = "Day Type"
) +
plotTheme
```
## Top Stations
```{r top_stations}
# Most popular origin stations
top_stations <- indego %>%
count(start_station, start_lat, start_lon, name = "trips") %>%
arrange(desc(trips)) %>%
head(20)
kable(top_stations,
caption = "Top 20 Indego Stations by Trip Origins",
format.args = list(big.mark = ",")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
# PUT DATA IN CONTEXT
## Load Philadelphia Census Data
```{r load_census}
#| message: false
#| warning: false
#| results: 'hide'
# Get Philadelphia census tracts
philly_census <- get_acs(
geography = "tract",
variables = c(
"B01003_001", # Total population
"B19013_001", # Median household income
"B08301_001", # Total commuters
"B08301_010", # Commute by transit
"B02001_002", # White alone
"B25077_001" # Median home value
),
state = "PA",
county = "Philadelphia",
year = 2022,
geometry = TRUE,
output = "wide"
) %>%
rename(
Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Total_Commuters = B08301_001E,
Transit_Commuters = B08301_010E,
White_Pop = B02001_002E,
Med_Home_Value = B25077_001E
) %>%
mutate(
Percent_Taking_Transit = (Transit_Commuters / Total_Commuters) * 100,
Percent_White = (White_Pop / Total_Pop) * 100
) %>%
st_transform(crs = 4326) #transforming to WGS84 for lat/lon matching, currently in NAD83
# Check the data
glimpse(philly_census)
```
## Map Philadelphia Context
```{r map_philly}
# Map median income
ggplot() +
geom_sf(data = philly_census, aes(fill = Med_Inc), color = NA) +
scale_fill_viridis(
option = "viridis",
name = "Median\nIncome",
labels = scales::dollar
) +
labs(
title = "Philadelphia Median Household Income by Census Tract",
subtitle = "Context for understanding bike share demand patterns"
) +
# Stations
geom_point(
data = indego,
aes(x = start_lon, y = start_lat),
color = "red", size = 0.25, alpha = 0.6
) +
mapTheme
```
## Join Census Data to Stations
```{r join_census_to_stations}
# Create sf object for stations
stations_sf <- indego %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat), !is.na(start_lon)) %>%
st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326)
# Spatial join to get census tract for each station
stations_census <- st_join(stations_sf, philly_census, left = TRUE) %>%
st_drop_geometry()
# Look at the result - investigate whether all of the stations joined to census data -- according to the map above there are stations in non-residential tracts.
stations_for_map <- indego %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat), !is.na(start_lon)) %>%
left_join(
stations_census %>% select(start_station, Med_Inc),
by = "start_station"
) %>%
mutate(has_census = !is.na(Med_Inc))
# Add back to trip data
indego_census <- indego %>%
left_join(
stations_census %>%
select(start_station, Med_Inc, Percent_Taking_Transit,
Percent_White, Total_Pop),
by = "start_station"
)
# Create the map showing problem stations
ggplot() +
geom_sf(data = philly_census, aes(fill = Med_Inc), color = "white", size = 0.1) +
scale_fill_viridis(
option = "viridis",
name = "Median\nIncome",
labels = scales::dollar,
na.value = "grey90"
) +
# Stations with census data (small grey dots)
geom_point(
data = stations_for_map %>% filter(has_census),
aes(x = start_lon, y = start_lat),
color = "grey30", size = 1, alpha = 0.6
) +
# Stations WITHOUT census data (red X marks the spot)
geom_point(
data = stations_for_map %>% filter(!has_census),
aes(x = start_lon, y = start_lat),
color = "red", size = 1, shape = 4, stroke = 1.5
) +
labs(
title = "Philadelphia Median Household Income by Census Tract",
subtitle = "Indego stations shown (RED = no census data match)",
caption = "Red X marks indicate stations that didn't join to census tracts"
) +
mapTheme
```
## Dealing with missing data
```{r}
# Identify which stations to keep
valid_stations <- stations_census %>%
filter(!is.na(Med_Inc)) %>%
pull(start_station)
# Filter trip data to valid stations only
indego_census <- indego %>%
filter(start_station %in% valid_stations) %>%
left_join(
stations_census %>%
select(start_station, Med_Inc, Percent_Taking_Transit,
Percent_White, Total_Pop),
by = "start_station"
)
```
## Get Weather Data
```{r get_weather}
# Get weather from Philadelphia International Airport (KPHL)
# This covers Q3 2025: July 1 - September 30
weather_data <- riem_measures(
station = "PHL", # Philadelphia International Airport
date_start = "2025-07-01",
date_end = "2025-09-30"
)
# Process weather data
weather_processed <- weather_data %>%
mutate(
interval60 = floor_date(valid, unit = "hour"),
Temperature = tmpf, # Temperature in Fahrenheit
Precipitation = ifelse(is.na(p01i), 0, p01i), # Hourly precip in inches
Wind_Speed = sknt # Wind speed in knots
) %>%
select(interval60, Temperature, Precipitation, Wind_Speed) %>%
distinct()
# Check for missing hours and interpolate if needed
weather_complete <- weather_processed %>%
complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
fill(Temperature, Precipitation, Wind_Speed, .direction = "down")
# Look at the weather
summary(weather_complete %>% select(Temperature, Precipitation, Wind_Speed))
```
## Visualize Weather Patterns
```{r visualize-weather}
#| message: false
#| warning: false
ggplot(weather_complete, aes(x = interval60, y = Temperature)) +
geom_line(color = "#3182bd", alpha = 0.7) +
geom_smooth(se = FALSE, color = "red") +
labs(
title = "Philadelphia Temperature - Q3 2025",
subtitle = "Hourly temperatures, July through September",
x = "Date",
y = "Temperature (°F)"
) +
plotTheme
```
# CREATE SPACE-TIME PANEL
## Aggregate Trips to Station-Hour Level
```{r aggregate-trips}
#| message: false
#| warning: false
# Count trips by station-hour
trips_panel <- indego_census %>%
group_by(interval60, start_station, start_lat, start_lon,
Med_Inc, Percent_Taking_Transit, Percent_White, Total_Pop) %>%
summarize(Trip_Count = n()) %>%
ungroup()
# How many station-hour observations?
nrow(trips_panel)
# How many unique stations?
length(unique(trips_panel$start_station))
# How many unique hours?
length(unique(trips_panel$interval60))
```
## Create Complete Panel Structure
```{r complete-panel}
# Calculate expected panel size
n_stations <- length(unique(trips_panel$start_station))
n_hours <- length(unique(trips_panel$interval60))
expected_rows <- n_stations * n_hours
cat("Expected panel rows:", format(expected_rows, big.mark = ","), "\n")
cat("Current rows:", format(nrow(trips_panel), big.mark = ","), "\n")
cat("Missing rows:", format(expected_rows - nrow(trips_panel), big.mark = ","), "\n")
```
```{r}
# Create complete panel
study_panel <- expand.grid(
interval60 = unique(trips_panel$interval60),
start_station = unique(trips_panel$start_station)
) %>%
# Join trip counts
left_join(trips_panel, by = c("interval60", "start_station")) %>%
# Replace NA trip counts with 0
mutate(Trip_Count = replace_na(Trip_Count, 0))
# Fill in station attributes (they're the same for all hours)
station_attributes <- trips_panel %>%
group_by(start_station) %>%
summarize(
start_lat = first(start_lat),
start_lon = first(start_lon),
Med_Inc = first(Med_Inc),
Percent_Taking_Transit = first(Percent_Taking_Transit),
Percent_White = first(Percent_White),
Total_Pop = first(Total_Pop)
)
study_panel <- study_panel %>%
left_join(station_attributes, by = "start_station")
# Verify we have complete panel
cat("Complete panel rows:", format(nrow(study_panel), big.mark = ","), "\n")
```
## Add Time Features
```{r add_time_features}
study_panel <- study_panel %>%
mutate(
week = week(interval60),
month = month(interval60, label = TRUE),
dotw = wday(interval60, label = TRUE),
hour = hour(interval60),
date = as.Date(interval60),
weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
)
```
## Join Weather Data
```{r join_weather}
study_panel <- study_panel %>%
left_join(weather_complete, by = "interval60")
# Check for missing values
summary(study_panel %>% select(Trip_Count, Temperature, Precipitation))
```
# CREATE TEMPORAL LAG VARIABLES
```{r create_lags}
# Sort by station and time
study_panel <- study_panel %>%
arrange(start_station, interval60)
# Create lag variables WITHIN each station
study_panel <- study_panel %>%
group_by(start_station) %>%
mutate(
lag1Hour = lag(Trip_Count, 1),
lag2Hours = lag(Trip_Count, 2),
lag3Hours = lag(Trip_Count, 3),
lag12Hours = lag(Trip_Count, 12),
lag1day = lag(Trip_Count, 24)
) %>%
ungroup()
# Remove rows with NA lags (first 24 hours for each station)
study_panel_complete <- study_panel %>%
filter(!is.na(lag1day))
cat("Rows after removing NA lags:", format(nrow(study_panel_complete), big.mark = ","), "\n")
```
## Visualize Lag Correlations
```{r lag_correlations}
# Sample one station to visualize
example_station <- study_panel_complete %>%
filter(start_station == first(start_station)) %>%
head(168) # One week
# Plot actual vs lagged demand
ggplot(example_station, aes(x = interval60)) +
geom_line(aes(y = Trip_Count, color = "Current"), linewidth = 1) +
geom_line(aes(y = lag1Hour, color = "1 Hour Ago"), linewidth = 1, alpha = 0.7) +
geom_line(aes(y = lag1day, color = "24 Hours Ago"), linewidth = 1, alpha = 0.7) +
scale_color_manual(values = c(
"Current" = "#08519c",
"1 Hour Ago" = "#3182bd",
"24 Hours Ago" = "#6baed6"
)) +
labs(
title = "Temporal Lag Patterns at One Station",
subtitle = "Past demand predicts future demand",
x = "Date-Time",
y = "Trip Count",
color = "Time Period"
) +
plotTheme
```
# TEMPORAL TRAIN/TEST SPLIT
```{r temporal-split}
# Split by week
# Q3 has weeks 27-40 (July-Sept)
# Train on weeks 27-35
# Test on weeks 36-40
# Which stations have trips in BOTH early and late periods?
early_stations <- study_panel_complete %>%
filter(week < 36) %>%
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
late_stations <- study_panel_complete %>%
filter(week >= 36) %>%
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
# Keep only stations that appear in BOTH periods
common_stations <- intersect(early_stations, late_stations)
# Filter panel to only common stations
study_panel_complete <- study_panel_complete %>%
filter(start_station %in% common_stations)
# NOW create train/test split
train <- study_panel_complete %>%
filter(week < 36)
test <- study_panel_complete %>%
filter(week >= 36)
cat("Training observations:", format(nrow(train), big.mark = ","), "\n")
cat("Testing observations:", format(nrow(test), big.mark = ","), "\n")
```
# BUILD PREDICTIVE MODELS
## Model 1: Baseline (Time + Weather)
```{r model1}
# Create day of week factor with treatment (dummy) coding
train <- train %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(train$dotw_simple) <- contr.treatment(7)
# Now run the model
model1 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation,
data = train
)
summary(model1)
```
## Model 2: Add Temporal Lags
```{r model2}
model2 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day,
data = train
)
summary(model2)
```
## Model 3: Add Demographics
```{r model3}
# The earlier joins duplicated the census columns (suffixed .x and .y);
# coalesce them back into single variables before modeling
train <- train %>%
mutate(
Med_Inc = coalesce(Med_Inc.x, Med_Inc.y),
Percent_Taking_Transit = coalesce(Percent_Taking_Transit.x, Percent_Taking_Transit.y),
Percent_White = coalesce(Percent_White.x, Percent_White.y)
)
model3 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
Med_Inc + Percent_Taking_Transit + Percent_White,
data = train
)
summary(model3)
```
## Model 4: Add Station Fixed Effects
```{r model4}
model4 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
Med_Inc + Percent_Taking_Transit + Percent_White +
as.factor(start_station),
data = train
)
# Summary too long with all station dummies, just show key metrics
cat("Model 4 R-squared:", summary(model4)$r.squared, "\n")
cat("Model 4 Adj R-squared:", summary(model4)$adj.r.squared, "\n")
```
## Model 5: Add Rush Hour Interaction
```{r model5}
model5 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day + rush_hour + as.factor(month) +
Med_Inc + Percent_Taking_Transit + Percent_White +
as.factor(start_station) +
rush_hour * weekend, # Rush hour effects different on weekends
data = train
)
cat("Model 5 R-squared:", summary(model5)$r.squared, "\n")
cat("Model 5 Adj R-squared:", summary(model5)$adj.r.squared, "\n")
```
# MODEL EVALUATION
## Calculate Predictions and MAE
```{r calculate_mae}
# Get predictions on test set
# Create day of week factor with treatment (dummy) coding
test <- test %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
test <- test %>%
mutate(
Med_Inc = coalesce(Med_Inc.x, Med_Inc.y),
Percent_Taking_Transit = coalesce(Percent_Taking_Transit.x, Percent_Taking_Transit.y),
Percent_White = coalesce(Percent_White.x, Percent_White.y)
)
# Set contrasts to treatment coding (dummy variables)
contrasts(test$dotw_simple) <- contr.treatment(7)
test <- test %>%
mutate(
pred1 = predict(model1, newdata = test),
pred2 = predict(model2, newdata = test),
pred3 = predict(model3, newdata = test),
pred4 = predict(model4, newdata = test),
pred5 = predict(model5, newdata = test)
)
# Calculate MAE for each model
mae_results <- data.frame(
Model = c(
"1. Time + Weather",
"2. + Temporal Lags",
"3. + Demographics",
"4. + Station FE",
"5. + Rush Hour Interaction"
),
MAE = c(
mean(abs(test$Trip_Count - test$pred1), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred3), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred4), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred5), na.rm = TRUE)
)
)
kable(mae_results,
digits = 2,
caption = "Mean Absolute Error by Model (Test Set)",
col.names = c("Model", "MAE (trips)")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
## Visualize Model Comparison
```{r compare_models}
ggplot(mae_results, aes(x = reorder(Model, -MAE), y = MAE)) +
geom_col(fill = "#3182bd", alpha = 0.8) +
geom_text(aes(label = round(MAE, 2)), vjust = -0.5) +
labs(
title = "Model Performance Comparison",
subtitle = "Lower MAE = Better Predictions",
x = "Model",
y = "Mean Absolute Error (trips)"
) +
plotTheme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
# Compare Results to Q1 2025
## MAE Comparison
```{r}
## MAE comparison
q1_mae <- c(0.60, 0.50, 0.74, 0.73, 0.73) ## Metrics from in-class exercise
mae_compare <- data.frame(
Model = mae_results$Model,
Q1_MAE = q1_mae,
Q3_MAE = mae_results$MAE
)
kable(
mae_compare,
digits = 2,
caption = "Mean Absolute Error by Model: Q1 vs Q3 (Test Set)",
col.names = c("Model", "Q1_MAE (trips)", "Q3_MAE (trips)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
Overall, Model 2, which adds temporal lags, performs best because it has the lowest MAE in Q1 and remains tied for the lowest MAE in Q3. Model 1 has the highest error in both quarters, showing that time and weather alone are less effective at predicting ridership. The results also show that errors are generally higher in Q3 than in Q1, suggesting that summer bike-share demand is harder to predict than winter demand. This likely reflects greater variability in summer travel behavior, while the strong performance of Model 2 suggests that recent trip patterns remain one of the most useful predictors across seasons.
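To make the seasonal gap concrete, the Q1-to-Q3 change in MAE can be tabulated from the `mae_compare` object built above:

```{r}
# Absolute and percent change in test-set MAE from Q1 to Q3, by model
mae_compare %>%
  mutate(
    Diff = Q3_MAE - Q1_MAE,
    Pct_Change = round(100 * Diff / Q1_MAE, 1)
  )
```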
## Temporal Patterns Comparison (summer vs. winter)
```{r}
## Comparison between Q3 2025 and Q1 2025 (summer vs. winter)
library(ggplot2)
library(png)
library(grid)
library(cowplot)
img1 <- png::readPNG("data/daily_ridership_Q3.png")
p_left <- grid::rasterGrob(img1, interpolate = TRUE)
img2 <- png::readPNG("data/daily_ridership_Q1.png")
p_right <- grid::rasterGrob(img2, interpolate = TRUE)
cowplot::plot_grid(
p_left,
p_right,
labels = c("Summer (Q3)", "Winter (Q1)"),
ncol = 2
)
```
Summer and winter show clearly different temporal ridership patterns. In summer, daily trips stay consistently high, with strong weekday–weekend fluctuations, rising through mid-season before tapering toward October. Winter begins at much lower levels and shows a steady upward trend as temperatures warm, with ridership recovering from January lows and climbing sharply into March and April. Overall, summer displays high and sustained demand while winter reflects low but steadily increasing activity, a clear seasonal effect in bike-share usage.
## Important Features
```{r}
# Use partial R square to determine the most important feature
library(rsq)
rsq.partial(model2)
```
In both Q1 and Q3, the one-hour lag (lag1Hour) is by far the most important predictor, followed by hour-of-day and the one-day lag. Weather and day-of-week contribute very little once temporal patterns and lags are included, and their independent explanatory power is even smaller in Q3.
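The ranking described above can be shown explicitly by sorting the partial R-squared output (this assumes `rsq.partial()` returns its usual list with `variable` and `partial.rsq` components):

```{r}
# Sort predictors in model2 by partial R-squared, largest first
pr2 <- rsq.partial(model2)
data.frame(variable = pr2$variable, partial_rsq = pr2$partial.rsq) %>%
  arrange(desc(partial_rsq))
```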
# SPACE-TIME ERROR ANALYSIS
## Observed vs. Predicted
```{r obs_vs_pred}
#| message: false
#| warning: false
test <- test %>%
mutate(
error = Trip_Count - pred2, # best-performing model (change if necessary)
abs_error = abs(error),
time_of_day = case_when(
hour < 7 ~ "Overnight",
hour >= 7 & hour < 10 ~ "AM Rush",
hour >= 10 & hour < 15 ~ "Mid-Day",
hour >= 15 & hour <= 18 ~ "PM Rush",
hour > 18 ~ "Evening"
)
)
# Scatter plot by time and day type
ggplot(test, aes(x = Trip_Count, y = pred2)) +
geom_point(alpha = 0.2, color = "#3182bd") +
geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
facet_grid(weekend ~ time_of_day) +
labs(
title = "Observed vs. Predicted Bike Trips",
subtitle = "Model 2 performance by time period",
x = "Observed Trips",
y = "Predicted Trips",
caption = "Red line = perfect predictions; Green line = actual model fit"
) +
plotTheme
```
## Spatial Error Patterns
```{r spatial_errors}
# Calculate MAE by station
station_errors <- test %>%
filter(!is.na(pred2)) %>%
group_by(start_station, start_lat.x, start_lon.y) %>%
summarize(
MAE = mean(abs(Trip_Count - pred2), na.rm = TRUE),
avg_demand = mean(Trip_Count, na.rm = TRUE),
.groups = "drop"
) %>%
filter(!is.na(start_lat.x), !is.na(start_lon.y))
## Create two maps side-by-side with proper legends
# Map 1: Prediction Errors
p1 <- ggplot() +
geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.1) +
geom_point(
data = station_errors,
aes(x = start_lon.y, y = start_lat.x, color = MAE),
size = 0.8,
alpha = 0.7
) +
scale_color_viridis(
option = "plasma",
name = "MAE (trips)",
direction = -1,
breaks = c(0.5, 1.0, 1.5),
labels = c("0.5", "1.0", "1.5")
) +
labs(title = "Prediction Errors") +
mapTheme +
theme(
legend.position = "bottom",
legend.title = element_text(size = 10, face = "bold"),
legend.text = element_text(size = 9),
plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
) +
guides(color = guide_colorbar(
barwidth = 12,
barheight = 1,
title.position = "top",
title.hjust = 0.5
))
# Map 2: Average Demand
p2 <- ggplot() +
geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.1) +
geom_point(
data = station_errors,
aes(x = start_lon.y, y = start_lat.x, color = avg_demand),
size = 0.8,
alpha = 0.7
) +
scale_color_viridis(
option = "viridis",
name = "Avg Demand (trips/hour)",
direction = -1,
breaks = c(0.5, 1.0, 1.5, 2.0, 2.5),
labels = c("0.5", "1.0", "1.5", "2.0", "2.5")
) +
labs(title = "Average Demand") +
mapTheme +
theme(
legend.position = "bottom",
legend.title = element_text(size = 10, face = "bold"),
legend.text = element_text(size = 9),
plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
) +
guides(color = guide_colorbar(
barwidth = 12,
barheight = 1,
title.position = "top",
title.hjust = 0.5
))
library(gridExtra)
# Combine
grid.arrange(
p1, p2,
ncol = 2
)
```
## Temporal Error Patterns
```{r temporal_errors}
# MAE by time of day and day type
temporal_errors <- test %>%
group_by(time_of_day, weekend) %>%
summarize(
MAE = mean(abs_error, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))
ggplot(temporal_errors, aes(x = time_of_day, y = MAE, fill = day_type)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
labs(
title = "Prediction Errors by Time Period",
subtitle = "When is the model struggling most?",
x = "Time of Day",
y = "Mean Absolute Error (trips)",
fill = "Day Type"
) +
plotTheme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
## Errors and Demographics
```{r errors_demographics}
#| message: false
#| warning: false
# Join demographic data to station errors
station_errors_demo <- station_errors %>%
left_join(
station_attributes %>% select(start_station, Med_Inc, Percent_Taking_Transit, Percent_White),
by = "start_station"
) %>%
filter(!is.na(Med_Inc))
# Create plots
p1 <- ggplot(station_errors_demo, aes(x = Med_Inc, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
scale_x_continuous(labels = scales::dollar) +
labs(title = "Errors vs. Median Income", x = "Median Income", y = "MAE") +
plotTheme
p2 <- ggplot(station_errors_demo, aes(x = Percent_Taking_Transit, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Errors vs. Transit Usage", x = "% Taking Transit", y = "MAE") +
plotTheme
p3 <- ggplot(station_errors_demo, aes(x = Percent_White, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Errors vs. Race", x = "% White", y = "MAE") +
plotTheme
(p1 | p2) / (p3 | plot_spacer())
```
Errors tend to be slightly higher in higher-income and higher-percent-White neighborhoods, while they are lower in areas with high transit usage. This suggests the model struggles more in wealthier, less transit-oriented areas. The equity implication is that model performance is not uniform—some communities may receive less accurate predictions than others, which could affect planning decisions if not addressed.
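A simple way to check how strong these associations are is to correlate station-level MAE with each demographic variable; a sketch using the `station_errors_demo` object built above:

```{r}
# Pearson correlations between station MAE and each demographic variable;
# small p-values would suggest the visual patterns are unlikely to be noise
sapply(
  c("Med_Inc", "Percent_Taking_Transit", "Percent_White"),
  function(v) {
    ct <- cor.test(station_errors_demo$MAE, station_errors_demo[[v]])
    c(correlation = unname(ct$estimate), p_value = ct$p.value)
  }
)
```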
# Feature Engineering
## Spatial features
```{r}
library(geosphere)
## Distance to Center City
# Center City coordinate (Use Philadelphia City Hall)
center_lat <- 39.952800
center_lon <- -75.163500
# Calculate distance in kilometers
study_panel_complete <- study_panel_complete %>%
mutate(
dist_to_center = distHaversine(
cbind(start_lon.y, start_lat.y),
cbind(center_lon, center_lat)
) / 1000 # convert meters to km
)
```
## Trip history features
```{r}
## Rolling 7-day average demand
library(slider)
study_panel_complete <- study_panel_complete %>%
arrange(start_station, date, hour) %>%
group_by(start_station) %>%
mutate(
rolling7 = slide_dbl(
lag(Trip_Count, 1), # lag first so the window excludes the current hour (avoids target leakage)
mean,
.before = 167, # 1 lag + 167 earlier hours = the previous 168 hours (7 days)
.complete = TRUE,
na.rm = TRUE
)
) %>%
ungroup()
```
## Split train/test Dataset
```{r}
# Which stations have trips in BOTH early and late periods?
early_stations <- study_panel_complete %>%
filter(week < 36) %>%
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
late_stations <- study_panel_complete %>%
filter(week >= 36) %>%
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
# Keep only stations that appear in BOTH periods
common_stations <- intersect(early_stations, late_stations)
# Filter panel to only common stations
study_panel_complete <- study_panel_complete %>%
filter(start_station %in% common_stations)
# NOW create train/test split
train <- study_panel_complete %>%
filter(week < 36)
test <- study_panel_complete %>%
filter(week >= 36)
cat("Training observations:", format(nrow(train), big.mark = ","), "\n")
cat("Testing observations:", format(nrow(test), big.mark = ","), "\n")
cat("Training date range:", min(train$date), "to", max(train$date), "\n")
cat("Testing date range:", min(test$date), "to", max(test$date), "\n")
```
## Add two new predictors to the best model
```{r}
# Create day of week factor with treatment (dummy) coding
train <- train %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(train$dotw_simple) <- contr.treatment(7)
model_n <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day + dist_to_center + rolling7,
data = train
)
summary(model_n)
```
## New Model Evaluation and MAE
```{r}
# Get predictions on test set
# NOTE: the table below also references pred_pois, which is created in the
# "Poisson model" section further down; run that chunk first, then this one
# Create day of week factor with treatment (dummy) coding
test <- test %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(test$dotw_simple) <- contr.treatment(7)
test <- test %>%
mutate(
pred_n = predict(model_n, newdata = test)
)
# Calculate MAE for each model
mae_results_new <- data.frame(
Model = c(
"2. + Temporal Lags",
"New Model. + Rolling 7-day average demand and Distance to Center City",
"New Poisson Model"
),
MAE = c(
mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred_n), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred_pois), na.rm = TRUE)
),
RMSE = c(
sqrt(mean((test$Trip_Count - test$pred2)^2, na.rm = TRUE)),
sqrt(mean((test$Trip_Count - test$pred_n)^2, na.rm = TRUE)),
sqrt(mean((test$Trip_Count - test$pred_pois)^2, na.rm = TRUE))
)
)
kable(mae_results_new,
digits = 5,
caption = "Mean Absolute Error by Model (Test Set)",
col.names = c("Model", "MAE (trips)", "RMSE (trips)")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
The two new models perform similarly, while the Temporal Lags model cannot be evaluated because its MAE and RMSE are NaN. Of the valid models, the New Poisson Model has the lower MAE, so it is slightly better on average, while the rolling 7-day average model has the lower RMSE, suggesting more stable predictions with fewer large errors.
## Poisson model
```{r}
# Train a Poisson model with the same predictors as the new linear model
model_poisson <- glm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day + dist_to_center + rolling7,
  family = poisson(link = "log"),
  data = train
)
summary(model_poisson)
```
```{r}
# Check for overdispersion: Pearson chi-square divided by residual df
dispersion <- sum(residuals(model_poisson, type = "pearson")^2) /
  model_poisson$df.residual
cat("Dispersion parameter:", round(dispersion, 2), "\n")
cat("Rule of thumb: a value above 1.5 suggests overdispersion\n")
if (dispersion > 1.5) {
  cat("⚠ Overdispersion detected! Consider a Negative Binomial model.\n")
} else {
  cat("✓ Dispersion looks okay for the Poisson model.\n")
}
```
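If overdispersion is flagged above, the check's message points toward a Negative Binomial model. A minimal sketch of that refit, assuming the `MASS` package is installed (`model_nb` is a hypothetical name, and the chunk is not part of the main comparison; `MASS::` is used directly because loading `MASS` would mask `dplyr::select`):

```{r}
#| eval: false
# Hedged sketch: Negative Binomial refit with the same predictors, which
# estimates a dispersion (theta) parameter the Poisson model lacks
model_nb <- MASS::glm.nb(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day + dist_to_center + rolling7,
  data = train
)
summary(model_nb)
```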
## Evaluation of Poisson Model (MAE and RMSE)
```{r}
test <- test %>%
mutate(
pred_pois = predict(model_poisson, newdata = test, type = "response")
)
# Calculate MAE for each model
mae_results_nn <- data.frame(
Model = c(
"2. + Temporal Lags",
"New Model. + Rolling 7-day average demand and Distance to Center City",
"New Poisson Model"
),
MAE = c(
mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred_n), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred_pois), na.rm = TRUE)
),
RMSE = c(
sqrt(mean((test$Trip_Count - test$pred2)^2, na.rm = TRUE)),
sqrt(mean((test$Trip_Count - test$pred_n)^2, na.rm = TRUE)),
sqrt(mean((test$Trip_Count - test$pred_pois)^2, na.rm = TRUE))
)
)
kable(mae_results_nn,
digits = 5,
caption = "Mean Absolute Error by Model (Test Set)",
col.names = c("Model", "MAE (trips)", "RMSE (trips)")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
Between the two valid models, the New Poisson Model has the lower MAE, so it performs slightly better on average, while the rolling 7-day average model has the lower RMSE, suggesting fewer large prediction errors. Overall, the Poisson model is slightly better in average accuracy, but the rolling-average model appears more stable.
# Critical Reflection
Overall, the final model performance is reasonably strong, with MAE values generally below one trip, but the remaining error is still large enough that Indego should be cautious about relying on the model alone for operational decisions. Adding temporal lags clearly improved prediction accuracy, and the newer models with rolling demand and a Poisson specification performed competitively: the Poisson model showed slightly better average accuracy, while the rolling-average model appeared somewhat more robust to large errors. At the same time, the comparisons suggest that bike-share demand remains harder to predict in higher-demand periods, especially in summer, when ridership is more variable. In practice, the model is useful as a decision-support tool for identifying broad demand patterns and guiding planning priorities, but it should not be the sole basis for real-time rebalancing or hourly truck dispatch. Live system monitoring, staff judgment, and awareness of unusual conditions should still play an important role.
The analysis also shows that the model cannot fully capture all sources of variation in ridership. Factors such as holidays, special events, sudden weather shifts, and other short-term disruptions are only partially reflected in the current specification, even after adding weather, lags, and demographic variables. The seasonal comparison also suggests that relationships are not equally stable across time, since prediction error tends to rise in Q3 relative to Q1. With more time and data, the model could be improved by adding richer temporal and spatial predictors, such as event calendars, more detailed built-environment and transit accessibility measures, longer ridership histories, and possibly more flexible modeling approaches that can better capture nonlinear or changing demand patterns.
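As one concrete example of the richer temporal predictors described above, a holiday indicator could be joined onto the panel. A minimal sketch, with an illustrative (not exhaustive or year-matched) holiday list:

```{r}
#| eval: false
# Hedged sketch: flag holidays as a candidate predictor (dates illustrative)
holiday_dates <- as.Date(c("2023-01-01", "2023-07-04", "2023-09-04", "2023-12-25"))
study_panel_complete <- study_panel_complete %>%
  mutate(is_holiday = as.integer(as.Date(date) %in% holiday_dates))
```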
From an equity perspective, these modeling limits matter because prediction errors are not just statistical problems; they can translate into uneven service quality across places and groups of riders. If the model is less accurate in high-demand or highly variable areas, then relying too heavily on it could unintentionally reinforce service gaps by sending too few bikes where demand is strongest or by misallocating resources across neighborhoods. To reduce this risk, Indego should regularly evaluate model performance across different station areas and community contexts, not just overall accuracy. Any operational use of the model should also be paired with equity goals, such as minimum service standards or explicit checks on whether prediction errors are disproportionately affecting particular neighborhoods. Used this way, the model can support more efficient planning while still protecting fair access to the bike-share system.
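The station-level error check proposed above can be operationalized directly. A minimal sketch, assuming the `test` frame with the `pred_pois` column from the evaluation chunks (`station_mae` is a hypothetical name):

```{r}
#| eval: false
# Hedged sketch: per-station MAE to surface where predictions are weakest,
# which could then be mapped against neighborhood and demographic context
station_mae <- test %>%
  group_by(start_station) %>%
  summarise(MAE = mean(abs(Trip_Count - pred_pois), na.rm = TRUE),
            .groups = "drop") %>%
  arrange(desc(MAE))
head(station_mae, 10)  # stations with the largest average errors
```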