Bikeshare Space-Time Prediction

CPLN 5080 - Spring 2026

Author

Coco Zhou

Published

April 14, 2026

Introduction

This report compares Q1 and Q3 ridership patterns, with particular attention to Q3 as the higher-demand season. I chose these quarters to examine how model performance changes across seasonal conditions, specifically between a lower-demand winter period and a busier summer period. This comparison matters because bike-share demand is not constant throughout the year: summer ridership tends to be higher and more variable, while winter demand is generally lower and more stable. By analyzing both quarters, the report can assess whether the models perform consistently across seasons and whether some periods are harder to predict than others.
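The claim that summer demand is both higher and more variable can be checked directly by comparing daily-trip means and coefficients of variation across the two quarters. A minimal sketch, assuming a Q1 trip file named `data/indego-trips-2025-q1.csv` with the same schema (that filename is an assumption, not confirmed here):

```r
library(tidyverse)
library(lubridate)

# Hypothetical helper: daily-trip level and variability for one quarter
quarter_summary <- function(path, label) {
  read_csv(path) %>%
    mutate(date = as.Date(mdy_hm(start_time))) %>%
    count(date, name = "trips") %>%
    summarize(
      quarter    = label,
      mean_daily = mean(trips),
      sd_daily   = sd(trips),
      cv         = sd_daily / mean_daily  # coefficient of variation
    )
}

# A higher mean and higher CV in Q3 would support the seasonal claim
bind_rows(
  quarter_summary("data/indego-trips-2025-q1.csv", "Q1"),  # assumed filename
  quarter_summary("data/indego-trips-2025-q3.csv", "Q3")
)
```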

SET UP

Code
library(tidyverse)
library(lubridate)
library(sf)
library(tigris)
library(tidycensus)
library(riem)  # For Philadelphia weather from ASOS stations

library(viridis) # color scales for visualization
library(knitr)
library(kableExtra)
library(patchwork)


library(here)
options(scipen = 999) # Get rid of scientific notation to make it look tidier
Code
plotTheme <- theme_minimal() +
  theme(
    plot.title    = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10),
    plot.caption  = element_text(size = 8),
    axis.text.x   = element_text(size = 10, angle = 45, hjust = 1),
    axis.text.y   = element_text(size = 10),
    axis.title    = element_text(size = 11, face = "bold"),
    panel.grid.major = element_line(colour = "#D0D0D0", linewidth = 0.2),
    panel.grid.minor = element_blank(),
    axis.ticks    = element_blank(),
    legend.position = "right"
  )

mapTheme <- theme_void() +
  theme(
    plot.title    = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10),
    plot.caption  = element_text(size = 8),
    legend.position  = "right",
    plot.margin      = margin(1, 1, 1, 1, "cm"),
    legend.key.height = unit(1, "cm"),
    legend.key.width  = unit(0.2, "cm")
  )

palette5 <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c")
Code
census_api_key(Sys.getenv("CENSUS_API_KEY"))

#Use the following code if you're not able to access the key in your environment
#census_api_key("yourkeyhere", install=TRUE, overwrite=TRUE)

IMPORT AND PREP DATA Q3

Load Indego Trip Data (Q3 2025)

Code
# Read Q3 2025 data
indego <- read_csv("data/indego-trips-2025-q3.csv")

# Quick look at the data
glimpse(indego)
Rows: 465,464
Columns: 15
$ trip_id             <dbl> 1203073573, 1203073759, 1203073753, 1203073877, 12…
$ duration            <dbl> 10, 19, 16, 19, 2, 6, 20, 4, 52, 3, 31, 9, 10, 9, …
$ start_time          <chr> "7/1/2025 0:06", "7/1/2025 0:11", "7/1/2025 0:13",…
$ end_time            <chr> "7/1/2025 0:16", "7/1/2025 0:30", "7/1/2025 0:29",…
$ start_station       <dbl> 3163, 3304, 3394, 3207, 3061, 3320, 3201, 3078, 32…
$ start_lat           <dbl> 39.94974, 39.94234, 39.92400, 39.95441, 39.95425, …
$ start_lon           <dbl> -75.18097, -75.15399, -75.16959, -75.19200, -75.17…
$ end_station         <dbl> 3374, 3315, 3394, 3368, 3156, 3320, 3196, 3235, 32…
$ end_lat             <dbl> 39.97280, 39.94235, 39.92400, 39.97647, 39.95381, …
$ end_lon             <dbl> -75.17971, -75.21138, -75.16959, -75.17362, -75.17…
$ bike_id             <chr> "31674", "29473", "27889", "31825", "25400", "3172…
$ plan_duration       <dbl> 30, 1, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30,…
$ trip_route_category <chr> "One Way", "One Way", "Round Trip", "One Way", "On…
$ passholder_type     <chr> "Indego30", "Walk-up", "Indego30", "Indego30", "In…
$ bike_type           <chr> "electric", "electric", "electric", "electric", "e…

Examine the Data Structure

Code
# How many trips?
cat("Total trips in Q3 2025:", nrow(indego), "\n")
Total trips in Q3 2025: 465464 
Code
# Date range
cat("Date range:", 
    format(min(mdy_hm(indego$start_time))), "to", 
    format(max(mdy_hm(indego$start_time))), "\n")  # format() keeps cat() from printing epoch seconds
Date range: 2025-07-01 00:06:00 to 2025-09-30 23:58:00 
Code
# How many unique stations?
cat("Unique start stations:", length(unique(indego$start_station)), "\n")
Unique start stations: 284 
Code
# Trip types
table(indego$trip_route_category)

   One Way Round Trip 
    434518      30946 
Code
# Passholder types
table(indego$passholder_type)

 Day Pass  Indego30 Indego365      NULL   Walk-up 
    18418    252897    157695         5     36449 
Code
# Bike types
table(indego$bike_type)

electric standard 
  299515   165949 

Create Time Bins

Code
indego <- indego %>%
  mutate(
    # Parse datetime
    start_datetime = mdy_hm(start_time),
    end_datetime = mdy_hm(end_time),
    
    # Create hourly bins
    interval60 = floor_date(start_datetime, unit = "hour"),
    
    # Extract time features
    week = week(interval60),
    month = month(interval60, label = TRUE),
    dotw = wday(interval60, label = TRUE),
    hour = hour(interval60),
    date = as.Date(interval60),
    
    # Create useful indicators
    weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
    rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
  )

# Look at temporal features
head(indego %>% select(start_datetime, interval60, week, dotw, hour, weekend))
# A tibble: 6 × 6
  start_datetime      interval60           week dotw   hour weekend
  <dttm>              <dttm>              <dbl> <ord> <int>   <dbl>
1 2025-07-01 00:06:00 2025-07-01 00:00:00    26 Tue       0       0
2 2025-07-01 00:11:00 2025-07-01 00:00:00    26 Tue       0       0
3 2025-07-01 00:13:00 2025-07-01 00:00:00    26 Tue       0       0
4 2025-07-01 00:16:00 2025-07-01 00:00:00    26 Tue       0       0
5 2025-07-01 00:16:00 2025-07-01 00:00:00    26 Tue       0       0
6 2025-07-01 00:18:00 2025-07-01 00:00:00    26 Tue       0       0

EDA OF INDEGO DATA

Trips Over Time

Code
# Daily trip counts
daily_trips <- indego %>%
  group_by(date) %>%
  summarize(trips = n())

ggplot(daily_trips, aes(x = date, y = trips)) +
  geom_line(color = "#3182bd", linewidth = 1) +
  geom_smooth(se = FALSE, color = "red", linetype = "dashed") +
  labs(
    title = "Indego Daily Ridership - Q3 2025",
    subtitle = "Summer demand patterns in Philadelphia",
    x = "Date",
    y = "Daily Trips",
    caption = "Source: Indego bike share"
  ) +
  plotTheme

During Summer 2025, Indego bike-share ridership in Philadelphia stayed relatively high and trended upward. Although daily trips fluctuated considerably, with regular peaks and dips that suggest a weekly cycle, the red dashed trend line shows ridership rising from just above 4,000 trips in early July to around 5,600 by late September. Overall, the graph points to growing bike-share demand over the summer despite short-term variation.

Code
fly_eagles_fly <- daily_trips %>% filter(date == "2025-09-01")

typical_boring_monday <- indego %>%
  filter(dotw == "Mon", date != "2025-09-01") %>%
  group_by(date) %>%
  summarize(trips = n()) %>%
  summarize(avg_monday_trips = mean(trips))

print(fly_eagles_fly)
# A tibble: 1 × 2
  date       trips
  <date>     <int>
1 2025-09-01  4664
Code
print(typical_boring_monday)
# A tibble: 1 × 1
  avg_monday_trips
             <dbl>
1            4962.

On September 1, 2025, Indego recorded 4,664 trips, compared with an average of 4,962 on other Mondays: about 298 trips, or roughly 6%, below the typical Monday level. The dip makes sense because September 1, 2025 was Labor Day, a federal holiday. Although holidays can increase recreational riding, they also suppress regular commuting trips, which likely pulled total ridership below the usual Monday average.
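The Labor Day dip suggests holidays behave differently from ordinary weekdays, which a holiday indicator could capture as a model feature. A minimal sketch, assuming a hand-maintained vector of 2025 holiday dates (the list below is illustrative, not exhaustive):

```r
library(tidyverse)

# Illustrative holiday list for Q3 2025 (assumption: maintained by hand)
holidays_2025 <- as.Date(c("2025-07-04", "2025-09-01"))  # July 4th, Labor Day

# Hypothetical feature: a holiday dummy alongside the existing weekend/rush_hour flags
indego <- indego %>%
  mutate(holiday = ifelse(date %in% holidays_2025, 1, 0))
```

A `holiday` term could then enter a regression the same way `weekend` does.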

Hourly Patterns

Code
# Average trips by hour and day type
hourly_patterns <- indego %>%
  group_by(hour, weekend) %>%
  summarize(avg_trips = n() / n_distinct(date)) %>%
  mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))

ggplot(hourly_patterns, aes(x = hour, y = avg_trips, color = day_type)) +
  geom_line(linewidth = 1.2) +
  scale_color_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
  labs(
    title = "Average Hourly Ridership Patterns",
    subtitle = "Clear commute patterns on weekdays",
    x = "Hour of Day",
    y = "Average Trips per Hour",
    color = "Day Type"
  ) +
  plotTheme

Top Stations

Code
# Most popular origin stations
top_stations <- indego %>%
  count(start_station, start_lat, start_lon, name = "trips") %>%
  arrange(desc(trips)) %>%
  head(20)

kable(top_stations, 
      caption = "Top 20 Indego Stations by Trip Origins",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Top 20 Indego Stations by Trip Origins
start_station start_lat start_lon trips
3,010 39.94711 -75.16618 7,716
3,032 39.94527 -75.17971 5,884
3,213 39.93887 -75.16663 5,660
3,163 39.94974 -75.18097 5,174
3,359 39.94888 -75.16978 4,681
3,185 39.95169 -75.15888 4,670
3,028 39.94061 -75.14958 4,607
3,020 39.94855 -75.19007 4,556
3,022 39.95472 -75.18323 4,475
3,066 39.94561 -75.17348 4,353
3,059 39.96244 -75.16121 4,265
3,063 39.94633 -75.16980 4,230
3,161 39.95486 -75.18091 4,223
3,101 39.94295 -75.15955 4,183
3,054 39.96250 -75.17420 4,117
3,030 39.93935 -75.15716 3,986
3,362 39.94816 -75.16226 3,965
3,061 39.95425 -75.17761 3,716
3,057 39.96439 -75.17987 3,668
3,296 39.95134 -75.16758 3,665

PUT DATA IN CONTEXT

Load Philadelphia Census Data

Code
# Get Philadelphia census tracts
philly_census <- get_acs(
  geography = "tract",
  variables = c(
    "B01003_001",  # Total population
    "B19013_001",  # Median household income
    "B08301_001",  # Total commuters
    "B08301_010",  # Commute by transit
    "B02001_002",  # White alone
    "B25077_001"   # Median home value
  ),
  state = "PA",
  county = "Philadelphia",
  year = 2022,
  geometry = TRUE,
  output = "wide"
) %>%
  rename(
    Total_Pop = B01003_001E,
    Med_Inc = B19013_001E,
    Total_Commuters = B08301_001E,
    Transit_Commuters = B08301_010E,
    White_Pop = B02001_002E,
    Med_Home_Value = B25077_001E
  ) %>%
  mutate(
    Percent_Taking_Transit = (Transit_Commuters / Total_Commuters) * 100,
    Percent_White = (White_Pop / Total_Pop) * 100
  ) %>%
  st_transform(crs = 4326)  #transforming to WGS84 for lat/lon matching, currently in NAD83

# Check the data
glimpse(philly_census)

Map Philadelphia Context

Code
# Map median income
ggplot() +
  geom_sf(data = philly_census, aes(fill = Med_Inc), color = NA) +
  scale_fill_viridis(
    option = "viridis",
    name = "Median\nIncome",
    labels = scales::dollar
  ) +
  labs(
    title = "Philadelphia Median Household Income by Census Tract",
    subtitle = "Context for understanding bike share demand patterns"
  ) +
  # Station locations (one point per station, not one per trip)
  geom_point(
    data = distinct(indego, start_station, start_lat, start_lon),
    aes(x = start_lon, y = start_lat),
    color = "red", size = 0.25, alpha = 0.6
  ) +
  mapTheme

Join Census Data to Stations

Code
# Create sf object for stations
stations_sf <- indego %>%
  distinct(start_station, start_lat, start_lon) %>%
  filter(!is.na(start_lat), !is.na(start_lon)) %>%
  st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326)

# Spatial join to get census tract for each station
stations_census <- st_join(stations_sf, philly_census, left = TRUE) %>%
  st_drop_geometry()

# Look at the result - investigate whether all of the stations joined to census data -- according to the map above there are stations in non-residential tracts.

stations_for_map <- indego %>%
  distinct(start_station, start_lat, start_lon) %>%
  filter(!is.na(start_lat), !is.na(start_lon)) %>%
  left_join(
    stations_census %>% select(start_station, Med_Inc),
    by = "start_station"
  ) %>%
  mutate(has_census = !is.na(Med_Inc))

# Add back to trip data
indego_census <- indego %>%
  left_join(
    stations_census %>% 
      select(start_station, Med_Inc, Percent_Taking_Transit, 
             Percent_White, Total_Pop),
    by = "start_station"
  )



# Create the map showing problem stations
ggplot() +
  geom_sf(data = philly_census, aes(fill = Med_Inc), color = "white", size = 0.1) +
  scale_fill_viridis(
    option = "viridis",
    name = "Median\nIncome",
    labels = scales::dollar,
    na.value = "grey90"
  ) +
  # Stations with census data (small grey dots)
  geom_point(
    data = stations_for_map %>% filter(has_census),
    aes(x = start_lon, y = start_lat),
    color = "grey30", size = 1, alpha = 0.6
  ) +
  # Stations WITHOUT census data (red X marks the spot)
  geom_point(
    data = stations_for_map %>% filter(!has_census),
    aes(x = start_lon, y = start_lat),
    color = "red", size = 1, shape = 4, stroke = 1.5
  ) +
  labs(
    title = "Philadelphia Median Household Income by Census Tract",
    subtitle = "Indego stations shown (RED = no census data match)",
    caption = "Red X marks indicate stations that didn't join to census tracts"
  ) +
  mapTheme

Dealing with missing data

Code
# Identify which stations to keep
valid_stations <- stations_census %>%
  filter(!is.na(Med_Inc)) %>%
  pull(start_station)

# Filter trip data to valid stations only
indego_census <- indego %>%
  filter(start_station %in% valid_stations) %>%
  left_join(
    stations_census %>% 
      select(start_station, Med_Inc, Percent_Taking_Transit, 
             Percent_White, Total_Pop),
    by = "start_station"
  )

Get Weather Data

Code
# Get weather from Philadelphia International Airport (KPHL)
# This covers Q3 2025: July 1 - September 30
weather_data <- riem_measures(
  station = "PHL",  # Philadelphia International Airport
  date_start = "2025-07-01",
  date_end = "2025-09-30"
)

# Process weather data
weather_processed <- weather_data %>%
  mutate(
    interval60 = floor_date(valid, unit = "hour"),
    Temperature = tmpf,  # Temperature in Fahrenheit
    Precipitation = ifelse(is.na(p01i), 0, p01i),  # Hourly precip in inches
    Wind_Speed = sknt  # Wind speed in knots
  ) %>%
  select(interval60, Temperature, Precipitation, Wind_Speed) %>%
  distinct()

# Check for missing hours and interpolate if needed
weather_complete <- weather_processed %>%
  complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
  fill(Temperature, Precipitation, Wind_Speed, .direction = "down")

# Look at the weather
summary(weather_complete %>% select(Temperature, Precipitation, Wind_Speed))
  Temperature    Precipitation        Wind_Speed   
 Min.   :57.00   Min.   :0.000000   Min.   : 0.00  
 1st Qu.:70.00   1st Qu.:0.000000   1st Qu.: 4.00  
 Median :76.00   Median :0.000000   Median : 6.00  
 Mean   :75.72   Mean   :0.008019   Mean   : 6.42  
 3rd Qu.:81.00   3rd Qu.:0.000000   3rd Qu.: 9.00  
 Max.   :98.00   Max.   :1.390000   Max.   :29.00  

Visualize Weather Patterns

Code
ggplot(weather_complete, aes(x = interval60, y = Temperature)) +
  geom_line(color = "#3182bd", alpha = 0.7) +
  geom_smooth(se = FALSE, color = "red") +
  labs(
    title = "Philadelphia Temperature - Q3 2025",
    subtitle = "Summer heat persisting into early fall",
    x = "Date",
    y = "Temperature (°F)"
  ) +
  plotTheme

CREATE SPACE-TIME PANEL

Aggregate Trips to Station-Hour Level

Code
# Count trips by station-hour
trips_panel <- indego_census %>%
  group_by(interval60, start_station, start_lat, start_lon,
           Med_Inc, Percent_Taking_Transit, Percent_White, Total_Pop) %>%
  summarize(Trip_Count = n()) %>%
  ungroup()

# How many station-hour observations?
nrow(trips_panel)
[1] 214178
Code
# How many unique stations?
length(unique(trips_panel$start_station))
[1] 263
Code
# How many unique hours?
length(unique(trips_panel$interval60))
[1] 2208

Create Complete Panel Structure

Code
# Calculate expected panel size
n_stations <- length(unique(trips_panel$start_station))
n_hours <- length(unique(trips_panel$interval60))
expected_rows <- n_stations * n_hours

cat("Expected panel rows:", format(expected_rows, big.mark = ","), "\n")
Expected panel rows: 580,704 
Code
cat("Current rows:", format(nrow(trips_panel), big.mark = ","), "\n")
Current rows: 214,178 
Code
cat("Missing rows:", format(expected_rows - nrow(trips_panel), big.mark = ","), "\n")
Missing rows: 366,526 
Code
# Create complete panel
study_panel <- expand.grid(
  interval60 = unique(trips_panel$interval60),
  start_station = unique(trips_panel$start_station)
) %>%
  # Join trip counts
  left_join(trips_panel, by = c("interval60", "start_station")) %>%
  # Replace NA trip counts with 0
  mutate(Trip_Count = replace_na(Trip_Count, 0))

# Fill in station attributes (they're the same for all hours)
station_attributes <- trips_panel %>%
  group_by(start_station) %>%
  summarize(
    start_lat = first(start_lat),
    start_lon = first(start_lon),
    Med_Inc = first(Med_Inc),
    Percent_Taking_Transit = first(Percent_Taking_Transit),
    Percent_White = first(Percent_White),
    Total_Pop = first(Total_Pop)
  )

study_panel <- study_panel %>%
  left_join(station_attributes, by = "start_station")

# Verify we have complete panel
cat("Complete panel rows:", format(nrow(study_panel), big.mark = ","), "\n")
Complete panel rows: 580,704 

Add Time Features

Code
study_panel <- study_panel %>%
  mutate(
    week = week(interval60),
    month = month(interval60, label = TRUE),
    dotw = wday(interval60, label = TRUE),
    hour = hour(interval60),
    date = as.Date(interval60),
    weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
    rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
  )

Join Weather Data

Code
study_panel <- study_panel %>%
  left_join(weather_complete, by = "interval60")

# Check for missing values
summary(study_panel %>% select(Trip_Count, Temperature, Precipitation))
   Trip_Count       Temperature    Precipitation  
 Min.   : 0.0000   Min.   :57.00   Min.   :0.000  
 1st Qu.: 0.0000   1st Qu.:70.00   1st Qu.:0.000  
 Median : 0.0000   Median :76.00   Median :0.000  
 Mean   : 0.7236   Mean   :75.72   Mean   :0.008  
 3rd Qu.: 1.0000   3rd Qu.:81.00   3rd Qu.:0.000  
 Max.   :22.0000   Max.   :98.00   Max.   :1.390  
                   NA's   :6312    NA's   :6312   
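The 6,312 NAs equal 263 stations × 24 hours, which points at one uncovered day: the `riem_measures()` call used `date_end = "2025-09-30"`, so September 30 itself appears to be missing from the weather series after the hourly join (this diagnosis is inferred from the counts, not confirmed). One hedged fix is to extend the completed weather sequence through the panel's last hour and forward-fill:

```r
library(tidyverse)

# Extend the weather series to cover every hour present in the panel,
# then carry the last observation forward into the uncovered tail
weather_extended <- weather_complete %>%
  complete(interval60 = seq(min(interval60),
                            max(study_panel$interval60),
                            by = "hour")) %>%
  fill(Temperature, Precipitation, Wind_Speed, .direction = "down")

# Re-join; Temperature and Precipitation should then have no NAs
study_panel <- study_panel %>%
  select(-Temperature, -Precipitation, -Wind_Speed) %>%
  left_join(weather_extended, by = "interval60")
```

Re-running the original query with `date_end = "2025-10-01"` would be the cleaner fix if the API treats the end date as exclusive.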

CREATE TEMPORAL LAG VARIABLES

Code
# Sort by station and time
study_panel <- study_panel %>%
  arrange(start_station, interval60)

# Create lag variables WITHIN each station
study_panel <- study_panel %>%
  group_by(start_station) %>%
  mutate(
    lag1Hour = lag(Trip_Count, 1),
    lag2Hours = lag(Trip_Count, 2),
    lag3Hours = lag(Trip_Count, 3),
    lag12Hours = lag(Trip_Count, 12),
    lag1day = lag(Trip_Count, 24)
  ) %>%
  ungroup()

# Remove rows with NA lags (first 24 hours for each station)
study_panel_complete <- study_panel %>%
  filter(!is.na(lag1day))

cat("Rows after removing NA lags:", format(nrow(study_panel_complete), big.mark = ","), "\n")
Rows after removing NA lags: 574,392 

Visualize Lag Correlations

Code
# Sample one station to visualize
example_station <- study_panel_complete %>%
  filter(start_station == first(start_station)) %>%
  head(168)  # One week

# Plot actual vs lagged demand
ggplot(example_station, aes(x = interval60)) +
  geom_line(aes(y = Trip_Count, color = "Current"), linewidth = 1) +
  geom_line(aes(y = lag1Hour, color = "1 Hour Ago"), linewidth = 1, alpha = 0.7) +
  geom_line(aes(y = lag1day, color = "24 Hours Ago"), linewidth = 1, alpha = 0.7) +
  scale_color_manual(values = c(
    "Current" = "#08519c",
    "1 Hour Ago" = "#3182bd",
    "24 Hours Ago" = "#6baed6"
  )) +
  labs(
    title = "Temporal Lag Patterns at One Station",
    subtitle = "Past demand predicts future demand",
    x = "Date-Time",
    y = "Trip Count",
    color = "Time Period"
  ) +
  plotTheme
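The plot suggests strong serial dependence; the same idea can be quantified with simple correlations between current demand and its lags (a sketch over the full panel, not just the example station):

```r
library(tidyverse)

# Pearson correlations between demand and its temporal lags
study_panel_complete %>%
  summarize(
    cor_lag1  = cor(Trip_Count, lag1Hour,  use = "complete.obs"),
    cor_lag3  = cor(Trip_Count, lag3Hours, use = "complete.obs"),
    cor_lag24 = cor(Trip_Count, lag1day,   use = "complete.obs")
  )
```

A stronger correlation for `lag1Hour` than for `lag3Hours` would be consistent with the relative size of the lag coefficients in the regressions.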

TEMPORAL TRAIN/TEST SPLIT

Code
# Split by week
# Q3 spans weeks 26-40 (July-Sept)
# Train on weeks 26-35 
# Test on weeks 36-40 

# Which stations have trips in BOTH early and late periods?
early_stations <- study_panel_complete %>%
  filter(week < 36) %>%
  filter(Trip_Count > 0) %>%
  distinct(start_station) %>%
  pull(start_station)

late_stations <- study_panel_complete %>%
  filter(week >= 36) %>%
  filter(Trip_Count > 0) %>%
  distinct(start_station) %>%
  pull(start_station)

# Keep only stations that appear in BOTH periods
common_stations <- intersect(early_stations, late_stations)


# Filter panel to only common stations
study_panel_complete <- study_panel_complete %>%
  filter(start_station %in% common_stations)

# NOW create train/test split
train <- study_panel_complete %>%
  filter(week < 36)

test <- study_panel_complete %>%
  filter(week >= 36)

cat("Training observations:", format(nrow(train), big.mark = ","), "\n")
Training observations: 501,541 

BUILD PREDICTIVE MODELS

Model 1: Baseline (Time + Weather)

Code
# Create day of week factor with treatment (dummy) coding
train <- train %>%
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

# Set contrasts to treatment coding (dummy variables)
contrasts(train$dotw_simple) <- contr.treatment(7)

# Now run the model
model1 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation,
  data = train
)

summary(model1)

Call:
lm(formula = Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + 
    Precipitation, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7125 -0.7561 -0.2121  0.2110 21.1067 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)        0.4115692  0.0261453  15.742 < 0.0000000000000002 ***
as.factor(hour)1  -0.0540262  0.0117895  -4.583    0.000004593990039 ***
as.factor(hour)2  -0.1261545  0.0119723 -10.537 < 0.0000000000000002 ***
as.factor(hour)3  -0.1774976  0.0119619 -14.839 < 0.0000000000000002 ***
as.factor(hour)4  -0.1673862  0.0118130 -14.170 < 0.0000000000000002 ***
as.factor(hour)5  -0.0559641  0.0121381  -4.611    0.000004015439965 ***
as.factor(hour)6   0.1840261  0.0118983  15.467 < 0.0000000000000002 ***
as.factor(hour)7   0.4500867  0.0118841  37.873 < 0.0000000000000002 ***
as.factor(hour)8   0.9366676  0.0118247  79.213 < 0.0000000000000002 ***
as.factor(hour)9   0.6136735  0.0119565  51.326 < 0.0000000000000002 ***
as.factor(hour)10  0.4823405  0.0118642  40.655 < 0.0000000000000002 ***
as.factor(hour)11  0.5258989  0.0114750  45.830 < 0.0000000000000002 ***
as.factor(hour)12  0.6141261  0.0116979  52.499 < 0.0000000000000002 ***
as.factor(hour)13  0.6775067  0.0118321  57.260 < 0.0000000000000002 ***
as.factor(hour)14  0.7451786  0.0116539  63.942 < 0.0000000000000002 ***
as.factor(hour)15  0.8546126  0.0116993  73.048 < 0.0000000000000002 ***
as.factor(hour)16  1.0712931  0.0119723  89.481 < 0.0000000000000002 ***
as.factor(hour)17  1.4012597  0.0120098 116.676 < 0.0000000000000002 ***
as.factor(hour)18  1.0581635  0.0120162  88.061 < 0.0000000000000002 ***
as.factor(hour)19  0.8235956  0.0120580  68.303 < 0.0000000000000002 ***
as.factor(hour)20  0.5860658  0.0119470  49.055 < 0.0000000000000002 ***
as.factor(hour)21  0.4237912  0.0117910  35.942 < 0.0000000000000002 ***
as.factor(hour)22  0.2860782  0.0116010  24.660 < 0.0000000000000002 ***
as.factor(hour)23  0.1313521  0.0116223  11.302 < 0.0000000000000002 ***
dotw_simple2       0.0465693  0.0064074   7.268    0.000000000000365 ***
dotw_simple3       0.0263717  0.0064389   4.096    0.000042094602228 ***
dotw_simple4       0.0422632  0.0064500   6.552    0.000000000056646 ***
dotw_simple5       0.0876232  0.0067093  13.060 < 0.0000000000000002 ***
dotw_simple6       0.0063724  0.0068308   0.933                0.351    
dotw_simple7      -0.0651120  0.0066560  -9.782 < 0.0000000000000002 ***
Temperature       -0.0026108  0.0003125  -8.355 < 0.0000000000000002 ***
Precipitation     -0.4063579  0.0264563 -15.360 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.229 on 501509 degrees of freedom
Multiple R-squared:  0.1062,    Adjusted R-squared:  0.1062 
F-statistic:  1923 on 31 and 501509 DF,  p-value: < 0.00000000000000022

Model 2: Add Temporal Lags

Code
model2 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day,
  data = train
)

summary(model2)

Call:
lm(formula = Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + 
    Precipitation + lag1Hour + lag3Hours + lag1day, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.1305 -0.4719 -0.1282  0.1603 17.3967 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)        0.1932603  0.0222592   8.682 < 0.0000000000000002 ***
as.factor(hour)1  -0.0107791  0.0100346  -1.074             0.282737    
as.factor(hour)2  -0.0230237  0.0101934  -2.259             0.023902 *  
as.factor(hour)3  -0.0418173  0.0101883  -4.104         0.0000405372 ***
as.factor(hour)4  -0.0135448  0.0100669  -1.345             0.178474    
as.factor(hour)5   0.0853365  0.0103467   8.248 < 0.0000000000000002 ***
as.factor(hour)6   0.2572424  0.0101459  25.354 < 0.0000000000000002 ***
as.factor(hour)7   0.3976395  0.0101391  39.219 < 0.0000000000000002 ***
as.factor(hour)8   0.6977846  0.0101001  69.087 < 0.0000000000000002 ***
as.factor(hour)9   0.2417010  0.0102189  23.652 < 0.0000000000000002 ***
as.factor(hour)10  0.1866410  0.0101214  18.440 < 0.0000000000000002 ***
as.factor(hour)11  0.2709541  0.0097863  27.687 < 0.0000000000000002 ***
as.factor(hour)12  0.3363633  0.0099792  33.706 < 0.0000000000000002 ***
as.factor(hour)13  0.3735371  0.0100963  36.997 < 0.0000000000000002 ***
as.factor(hour)14  0.4308988  0.0099455  43.326 < 0.0000000000000002 ***
as.factor(hour)15  0.5046635  0.0099906  50.514 < 0.0000000000000002 ***
as.factor(hour)16  0.6477453  0.0102364  63.279 < 0.0000000000000002 ***
as.factor(hour)17  0.8735161  0.0102958  84.842 < 0.0000000000000002 ***
as.factor(hour)18  0.4671195  0.0103277  45.230 < 0.0000000000000002 ***
as.factor(hour)19  0.3485184  0.0103319  33.732 < 0.0000000000000002 ***
as.factor(hour)20  0.1725713  0.0102382  16.856 < 0.0000000000000002 ***
as.factor(hour)21  0.1582020  0.0100713  15.708 < 0.0000000000000002 ***
as.factor(hour)22  0.1177674  0.0098897  11.908 < 0.0000000000000002 ***
as.factor(hour)23  0.0555991  0.0098945   5.619         0.0000000192 ***
dotw_simple2       0.0187953  0.0054539   3.446             0.000569 ***
dotw_simple3       0.0001545  0.0054807   0.028             0.977511    
dotw_simple4       0.0125519  0.0054908   2.286             0.022255 *  
dotw_simple5       0.0210761  0.0057149   3.688             0.000226 ***
dotw_simple6      -0.0264412  0.0058176  -4.545         0.0000054930 ***
dotw_simple7      -0.0480896  0.0056659  -8.488 < 0.0000000000000002 ***
Temperature       -0.0029854  0.0002660 -11.222 < 0.0000000000000002 ***
Precipitation     -0.0757371  0.0225517  -3.358             0.000784 ***
lag1Hour           0.4045613  0.0013124 308.268 < 0.0000000000000002 ***
lag3Hours          0.1268892  0.0012904  98.337 < 0.0000000000000002 ***
lag1day            0.1383590  0.0011966 115.623 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.046 on 501506 degrees of freedom
Multiple R-squared:  0.3526,    Adjusted R-squared:  0.3525 
F-statistic:  8032 on 34 and 501506 DF,  p-value: < 0.00000000000000022

Model 3: Add Demographics

Code
# The station attributes were joined twice upstream, leaving .x/.y suffixed
# duplicates; coalesce them back into single columns before modeling
train <- train %>%
  mutate(
    Med_Inc = coalesce(Med_Inc.x, Med_Inc.y),
    Percent_Taking_Transit = coalesce(Percent_Taking_Transit.x, Percent_Taking_Transit.y),
    Percent_White = coalesce(Percent_White.x, Percent_White.y)
  )

model3 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc + Percent_Taking_Transit + Percent_White,
  data = train
)

summary(model3)

Call:
lm(formula = Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + 
    Precipitation + lag1Hour + lag3Hours + lag1day + Med_Inc + 
    Percent_Taking_Transit + Percent_White, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.9347 -0.4856 -0.1343  0.2060 17.2849 

Coefficients:
                             Estimate     Std. Error t value
(Intercept)             0.07035865497  0.02296933347   3.063
as.factor(hour)1       -0.01263550808  0.00999293248  -1.264
as.factor(hour)2       -0.02776401437  0.01015121714  -2.735
as.factor(hour)3       -0.04817183994  0.01014641982  -4.748
as.factor(hour)4       -0.02048830240  0.01002564380  -2.044
as.factor(hour)5        0.07908923666  0.01030411404   7.676
as.factor(hour)6        0.25339228067  0.01010388679  25.079
as.factor(hour)7        0.39875573855  0.01009690120  39.493
as.factor(hour)8        0.70588394474  0.01005881343  70.176
as.factor(hour)9        0.25598622952  0.01017877653  25.149
as.factor(hour)10       0.20073214358  0.01008168255  19.911
as.factor(hour)11       0.28384981282  0.00974762163  29.120
as.factor(hour)12       0.34945565388  0.00993976595  35.157
as.factor(hour)13       0.38692281122  0.01005643789  38.475
as.factor(hour)14       0.44397248351  0.00990618549  44.818
as.factor(hour)15       0.51898813825  0.00995148772  52.152
as.factor(hour)16       0.66562512447  0.01019751703  65.273
as.factor(hour)17       0.89487190310  0.01025829554  87.234
as.factor(hour)18       0.48980934212  0.01029071438  47.597
as.factor(hour)19       0.36801774723  0.01029335492  35.753
as.factor(hour)20       0.19178844146  0.01019993206  18.803
as.factor(hour)21       0.17034684079  0.01003111238  16.982
as.factor(hour)22       0.12549360840  0.00984930693  12.741
as.factor(hour)23       0.05907603346  0.00985352962   5.995
dotw_simple2            0.02023183401  0.00543127992   3.725
dotw_simple3            0.00165536737  0.00545793226   0.303
dotw_simple4            0.01433619693  0.00546800575   2.622
dotw_simple5            0.02496157892  0.00569143075   4.386
dotw_simple6           -0.02384996084  0.00579349646  -4.117
dotw_simple7           -0.04836415603  0.00564234023  -8.572
Temperature            -0.00292091392  0.00026492510 -11.025
Precipitation          -0.09084664977  0.02245911134  -4.045
lag1Hour                0.39379990808  0.00131741309 298.919
lag3Hours               0.11539088890  0.00129717196  88.956
lag1day                 0.12631165617  0.00120607128 104.730
Med_Inc                 0.00000118242  0.00000005558  21.273
Percent_Taking_Transit -0.00246847307  0.00016925879 -14.584
Percent_White           0.00174451828  0.00008498725  20.527
                                   Pr(>|t|)    
(Intercept)                        0.002190 ** 
as.factor(hour)1                   0.206071    
as.factor(hour)2                   0.006237 ** 
as.factor(hour)3         0.0000020583139241 ***
as.factor(hour)4                   0.040995 *  
as.factor(hour)5         0.0000000000000165 ***
as.factor(hour)6       < 0.0000000000000002 ***
as.factor(hour)7       < 0.0000000000000002 ***
as.factor(hour)8       < 0.0000000000000002 ***
as.factor(hour)9       < 0.0000000000000002 ***
as.factor(hour)10      < 0.0000000000000002 ***
as.factor(hour)11      < 0.0000000000000002 ***
as.factor(hour)12      < 0.0000000000000002 ***
as.factor(hour)13      < 0.0000000000000002 ***
as.factor(hour)14      < 0.0000000000000002 ***
as.factor(hour)15      < 0.0000000000000002 ***
as.factor(hour)16      < 0.0000000000000002 ***
as.factor(hour)17      < 0.0000000000000002 ***
as.factor(hour)18      < 0.0000000000000002 ***
as.factor(hour)19      < 0.0000000000000002 ***
as.factor(hour)20      < 0.0000000000000002 ***
as.factor(hour)21      < 0.0000000000000002 ***
as.factor(hour)22      < 0.0000000000000002 ***
as.factor(hour)23        0.0000000020310006 ***
dotw_simple2                       0.000195 ***
dotw_simple3                       0.761665    
dotw_simple4                       0.008746 ** 
dotw_simple5             0.0000115574245056 ***
dotw_simple6             0.0000384433479579 ***
dotw_simple7           < 0.0000000000000002 ***
Temperature            < 0.0000000000000002 ***
Precipitation            0.0000523354751391 ***
lag1Hour               < 0.0000000000000002 ***
lag3Hours              < 0.0000000000000002 ***
lag1day                < 0.0000000000000002 ***
Med_Inc                < 0.0000000000000002 ***
Percent_Taking_Transit < 0.0000000000000002 ***
Percent_White          < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.042 on 501503 degrees of freedom
Multiple R-squared:  0.3579,    Adjusted R-squared:  0.3579 
F-statistic:  7557 on 37 and 501503 DF,  p-value: < 0.00000000000000022

Model 4: Add Station Fixed Effects

Code
model4 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc + Percent_Taking_Transit + Percent_White +
    as.factor(start_station),
  data = train
)

# Summary too long with all station dummies, just show key metrics
cat("Model 4 R-squared:", summary(model4)$r.squared, "\n")
Model 4 R-squared: 0.3835545 
Code
cat("Model 4 Adj R-squared:", summary(model4)$adj.r.squared, "\n")
Model 4 Adj R-squared: 0.3831905 

Model 5: Add Rush Hour Interaction

Code
model5 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day + rush_hour + as.factor(month) +
    Med_Inc + Percent_Taking_Transit + Percent_White +
    as.factor(start_station) +
    rush_hour * weekend,  # Rush hour effects different on weekends
  data = train
)

cat("Model 5 R-squared:", summary(model5)$r.squared, "\n")
Model 5 R-squared: 0.3868177 
Code
cat("Model 5 Adj R-squared:", summary(model5)$adj.r.squared, "\n")
Model 5 Adj R-squared: 0.386452 

MODEL EVALUATION

Calculate Predictions and MAE

Code
# Get predictions on test set

# Create day of week factor with treatment (dummy) coding
test <- test %>%
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

test <- test %>%
  mutate(
    Med_Inc = coalesce(Med_Inc.x, Med_Inc.y),
    Percent_Taking_Transit = coalesce(Percent_Taking_Transit.x, Percent_Taking_Transit.y),
    Percent_White = coalesce(Percent_White.x, Percent_White.y)
  )
# Set contrasts to treatment coding (dummy variables)
contrasts(test$dotw_simple) <- contr.treatment(7)

test <- test %>%
  mutate(
    pred1 = predict(model1, newdata = test),
    pred2 = predict(model2, newdata = test),
    pred3 = predict(model3, newdata = test),
    pred4 = predict(model4, newdata = test),
    pred5 = predict(model5, newdata = test)
  )

# Calculate MAE for each model
mae_results <- data.frame(
  Model = c(
    "1. Time + Weather",
    "2. + Temporal Lags",
    "3. + Demographics",
    "4. + Station FE",
    "5. + Rush Hour Interaction"
  ),
  MAE = c(
    mean(abs(test$Trip_Count - test$pred1), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred3), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred4), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred5), na.rm = TRUE)
  )
)

kable(mae_results, 
      digits = 2,
      caption = "Mean Absolute Error by Model (Test Set)",
      col.names = c("Model", "MAE (trips)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Mean Absolute Error by Model (Test Set)
Model MAE (trips)
1. Time + Weather 0.84
2. + Temporal Lags 0.71
3. + Demographics 0.71
4. + Station FE 0.71
5. + Rush Hour Interaction 0.71
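Since the same MAE expression is repeated for every model, a small helper makes the comparison easier to extend (a sketch; the `mae()` function name is mine, not part of the assignment code):

```r
# Mean absolute error between observed and predicted trip counts,
# ignoring rows with missing predictions
mae <- function(observed, predicted) {
  mean(abs(observed - predicted), na.rm = TRUE)
}

mae(c(3, 5, 10), c(2, 4, 6))  # errors of 1, 1, 4 -> MAE of 2
```

With it, each table entry reduces to `mae(test$Trip_Count, test$pred1)` and so on.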

Visualize Model Comparison

Code
ggplot(mae_results, aes(x = reorder(Model, -MAE), y = MAE)) +
  geom_col(fill = "#3182bd", alpha = 0.8) +
  geom_text(aes(label = round(MAE, 2)), vjust = -0.5) +
  labs(
    title = "Model Performance Comparison",
    subtitle = "Lower MAE = Better Predictions",
    x = "Model",
    y = "Mean Absolute Error (trips)"
  ) +
  plotTheme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Compare Results to Q1 2025

MAE Comparison

Code
## MAE comparison
q1_mae <- c(0.60, 0.50, 0.74, 0.73, 0.73) ## Metrics from in-class exercise

mae_compare <- data.frame(
  Model  = mae_results$Model,
  Q1_MAE = q1_mae,
  Q3_MAE = mae_results$MAE   
)

kable(
  mae_compare,
  digits  = 2,
  caption = "Mean Absolute Error by Model: Q1 vs Q3 (Test Set)",
  col.names = c("Model", "Q1_MAE (trips)", "Q3_MAE (trips)")
) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Mean Absolute Error by Model: Q1 vs Q3 (Test Set)
Model Q1_MAE (trips) Q3_MAE (trips)
1. Time + Weather 0.60 0.84
2. + Temporal Lags 0.50 0.71
3. + Demographics 0.74 0.71
4. + Station FE 0.73 0.71
5. + Rush Hour Interaction 0.73 0.71

Overall, Model 2, which adds temporal lags, performs best because it has the lowest MAE in Q1 and remains tied for the lowest MAE in Q3. Model 1 has the highest error in both quarters, showing that time and weather alone are less effective at predicting ridership. The results also show that errors are generally higher in Q3 than in Q1, suggesting that summer bike-share demand is harder to predict than winter demand. This likely reflects greater variability in summer travel behavior, while the strong performance of Model 2 suggests that recent trip patterns remain one of the most useful predictors across seasons.
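One way to quantify the seasonal gap is the percent change in MAE from Q1 to Q3, computed directly from the table's values (a quick sketch):

```r
# MAE values copied from the comparison table above
q1 <- c(0.60, 0.50, 0.74, 0.73, 0.73)
q3 <- c(0.84, 0.71, 0.71, 0.71, 0.71)

# Percent change in MAE moving from winter (Q1) to summer (Q3)
pct_change <- round(100 * (q3 - q1) / q1, 1)
pct_change  # 40.0 42.0 -4.1 -2.7 -2.7
```

The two simplest specifications degrade by about 40 percent in summer, while the richer models hold steady or improve slightly.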

Temporal Patterns Comparison (summer vs. winter)

Code
## Compare Q3 (summer) and Q1 (winter) daily ridership
library(ggplot2)
library(png)
library(grid)
library(cowplot)

img1 <- png::readPNG("data/daily_ridership_Q3.png")
p_left <- grid::rasterGrob(img1, interpolate = TRUE)

img2 <- png::readPNG("data/daily_ridership_Q1.png")  
p_right <- grid::rasterGrob(img2, interpolate = TRUE)

cowplot::plot_grid(
  p_left,
  p_right,
  labels = c("Summer (Q3)", "Winter (Q1)"),  # one label per panel
  ncol = 2
)

Summer and winter show clearly different temporal ridership patterns. In summer, daily trips stay consistently high, with strong weekday–weekend fluctuations and a pattern that rises through mid-season before gradually declining toward October. In contrast, winter begins at much lower levels and shows a steady upward trend as temperatures warm, with ridership recovering from January lows and climbing sharply into March and April. Overall, summer displays high, stable, and highly active demand, while winter reflects low but steadily increasing activity, capturing a clear seasonal effect in bike-share usage.

Important Features

Code
# Use partial R-squared to identify the most important feature
library(rsq)
rsq.partial(model2)
$adjustment
[1] FALSE

$variable
[1] "as.factor(hour)" "dotw_simple"     "Temperature"     "Precipitation"  
[5] "lag1Hour"        "lag3Hours"       "lag1day"        

$partial.rsq
[1] 0.04559664069 0.00048407188 0.00025104419 0.00002248911 0.15930182252
[6] 0.01891740820 0.02596483603

In both Q1 and Q3, the one-hour lag (lag1Hour) is by far the most important predictor, followed by hour-of-day and the one-day lag. Weather and day-of-week contribute very little once temporal patterns and lags are included, and their independent explanatory power is even smaller in Q3.
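For a linear model, the partial R-squared that `rsq.partial()` reports can be reproduced by comparing residual sums of squares with and without the variable of interest. A minimal sketch on R's built-in `mtcars` data (not the bikeshare panel):

```r
# Partial R^2 of wt in mpg ~ wt + hp: the share of the reduced model's
# residual variation that adding wt explains
full    <- lm(mpg ~ wt + hp, data = mtcars)
reduced <- lm(mpg ~ hp, data = mtcars)

sse_full    <- sum(residuals(full)^2)
sse_reduced <- sum(residuals(reduced)^2)

partial_r2 <- (sse_reduced - sse_full) / sse_reduced
partial_r2
```

This is why `lag1Hour` dominates above: dropping it would inflate the residual sum of squares far more than dropping any other predictor.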

SPACE-TIME ERROR ANALYSIS

Observed vs. Predicted

Code
test <- test %>%
  mutate(
    error = Trip_Count - pred2,            # best-performing model (swap in another pred column if needed)
    abs_error = abs(error),
    time_of_day = case_when(
      hour < 7 ~ "Overnight",
      hour >= 7 & hour < 10 ~ "AM Rush",
      hour >= 10 & hour < 15 ~ "Mid-Day",
      hour >= 15 & hour <= 18 ~ "PM Rush",
      hour > 18 ~ "Evening"
    )
  )

# Scatter plot by time and day type
ggplot(test, aes(x = Trip_Count, y = pred2)) +
  geom_point(alpha = 0.2, color = "#3182bd") +
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
  facet_grid(weekend ~ time_of_day) +
  labs(
    title = "Observed vs. Predicted Bike Trips",
    subtitle = "Model 2 performance by time period",
    x = "Observed Trips",
    y = "Predicted Trips",
    caption = "Red line = perfect predictions; Green line = actual model fit"
  ) +
  plotTheme

Spatial Error Patterns

Code
# Calculate MAE and average demand by station, keeping only stations with
# valid predictions and coordinates
station_errors <- test %>%
  filter(!is.na(pred2)) %>%
  group_by(start_station, start_lat.x, start_lon.y) %>%
  summarize(
    MAE = mean(abs_error, na.rm = TRUE),
    avg_demand = mean(Trip_Count, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(!is.na(start_lat.x), !is.na(start_lon.y))

## Create Two Maps Side-by-Side with Proper Legends

# Map 1: Prediction Errors
p1 <- ggplot() +
  geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.1) +
  geom_point(
    data = station_errors,
    aes(x = start_lon.y, y = start_lat.x, color = MAE),
    size = 0.8,
    alpha = 0.7
  ) +
  scale_color_viridis(
    option = "plasma",
    name = "MAE (trips)",
    direction = -1,
    breaks = c(0.5, 1.0, 1.5),
    labels = c("0.5", "1.0", "1.5")
  ) +
  labs(title = "Prediction Errors") +
  mapTheme +
  theme(
    legend.position = "bottom",
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 9),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
  ) +
  guides(color = guide_colorbar(
    barwidth = 12,
    barheight = 1,
    title.position = "top",
    title.hjust = 0.5
  ))

# Map 2: Average Demand  
p2 <- ggplot() +
  geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.1) +
  geom_point(
    data = station_errors,
    aes(x = start_lon.y, y = start_lat.x, color = avg_demand),
    size = 0.8,
    alpha = 0.7
  ) +
  scale_color_viridis(
    option = "viridis",
    name = "Avg Demand (trips/hour)",
    direction = -1,
    breaks = c(0.5, 1.0, 1.5, 2.0, 2.5),
    labels = c("0.5", "1.0", "1.5", "2.0", "2.5")
  ) +
  labs(title = "Average Demand") +
  mapTheme +
  theme(
    legend.position = "bottom",
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 9),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
  ) +
  guides(color = guide_colorbar(
    barwidth = 12,
    barheight = 1,
    title.position = "top",
    title.hjust = 0.5
  ))

library(gridExtra)

# Combine
grid.arrange(
  p1, p2,
  ncol = 2
  )

Temporal Error Patterns

Code
# MAE by time of day and day type
temporal_errors <- test %>%
  group_by(time_of_day, weekend) %>%
  summarize(
    MAE = mean(abs_error, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))

ggplot(temporal_errors, aes(x = time_of_day, y = MAE, fill = day_type)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
  labs(
    title = "Prediction Errors by Time Period",
    subtitle = "When is the model struggling most?",
    x = "Time of Day",
    y = "Mean Absolute Error (trips)",
    fill = "Day Type"
  ) +
  plotTheme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Errors and Demographics

Code
# Join demographic data to station errors
station_errors_demo <- station_errors %>%
  left_join(
    station_attributes %>% select(start_station, Med_Inc, Percent_Taking_Transit, Percent_White),
    by = "start_station"
  ) %>%
  filter(!is.na(Med_Inc))

# Create plots
p1 <- ggplot(station_errors_demo, aes(x = Med_Inc, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Errors vs. Median Income", x = "Median Income", y = "MAE") +
  plotTheme

p2 <- ggplot(station_errors_demo, aes(x = Percent_Taking_Transit, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Errors vs. Transit Usage", x = "% Taking Transit", y = "MAE") +
  plotTheme

p3 <- ggplot(station_errors_demo, aes(x = Percent_White, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Errors vs. Race", x = "% White", y = "MAE") +
  plotTheme

(p1 | p2) / (p3 | plot_spacer())

Errors tend to be slightly higher in higher-income and higher-percent-White neighborhoods, and lower in areas with high transit usage, suggesting the model struggles more in wealthier, less transit-oriented areas. The equity implication is that model performance is not uniform: some communities may receive less accurate predictions than others, which could skew planning decisions if left unaddressed.

Feature Engineering

Spatial features

Code
library(geosphere)

## Distance to Center City

# Center City coordinate (Use Philadelphia City Hall)
center_lat <- 39.952800
center_lon <- -75.163500

# Calculate distance in kilometers
study_panel_complete <- study_panel_complete %>%
  mutate(
    dist_to_center = distHaversine(
      cbind(start_lon.y, start_lat.y),
      cbind(center_lon, center_lat)
    ) / 1000  # convert meters to km
  )
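As a sanity check on `distHaversine()`, the haversine formula is short enough to hand-roll in base R. The sketch below measures from City Hall to a hypothetical point about 2 km due north (note that geosphere's default Earth radius, 6378137 m, differs slightly from the 6371 km used here, so expect small discrepancies):

```r
# Great-circle (haversine) distance in kilometers between two lat/lon points
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# City Hall to a point 0.018 degrees of latitude north (~2 km)
haversine_km(39.9528, -75.1635, 39.9708, -75.1635)
```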

Trip history features

Code
## Rolling 7-day average demand
library(slider)

study_panel_complete <- study_panel_complete %>%
  arrange(start_station, date, hour) %>%
  group_by(start_station) %>%
  mutate(
    rolling7 = slide_dbl(
      Trip_Count,
      mean,
      .before = 168,    # window = current hour plus the 168 preceding hours
      .complete = TRUE  # return NA until a full window is available
    )
  ) %>%
  ungroup()
# Caveat: this window includes the current hour's Trip_Count, so rolling7
# partially encodes the outcome; lagging it by one hour would make it
# strictly out-of-sample.
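The slider window semantics can be cross-checked with base R's `stats::filter()`, which computes the same kind of trailing mean without extra packages (a toy sketch; `sides = 1` restricts the window to the current and preceding values):

```r
x <- c(2, 4, 6, 8, 10)

# Trailing mean over the current value and the two before it
# (same window as slide_dbl(x, mean, .before = 2, .complete = TRUE))
rolling3 <- as.numeric(stats::filter(x, rep(1 / 3, 3), sides = 1))
rolling3  # NA NA 4 6 8
```

Positions without a full three-value history come back `NA`, mirroring `.complete = TRUE`.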

Split train/test Dataset

Code
# Which stations have trips in BOTH early and late periods?
early_stations <- study_panel_complete %>%
  filter(week < 36) %>%
  filter(Trip_Count > 0) %>%
  distinct(start_station) %>%
  pull(start_station)

late_stations <- study_panel_complete %>%
  filter(week >= 36) %>%
  filter(Trip_Count > 0) %>%
  distinct(start_station) %>%
  pull(start_station)

# Keep only stations that appear in BOTH periods
common_stations <- intersect(early_stations, late_stations)


# Filter panel to only common stations
study_panel_complete <- study_panel_complete %>%
  filter(start_station %in% common_stations)

# NOW create train/test split
train <- study_panel_complete %>%
  filter(week < 36)

test <- study_panel_complete %>%
  filter(week >= 36)

cat("Training observations:", format(nrow(train), big.mark = ","), "\n")
Training observations: 501,541 
Code
cat("Testing observations:", format(nrow(test), big.mark = ","), "\n")
Testing observations: 207,244 
Code
cat("Training date range:", format(min(train$date)), "to", format(max(train$date)), "\n")  # format() keeps Dates readable; cat() alone prints raw day counts
Training date range: 2025-07-01 to 2025-09-02 
Code
cat("Testing date range:", format(min(test$date)), "to", format(max(test$date)), "\n")
Testing date range: 2025-09-03 to 2025-09-30 

Add Two New Predictors to the Best Model

Code
# Create day of week factor with treatment (dummy) coding
train <- train %>%
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

# Set contrasts to treatment coding (dummy variables)
contrasts(train$dotw_simple) <- contr.treatment(7)

model_n <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day + dist_to_center + rolling7,
  data = train
)

summary(model_n)

Call:
lm(formula = Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + 
    Precipitation + lag1Hour + lag3Hours + lag1day + dist_to_center + 
    rolling7, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9299 -0.5079 -0.1184  0.2803 17.6736 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)       -0.2646910  0.0233566 -11.333 < 0.0000000000000002 ***
as.factor(hour)1  -0.0215583  0.0105788  -2.038             0.041562 *  
as.factor(hour)2  -0.0468092  0.0106622  -4.390  0.00001132766890489 ***
as.factor(hour)3  -0.0666273  0.0106586  -6.251  0.00000000040806655 ***
as.factor(hour)4  -0.0353589  0.0104881  -3.371             0.000748 ***
as.factor(hour)5   0.0673682  0.0107866   6.246  0.00000000042271421 ***
as.factor(hour)6   0.2587318  0.0105329  24.564 < 0.0000000000000002 ***
as.factor(hour)7   0.4356190  0.0105591  41.255 < 0.0000000000000002 ***
as.factor(hour)8   0.7787979  0.0105195  74.034 < 0.0000000000000002 ***
as.factor(hour)9   0.3600817  0.0106951  33.668 < 0.0000000000000002 ***
as.factor(hour)10  0.2885503  0.0105564  27.334 < 0.0000000000000002 ***
as.factor(hour)11  0.3687875  0.0102336  36.037 < 0.0000000000000002 ***
as.factor(hour)12  0.4246509  0.0104091  40.796 < 0.0000000000000002 ***
as.factor(hour)13  0.4644507  0.0105663  43.956 < 0.0000000000000002 ***
as.factor(hour)14  0.5169114  0.0104380  49.522 < 0.0000000000000002 ***
as.factor(hour)15  0.5954390  0.0105133  56.637 < 0.0000000000000002 ***
as.factor(hour)16  0.7673669  0.0106708  71.913 < 0.0000000000000002 ***
as.factor(hour)17  1.0273652  0.0108907  94.334 < 0.0000000000000002 ***
as.factor(hour)18  0.6105871  0.0108756  56.143 < 0.0000000000000002 ***
as.factor(hour)19  0.4767052  0.0109722  43.447 < 0.0000000000000002 ***
as.factor(hour)20  0.2813512  0.0108853  25.847 < 0.0000000000000002 ***
as.factor(hour)21  0.2174694  0.0105602  20.593 < 0.0000000000000002 ***
as.factor(hour)22  0.1605793  0.0104308  15.395 < 0.0000000000000002 ***
as.factor(hour)23  0.0739143  0.0104373   7.082  0.00000000000142521 ***
dotw_simple2       0.0575659  0.0055434  10.385 < 0.0000000000000002 ***
dotw_simple3       0.0161450  0.0055902   2.888             0.003876 ** 
dotw_simple4       0.0360255  0.0055350   6.509  0.00000000007587620 ***
dotw_simple5       0.0463412  0.0058387   7.937  0.00000000000000208 ***
dotw_simple6      -0.0197771  0.0058929  -3.356             0.000791 ***
dotw_simple7      -0.0614182  0.0055898 -10.988 < 0.0000000000000002 ***
Temperature       -0.0006399  0.0002692  -2.377             0.017439 *  
Precipitation     -0.0857631  0.0225474  -3.804             0.000143 ***
lag1Hour           0.3346474  0.0014081 237.659 < 0.0000000000000002 ***
lag3Hours          0.0536807  0.0013994  38.361 < 0.0000000000000002 ***
lag1day            0.0682287  0.0013108  52.051 < 0.0000000000000002 ***
dist_to_center    -0.0042454  0.0010508  -4.040  0.00005344060129632 ***
rolling7           0.5257068  0.0041081 127.969 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.029 on 457320 degrees of freedom
  (44184 observations deleted due to missingness)
Multiple R-squared:  0.3849,    Adjusted R-squared:  0.3849 
F-statistic:  7951 on 36 and 457320 DF,  p-value: < 0.00000000000000022

New Model Evaluation and MAE

Code
# Get predictions on test set

# Create day of week factor with treatment (dummy) coding
test <- test %>%
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

# Set contrasts to treatment coding (dummy variables)
contrasts(test$dotw_simple) <- contr.treatment(7)

test <- test %>%
  mutate(
    pred_n = predict(model_n, newdata = test)
  )

# Calculate MAE and RMSE for each model
# (pred2 and pred_pois do not exist in the re-split test set at this point,
#  so those rows come out NaN)
mae_results_new <- data.frame(
  Model = c(
    "2. + Temporal Lags",
    "New Model. + Rolling 7-day average demand and Distance to Center City",
    "New Poisson Model"
  ),
  MAE = c(
    mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred_n), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred_pois), na.rm = TRUE)
  ),
   RMSE = c(
    sqrt(mean((test$Trip_Count - test$pred2)^2,  na.rm = TRUE)),
    sqrt(mean((test$Trip_Count - test$pred_n)^2, na.rm = TRUE)),
    sqrt(mean((test$Trip_Count - test$pred_pois)^2, na.rm = TRUE))
  )
)

kable(mae_results_new, 
      digits = 5,
      caption = "Mean Absolute Error by Model (Test Set)",
      col.names = c("Model", "MAE (trips)", "RMSE (trips)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Mean Absolute Error by Model (Test Set)
Model MAE (trips) RMSE (trips)
2. + Temporal Lags NaN NaN
New Model. + Rolling 7-day average demand and Distance to Center City 0.71904 1.12527
New Poisson Model NaN NaN

At this stage only the new OLS model can be evaluated. The test set was rebuilt for the new train/test split, so it carries neither the earlier pred2 column nor pred_pois (the Poisson model is fit in the next section), and both of those rows therefore evaluate to NaN. The full comparison of valid models appears after the Poisson model is trained.

Poisson model

Code
## Train a poisson model
model_poisson <- glm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day + dist_to_center + rolling7,
  family = poisson(link = "log"),
  data   = train
)

summary(model_poisson)

Call:
glm(formula = Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + 
    Precipitation + lag1Hour + lag3Hours + lag1day + dist_to_center + 
    rolling7, family = poisson(link = "log"), data = train)

Coefficients:
                      Estimate   Std. Error  z value             Pr(>|z|)    
(Intercept)       -1.597467267  0.028671578  -55.716 < 0.0000000000000002 ***
as.factor(hour)1  -0.297839161  0.023656495  -12.590 < 0.0000000000000002 ***
as.factor(hour)2  -0.703090169  0.027176906  -25.871 < 0.0000000000000002 ***
as.factor(hour)3  -1.282650758  0.033924069  -37.809 < 0.0000000000000002 ***
as.factor(hour)4  -1.077887323  0.030691237  -35.120 < 0.0000000000000002 ***
as.factor(hour)5  -0.127736134  0.023033554   -5.546         0.0000000293 ***
as.factor(hour)6   0.670210064  0.018691904   35.856 < 0.0000000000000002 ***
as.factor(hour)7   1.105948527  0.017382196   63.625 < 0.0000000000000002 ***
as.factor(hour)8   1.537352832  0.016474887   93.315 < 0.0000000000000002 ***
as.factor(hour)9   1.153985656  0.017059476   67.645 < 0.0000000000000002 ***
as.factor(hour)10  1.033953542  0.017297023   59.776 < 0.0000000000000002 ***
as.factor(hour)11  1.116417738  0.016942989   65.893 < 0.0000000000000002 ***
as.factor(hour)12  1.193333179  0.016894954   70.633 < 0.0000000000000002 ***
as.factor(hour)13  1.243489696  0.016875262   73.687 < 0.0000000000000002 ***
as.factor(hour)14  1.302069360  0.016687381   78.027 < 0.0000000000000002 ***
as.factor(hour)15  1.388750233  0.016583110   83.745 < 0.0000000000000002 ***
as.factor(hour)16  1.527543423  0.016440056   92.916 < 0.0000000000000002 ***
as.factor(hour)17  1.674794784  0.016313994  102.660 < 0.0000000000000002 ***
as.factor(hour)18  1.370316653  0.016669500   82.205 < 0.0000000000000002 ***
as.factor(hour)19  1.279863318  0.016901594   75.724 < 0.0000000000000002 ***
as.factor(hour)20  1.048668843  0.017370716   60.370 < 0.0000000000000002 ***
as.factor(hour)21  0.896386343  0.017567283   51.026 < 0.0000000000000002 ***
as.factor(hour)22  0.725936816  0.018094248   40.120 < 0.0000000000000002 ***
as.factor(hour)23  0.397661117  0.019247374   20.661 < 0.0000000000000002 ***
dotw_simple2       0.088399022  0.006398730   13.815 < 0.0000000000000002 ***
dotw_simple3       0.017801061  0.006537671    2.723              0.00647 ** 
dotw_simple4       0.056206234  0.006486085    8.666 < 0.0000000000000002 ***
dotw_simple5       0.098473411  0.006721049   14.651 < 0.0000000000000002 ***
dotw_simple6       0.000001129  0.006924753    0.000              0.99987    
dotw_simple7      -0.089631398  0.006807574  -13.166 < 0.0000000000000002 ***
Temperature       -0.000549945  0.000296382   -1.856              0.06352 .  
Precipitation     -1.000017321  0.055777627  -17.929 < 0.0000000000000002 ***
lag1Hour           0.129445057  0.000896418  144.403 < 0.0000000000000002 ***
lag3Hours          0.017797063  0.001083996   16.418 < 0.0000000000000002 ***
lag1day            0.020071443  0.000987413   20.327 < 0.0000000000000002 ***
dist_to_center    -0.169607202  0.001522073 -111.432 < 0.0000000000000002 ***
rolling7           0.540747606  0.003586507  150.773 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 810357  on 457356  degrees of freedom
Residual deviance: 478516  on 457320  degrees of freedom
  (44184 observations deleted due to missingness)
AIC: 876846

Number of Fisher Scoring iterations: 6
Code
# Calculate dispersion parameter
dispersion <- sum(residuals(model_poisson, type = "pearson")^2) / 
              model_poisson$df.residual

cat("Dispersion parameter:", round(dispersion, 2), "\n")
Dispersion parameter: 1.31 
Code
cat("Rule of thumb: >1.5 suggests overdispersion\n")
Rule of thumb: >1.5 suggests overdispersion
Code
if (dispersion > 1.5) {
  cat("⚠ Overdispersion detected! Consider Negative Binomial model.\n")
} else {
  cat("✓ Dispersion looks okay for Poisson model.\n")
}
✓ Dispersion looks okay for Poisson model.
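Had the dispersion check exceeded 1.5, `MASS::glm.nb()` would be the usual drop-in replacement for `glm(..., family = poisson)`. A minimal sketch on simulated overdispersed counts (everything here is illustrative, not drawn from the bikeshare data):

```r
library(MASS)  # ships with standard R installations

set.seed(123)
x <- rnorm(500)
# Simulate counts whose variance exceeds their mean (overdispersion)
y <- rnbinom(500, mu = exp(0.4 + 0.6 * x), size = 1.5)

fit_nb <- glm.nb(y ~ x)
coef(fit_nb)   # slope should land near the true value of 0.6
fit_nb$theta   # estimated size (dispersion) parameter
```

The negative binomial keeps the log link and coefficient interpretation but estimates an extra dispersion parameter, so standard errors are no longer understated when counts are overdispersed.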

Evaluation of Poisson Model (MAE and RMSE)

Code
test <- test %>%
  mutate(
    pred_pois = predict(model_poisson, newdata = test, type = "response")
  )

# Recompute MAE and RMSE for each model
# (pred_pois now exists; pred2 is still absent from the re-split test set,
#  so its row remains NaN)
mae_results_nn <- data.frame(
  Model = c(
    "2. + Temporal Lags",
    "New Model. + Rolling 7-day average demand and Distance to Center City",
    "New Poisson Model"
  ),
  MAE = c(
    mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred_n), na.rm = TRUE),
    mean(abs(test$Trip_Count - test$pred_pois), na.rm = TRUE)
  ),
   RMSE = c(
    sqrt(mean((test$Trip_Count - test$pred2)^2,  na.rm = TRUE)),
    sqrt(mean((test$Trip_Count - test$pred_n)^2, na.rm = TRUE)),
    sqrt(mean((test$Trip_Count - test$pred_pois)^2, na.rm = TRUE))
  )
)

kable(mae_results_nn, 
      digits = 5,
      caption = "Mean Absolute Error by Model (Test Set)",
      col.names = c("Model", "MAE (trips)", "RMSE (trips)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Mean Absolute Error by Model (Test Set)
Model MAE (trips) RMSE (trips)
2. + Temporal Lags NaN NaN
New Model. + Rolling 7-day average demand and Distance to Center City 0.71904 1.12527
New Poisson Model 0.70658 1.26648

Between the two valid models, the New Poisson Model has the lower MAE, so it performs slightly better on average, while the rolling 7-day average model has the lower RMSE, suggesting fewer large prediction errors. Overall, the Poisson model is slightly better in average accuracy, but the rolling-average model appears more stable.
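The split between the two metrics reflects how they weight errors; a toy example makes the difference concrete:

```r
# Two error vectors with the same MAE but different RMSE
steady <- c(1, 1, 1, 1)   # four moderate misses
spiky  <- c(0, 0, 0, 4)   # one large miss

c(MAE_steady  = mean(abs(steady)),      # 1
  MAE_spiky   = mean(abs(spiky)),       # 1
  RMSE_steady = sqrt(mean(steady^2)),   # 1
  RMSE_spiky  = sqrt(mean(spiky^2)))    # 2
```

RMSE doubles for the spiky errors while MAE is unchanged, which is why the rolling-average model's lower RMSE signals fewer large misses even though the Poisson model edges it out on MAE.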

Critical Reflection

Overall, the final model performance is reasonably strong, with MAE values generally below 1 trip, but the remaining error is still meaningful enough that Indego should be cautious about relying on the model alone for operational decisions. Across our results, adding temporal lags clearly improved prediction accuracy, and the newer models with rolling demand and Poisson specification performed competitively, with the Poisson model showing slightly better average accuracy and the rolling-average model appearing somewhat more stable against larger errors. At the same time, the model comparisons suggest that bike-share demand remains harder to predict in higher-demand periods, especially in summer, when ridership is more variable. In practice, this means the model is useful as a decision-support tool for identifying broad demand patterns and guiding planning priorities, but it should not be treated as the sole basis for real-time rebalancing or hourly truck dispatch. Live system monitoring, staff judgment, and awareness of unusual conditions should still play an important role.

The analysis also shows that the model cannot fully capture all sources of variation in ridership. Factors such as holidays, special events, sudden weather shifts, and other short-term disruptions are only partially reflected in the current specification, even after adding weather, lags, and demographic variables. The seasonal comparison also suggests that relationships are not equally stable across time, since prediction error tends to rise in Q3 relative to Q1. With more time and data, the model could be improved by adding richer temporal and spatial predictors, such as event calendars, more detailed built-environment and transit accessibility measures, longer ridership histories, and possibly more flexible modeling approaches that can better capture nonlinear or changing demand patterns.

From an equity perspective, these modeling limits matter because prediction errors are not just statistical problems; they can translate into uneven service quality across places and groups of riders. If the model is less accurate in high-demand or highly variable areas, then relying too heavily on it could unintentionally reinforce service gaps by sending too few bikes where demand is strongest or by misallocating resources across neighborhoods. To reduce this risk, Indego should regularly evaluate model performance across different station areas and community contexts, not just overall accuracy. Any operational use of the model should also be paired with equity goals, such as minimum service standards or explicit checks on whether prediction errors are disproportionately affecting particular neighborhoods. Used this way, the model can support more efficient planning while still protecting fair access to the bike-share system.