Lab 0: Getting Started with dplyr

My First Data Analysis

Author

Coco Zhou

Published

January 20, 2026

Overview

Welcome to your first lab! In this ungraded assignment, you’ll practice the fundamental dplyr operations I introduced in class, using car sales data. This lab will help you get comfortable with:

  • Basic data exploration
  • Column selection and manipulation
  • Creating new variables
  • Filtering data
  • Grouping and summarizing

Instructions: Copy this template into your portfolio repository under a lab_0/ folder, then complete each section with your code and answers. Write your code beneath the comment in each chunk. Be sure to also copy the data folder into your lab_0 folder.

Setup

# Load the tidyverse library
library(tidyverse)

# Read in the car sales data
# Make sure the data file is in your lab_0/data/ folder
car_data <- read_csv("data/car_sales_data.csv")

Exercise 1: Getting to Know Your Data

1.1 Data Structure Exploration

Explore the structure of your data and answer these questions:

# Use glimpse() to see the data structure
glimpse(car_data)
Rows: 50,000
Columns: 7
$ Manufacturer          <chr> "Ford", "Porsche", "Ford", "Toyota", "VW", "Ford…
$ Model                 <chr> "Fiesta", "718 Cayman", "Mondeo", "RAV4", "Polo"…
$ `Engine size`         <dbl> 1.0, 4.0, 1.6, 1.8, 1.0, 1.4, 1.8, 1.4, 1.2, 2.0…
$ `Fuel type`           <chr> "Petrol", "Petrol", "Diesel", "Hybrid", "Petrol"…
$ `Year of manufacture` <dbl> 2002, 2016, 2014, 1988, 2006, 2018, 2010, 2015, …
$ Mileage               <dbl> 127300, 57850, 39190, 210814, 127869, 33603, 866…
$ Price                 <dbl> 3074, 49704, 24072, 1705, 4101, 29204, 14350, 30…
# Check the column names
names(car_data)
[1] "Manufacturer"        "Model"               "Engine size"        
[4] "Fuel type"           "Year of manufacture" "Mileage"            
[7] "Price"              
# Look at the first few rows
head(car_data)
# A tibble: 6 × 7
  Manufacturer Model     `Engine size` `Fuel type` `Year of manufacture` Mileage
  <chr>        <chr>             <dbl> <chr>                       <dbl>   <dbl>
1 Ford         Fiesta              1   Petrol                       2002  127300
2 Porsche      718 Caym…           4   Petrol                       2016   57850
3 Ford         Mondeo              1.6 Diesel                       2014   39190
4 Toyota       RAV4                1.8 Hybrid                       1988  210814
5 VW           Polo                1   Petrol                       2006  127869
6 Ford         Focus               1.4 Petrol                       2018   33603
# ℹ 1 more variable: Price <dbl>

Questions to answer:

  • How many rows and columns does the dataset have?
  • What types of variables do you see (numeric, character, etc.)?
  • Are there any column names that might cause problems? Why?

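Your answers:

  • Rows: 50,000
  • Columns: 7
  • Variable types: three character columns (Manufacturer, Model, Fuel type) and four numeric (double) columns (Engine size, Year of manufacture, Mileage, Price)
  • Problematic names: Engine size, Fuel type, and Year of manufacture contain spaces, so they are not valid R names and must be wrapped in backticks whenever they are referenced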
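
The same questions can also be answered programmatically. The sketch below is a minimal illustration on a made-up three-row tibble (toy_cars and its values are assumptions, not the lab data); it reports the dimensions, the class of each column, and any column names that are not syntactically valid in R.

```r
library(dplyr)

# Toy stand-in for car_data (values chosen only for illustration)
toy_cars <- tibble(
  Manufacturer = c("Ford", "Toyota", "VW"),
  `Engine size` = c(1.0, 1.8, 1.4),
  Price = c(3074, 1705, 4101)
)

# Number of rows and columns
dim(toy_cars)
nrow(toy_cars)
ncol(toy_cars)

# One class per column: "character" for text, "numeric" for doubles
col_types <- sapply(toy_cars, class)
col_types

# Names that differ from their make.names() version are not valid R names
bad_names <- names(toy_cars)[names(toy_cars) != make.names(names(toy_cars))]
bad_names
```

Running the same calls on car_data should agree with what glimpse() reports above.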

1.2 Tibble vs Data Frame

Compare how tibbles and data frames display:

# Look at the tibble version (what we have)
head(car_data)
# A tibble: 6 × 7
  Manufacturer Model     `Engine size` `Fuel type` `Year of manufacture` Mileage
  <chr>        <chr>             <dbl> <chr>                       <dbl>   <dbl>
1 Ford         Fiesta              1   Petrol                       2002  127300
2 Porsche      718 Caym…           4   Petrol                       2016   57850
3 Ford         Mondeo              1.6 Diesel                       2014   39190
4 Toyota       RAV4                1.8 Hybrid                       1988  210814
5 VW           Polo                1   Petrol                       2006  127869
6 Ford         Focus               1.4 Petrol                       2018   33603
# ℹ 1 more variable: Price <dbl>
# Convert to regular data frame and display
car_df <- as.data.frame(car_data)
head(car_df)
  Manufacturer      Model Engine size Fuel type Year of manufacture Mileage
1         Ford     Fiesta         1.0    Petrol                2002  127300
2      Porsche 718 Cayman         4.0    Petrol                2016   57850
3         Ford     Mondeo         1.6    Diesel                2014   39190
4       Toyota       RAV4         1.8    Hybrid                1988  210814
5           VW       Polo         1.0    Petrol                2006  127869
6         Ford      Focus         1.4    Petrol                2018   33603
  Price
1  3074
2 49704
3 24072
4  1705
5  4101
6 29204

Question: What differences do you notice in how they print?

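Your answer: Tibbles print a header with the dimensions, show each column’s type beneath its name, and display only the columns that fit the screen width, truncating long values (718 Caym…) and listing the remaining columns (Price) in a footer. Data frames show none of this metadata: every column is printed in full, wrapping onto extra lines when the output is too wide. Without head(), a tibble would still print only its first 10 rows, while a data frame would print far more, which makes large data frames harder to read.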
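
To see where the printing difference comes from, it can help to inspect the class of each object. The following is a small sketch on made-up data (toy_tbl and toy_df are illustrative names, not part of the lab); it also shows the n and width arguments that the tibble print method accepts.

```r
library(dplyr)

# Toy tibble (illustrative values only)
toy_tbl <- tibble(
  Manufacturer = c("Ford", "Porsche", "Toyota"),
  Model = c("Fiesta", "718 Cayman", "RAV4"),
  Price = c(3074, 49704, 1705)
)

# Converting drops the tibble classes, which changes how the object prints
toy_df <- as.data.frame(toy_tbl)

class(toy_tbl)  # includes "tbl_df" in addition to "data.frame"
class(toy_df)   # only "data.frame"

# Tibble printing can be adjusted: show 2 rows and never hide columns
print(toy_tbl, n = 2, width = Inf)
```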

Exercise 2: Basic Column Operations

2.1 Selecting Columns

Practice selecting different combinations of columns:

# Select just Model and Mileage columns
head(select(car_df, Model, Mileage))
       Model Mileage
1     Fiesta  127300
2 718 Cayman   57850
3     Mondeo   39190
4       RAV4  210814
5       Polo  127869
6      Focus   33603
# Select Manufacturer, Price, and Fuel type
head(select(car_df, Manufacturer, Price, `Fuel type`))
  Manufacturer Price Fuel type
1         Ford  3074    Petrol
2      Porsche 49704    Petrol
3         Ford 24072    Diesel
4       Toyota  1705    Hybrid
5           VW  4101    Petrol
6         Ford 29204    Petrol
# Challenge: Select all columns EXCEPT Engine size
head(select(car_df, -`Engine size`))
  Manufacturer      Model Fuel type Year of manufacture Mileage Price
1         Ford     Fiesta    Petrol                2002  127300  3074
2      Porsche 718 Cayman    Petrol                2016   57850 49704
3         Ford     Mondeo    Diesel                2014   39190 24072
4       Toyota       RAV4    Hybrid                1988  210814  1705
5           VW       Polo    Petrol                2006  127869  4101
6         Ford      Focus    Petrol                2018   33603 29204

2.2 Renaming Columns

Let’s fix a problematic column name:

# Rename 'Year of manufacture' to year
car_df <- rename(car_df, year = `Year of manufacture`)

# Check that it worked
names(car_df)
[1] "Manufacturer" "Model"        "Engine size"  "Fuel type"    "year"        
[6] "Mileage"      "Price"       

Question: Why did we need backticks around Year of manufacture but not around year?

Your answer: Backticks are required for Year of manufacture because it contains spaces and is not a valid R identifier. The column name year follows R’s naming rules and can be referenced directly without backticks.
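
As a hedged illustration of the backtick rule, the sketch below uses a made-up two-row tibble and a hypothetical helper, to_snake(), that is not part of dplyr or of this lab. It shows one way all the space-containing names could be cleaned in a single step with rename_with(), so backticks are no longer needed afterwards.

```r
library(dplyr)

# Toy data with the same kind of non-syntactic names as the lab data
toy_cars <- tibble(
  Model = c("Fiesta", "RAV4"),
  `Year of manufacture` = c(2002, 1988),
  `Fuel type` = c("Petrol", "Hybrid")
)

# Names containing spaces must be wrapped in backticks when used as symbols
select(toy_cars, Model, `Fuel type`)

# Hypothetical helper: lower-case a name and replace spaces with underscores
to_snake <- function(x) gsub(" ", "_", tolower(x))

# rename_with() applies the function to every column name
clean_cars <- rename_with(toy_cars, to_snake)
names(clean_cars)
```

Renaming one column at a time with rename(), as done above, is equally valid; this version is simply less repetitive when several names need fixing.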

Exercise 3: Creating New Columns

3.1 Calculate Car Age

# Create an 'age' column (2025 minus year of manufacture)
# and a mileage_per_year column (Mileage divided by age)
car_df <- mutate(
  car_df,
  age = 2025 - year,
  mileage_per_year = Mileage / age
)

# Look at your new columns
head(select(car_df, Model, year, age, Mileage, mileage_per_year))
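
Two details of this mutate() call are worth a small sketch: a column created earlier in the call (age) can be used later in the same call, and dividing by an age of 0 yields Inf. The example below uses made-up rows, including a hypothetical car built in 2025 (an assumption for illustration; the lab data may contain no such car), and guards the division with if_else().

```r
library(dplyr)

# Toy data; the second row is a hypothetical car built in 2025
toy_cars <- tibble(
  Model = c("Fiesta", "Hypothetical new car"),
  year = c(2002, 2025),
  Mileage = c(127300, 1200)
)

toy_cars <- mutate(
  toy_cars,
  age = 2025 - year,
  # age is already available here; return NA instead of Inf when age is 0
  mileage_per_year = if_else(age > 0, Mileage / age, NA_real_)
)

toy_cars
```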

3.2 Categorize Cars

# Create a price_category column with case_when(): prices below 15000 are coded as budget, 15000 to 30000 as midrange, and above 30000 as premium
car_df <- mutate(
  car_df,
  price_category = case_when(
    Price < 15000 ~ "budget",
    Price >= 15000 & Price <= 30000 ~ "midrange",
    Price > 30000 ~ "premium"
  )
)

# Check your categories: select the new column and show it
head(select(car_df, Model, Price, price_category))
       Model Price price_category
1     Fiesta  3074         budget
2 718 Cayman 49704        premium
3     Mondeo 24072       midrange
4       RAV4  1705         budget
5       Polo  4101         budget
6      Focus 29204       midrange
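
Because case_when() checks its conditions in order and stops at the first one that is TRUE, the same categories can be written with shorter conditions. This is a sketch on made-up prices, not a replacement for the answer above; note the caveat in the comments about missing prices.

```r
library(dplyr)

# Toy prices chosen to sit on and around the category boundaries
toy_cars <- tibble(Price = c(3074, 15000, 30000, 30297))

toy_cars <- mutate(
  toy_cars,
  price_category = case_when(
    Price < 15000 ~ "budget",
    # Only reached when Price >= 15000, so the lower bound is implied
    Price <= 30000 ~ "midrange",
    # Fallback for everything else. Caveat: a missing (NA) Price would
    # also land here, whereas the explicit version above would return NA.
    TRUE ~ "premium"
  )
)

toy_cars
```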

Exercise 4: Filtering Practice

4.1 Basic Filtering

# Find all Toyota cars
head(filter(car_df, Manufacturer == "Toyota"))
  Manufacturer Model Engine size Fuel type year Mileage Price age
1       Toyota  RAV4         1.8    Hybrid 1988  210814  1705  37
2       Toyota Prius         1.4    Hybrid 2015   30663 30297  10
3       Toyota  RAV4         2.2    Petrol 2007   79393 16026  18
4       Toyota Yaris         1.4    Petrol 1998   97286  4046  27
5       Toyota  RAV4         2.4    Hybrid 2003  117425 11667  22
6       Toyota Yaris         1.2    Petrol 1992  245990   720  33
  mileage_per_year price_category
1         5697.676         budget
2         3066.300        premium
3         4410.722       midrange
4         3603.185         budget
5         5337.500         budget
6         7454.242         budget
# Find cars with mileage less than 30,000
head(filter(car_df, Mileage < 30000))
  Manufacturer      Model Engine size Fuel type year Mileage Price age
1       Toyota       RAV4         2.0    Hybrid 2018   28381 52671   7
2           VW       Golf         2.0    Petrol 2020   18985 36387   5
3          BMW         M5         4.0    Petrol 2017   22759 97758   8
4       Toyota       RAV4         2.4    Petrol 2018   24588 49125   7
5           VW       Golf         2.0    Hybrid 2018   25017 36957   7
6      Porsche 718 Cayman         2.4    Petrol 2021   14070 69526   4
  mileage_per_year price_category
1         4054.429        premium
2         3797.000        premium
3         2844.875        premium
4         3512.571        premium
5         3573.857        premium
6         3517.500        premium
# Find luxury cars (price_category "premium") with low mileage (under 30,000)
head(filter(car_df, price_category == "premium" & Mileage < 30000))
  Manufacturer      Model Engine size Fuel type year Mileage Price age
1       Toyota       RAV4         2.0    Hybrid 2018   28381 52671   7
2           VW       Golf         2.0    Petrol 2020   18985 36387   5
3          BMW         M5         4.0    Petrol 2017   22759 97758   8
4       Toyota       RAV4         2.4    Petrol 2018   24588 49125   7
5           VW       Golf         2.0    Hybrid 2018   25017 36957   7
6      Porsche 718 Cayman         2.4    Petrol 2021   14070 69526   4
  mileage_per_year price_category
1         4054.429        premium
2         3797.000        premium
3         2844.875        premium
4         3512.571        premium
5         3573.857        premium
6         3517.500        premium

4.2 Multiple Conditions

# Find cars that are EITHER Honda OR Nissan
car_df %>%
  filter(Manufacturer == "Honda" | Manufacturer == "Nissan") %>%
  head()
 [1] Manufacturer     Model            Engine size      Fuel type       
 [5] year             Mileage          Price            age             
 [9] mileage_per_year price_category  
<0 rows> (or 0-length row.names)
# Find cars with price between $20,000 and $35,000
car_df %>%
  filter(Price >= 20000 & Price <= 35000) %>%
  head()
  Manufacturer  Model Engine size Fuel type year Mileage Price age
1         Ford Mondeo         1.6    Diesel 2014   39190 24072  11
2         Ford  Focus         1.4    Petrol 2018   33603 29204   7
3       Toyota  Prius         1.4    Hybrid 2015   30663 30297  10
4       Toyota  Prius         1.4    Hybrid 2016   43893 29946   9
5       Toyota  Prius         1.4    Hybrid 2016   43130 30085   9
6           VW Passat         1.6    Petrol 2016   64344 23641   9
  mileage_per_year price_category
1         3562.727       midrange
2         4800.429       midrange
3         3066.300        premium
4         4877.000       midrange
5         4792.222        premium
6         7149.333       midrange
# Find diesel cars less than 10 years old
diesel_under_10 <- car_df %>%
  filter(`Fuel type` == "Diesel", age < 10)

# Count the matching rows
nrow(diesel_under_10)
[1] 2040

Question: How many diesel cars are less than 10 years old?

Your answer: 2040
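
The Honda-or-Nissan filter earlier in this section returned zero rows. That is expected rather than a bug: the manufacturer counts in Exercise 5 list only BMW, Ford, Porsche, Toyota, and VW. The sketch below, on made-up rows, shows one way to check which values exist before filtering, and an equivalent way to write an "either/or" filter with %in%.

```r
library(dplyr)

# Toy data using manufacturers that do appear in the lab data
toy_cars <- tibble(
  Manufacturer = c("Ford", "Toyota", "VW", "Ford", "BMW"),
  Price = c(3074, 1705, 4101, 24072, 97758)
)

# List the distinct manufacturers before filtering on them
distinct(toy_cars, Manufacturer)

# %in% is equivalent to chaining == comparisons with |
no_match <- filter(toy_cars, Manufacturer %in% c("Honda", "Nissan"))
ford_or_bmw <- filter(toy_cars, Manufacturer %in% c("Ford", "BMW"))

nrow(no_match)     # neither brand is present in the toy data
nrow(ford_or_bmw)
```

%in% scales better than repeated | comparisons when the list of values grows.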

Exercise 5: Grouping and Summarizing

5.1 Basic Summaries

# Calculate average price by manufacturer
avg_price_by_brand <- car_df %>%
  group_by(Manufacturer) %>%
  summarize(avg_price = mean(Price, na.rm = TRUE))

avg_price_by_brand
# A tibble: 5 × 2
  Manufacturer avg_price
  <chr>            <dbl>
1 BMW             24429.
2 Ford            10672.
3 Porsche         29104.
4 Toyota          14340.
5 VW              10363.
# Calculate average mileage by fuel type
avg_mileage_by_fuel <- car_df %>%
  group_by(`Fuel type`) %>%
  summarize(avg_mileage = mean(Mileage, na.rm = TRUE))

avg_mileage_by_fuel
# A tibble: 3 × 2
  `Fuel type` avg_mileage
  <chr>             <dbl>
1 Diesel          112667.
2 Hybrid          111622.
3 Petrol          112795.
# Count cars by manufacturer
count_by_manufacturer <- car_df %>%
  group_by(Manufacturer) %>%
  summarize(car_count = n())

count_by_manufacturer
# A tibble: 5 × 2
  Manufacturer car_count
  <chr>            <int>
1 BMW               4965
2 Ford             14959
3 Porsche           2609
4 Toyota           12554
5 VW               14913
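
The three summaries above each use a separate pipeline. As a sketch on made-up rows (toy_cars is an assumption, not the lab data), several statistics can be computed in one summarize() call, and count() gives the same result as group_by() followed by summarize(n()).

```r
library(dplyr)

# Toy data (illustrative values only)
toy_cars <- tibble(
  Manufacturer = c("Ford", "Ford", "Ford", "Toyota", "Toyota"),
  Price = c(3074, 24072, 29204, 1705, 30297)
)

# Several summaries per group in a single call
summary_by_brand <- toy_cars %>%
  group_by(Manufacturer) %>%
  summarize(
    avg_price = mean(Price, na.rm = TRUE),
    median_price = median(Price, na.rm = TRUE),
    car_count = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_price))

summary_by_brand

# count() is shorthand for group_by() + summarize(n())
count_short <- count(toy_cars, Manufacturer, name = "car_count")
count_short
```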

5.2 Categorical Summaries

# Frequency table for price categories
car_df %>%
  count(price_category)
  price_category     n
1         budget 34040
2       midrange  9782
3        premium  6178
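
Raw counts are easier to interpret as shares of the total. The sketch below rebuilds the frequency table from the counts printed above (typed in by hand rather than recomputed from the data) and adds a share column; on the real data, the same mutate() step could follow count(price_category) directly.

```r
library(dplyr)

# Counts copied from the frequency table above
price_counts <- tibble(
  price_category = c("budget", "midrange", "premium"),
  n = c(34040, 9782, 6178)
)

# Share of all cars that falls in each category
price_shares <- mutate(price_counts, share = n / sum(n))
price_shares
```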

Submission Notes

To submit this lab:

  1. Make sure your code runs without errors
  2. Fill in all the “[YOUR ANSWER]” sections and complete all of the empty code!
  3. Save this file in your portfolio’s lab_0/ folder
  4. Commit and push to GitHub
  5. Check that it appears on your GitHub Pages portfolio site

Questions? Post on the Canvas discussion board or come to office hours!