Lab 0: Getting Started with dplyr
My First Data Analysis
Overview
Welcome to your first lab! In this ungraded assignment, you’ll practice the fundamental dplyr operations I covered in class, using car sales data. This lab will help you get comfortable with:
- Basic data exploration
- Column selection and manipulation
- Creating new variables
- Filtering data
- Grouping and summarizing
Instructions: Copy this template into your portfolio repository under a lab_0/ folder, then complete each section with your code and answers. Write your code under the comment in each chunk. Be sure to also copy the data folder into your lab_0 folder.
Setup
# Load the tidyverse library
library(tidyverse)
# Read in the car sales data
# Make sure the data file is in your lab_0/data/ folder
car_data <- read_csv("data/car_sales_data.csv")
Exercise 1: Getting to Know Your Data
1.1 Data Structure Exploration
Explore the structure of your data and answer these questions:
# Use glimpse() to see the data structure
glimpse(car_data)
Rows: 50,000
Columns: 7
$ Manufacturer <chr> "Ford", "Porsche", "Ford", "Toyota", "VW", "Ford…
$ Model <chr> "Fiesta", "718 Cayman", "Mondeo", "RAV4", "Polo"…
$ `Engine size` <dbl> 1.0, 4.0, 1.6, 1.8, 1.0, 1.4, 1.8, 1.4, 1.2, 2.0…
$ `Fuel type` <chr> "Petrol", "Petrol", "Diesel", "Hybrid", "Petrol"…
$ `Year of manufacture` <dbl> 2002, 2016, 2014, 1988, 2006, 2018, 2010, 2015, …
$ Mileage <dbl> 127300, 57850, 39190, 210814, 127869, 33603, 866…
$ Price <dbl> 3074, 49704, 24072, 1705, 4101, 29204, 14350, 30…
# Check the column names
names(car_data)
[1] "Manufacturer" "Model" "Engine size"
[4] "Fuel type" "Year of manufacture" "Mileage"
[7] "Price"
# Look at the first few rows
head(car_data)
# A tibble: 6 × 7
Manufacturer Model `Engine size` `Fuel type` `Year of manufacture` Mileage
<chr> <chr> <dbl> <chr> <dbl> <dbl>
1 Ford Fiesta 1 Petrol 2002 127300
2 Porsche 718 Caym… 4 Petrol 2016 57850
3 Ford Mondeo 1.6 Diesel 2014 39190
4 Toyota RAV4 1.8 Hybrid 1988 210814
5 VW Polo 1 Petrol 2006 127869
6 Ford Focus 1.4 Petrol 2018 33603
# ℹ 1 more variable: Price <dbl>
Questions to answer:
- How many rows and columns does the dataset have?
- What types of variables do you see (numeric, character, etc.)?
- Are there any column names that might cause problems? Why?
Your answers:
- Rows: 50,000
- Columns: 7
- Variable types: character (Manufacturer, Model, Fuel type) and numeric double (Engine size, Year of manufacture, Mileage, Price)
- Problematic names: Engine size, Fuel type, and Year of manufacture, because they contain spaces and must be wrapped in backticks
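One way to avoid backticks later on is to standardize the names once, right after loading. The sketch below is only an illustration: toy_cars is a small made-up tibble (not the lab data), and the lowercase-with-underscores convention is an assumption, not something the lab requires.

```r
library(dplyr)

# Hypothetical toy data with the same kind of spaced names as the lab file
toy_cars <- tibble(
  Manufacturer = c("Ford", "VW"),
  `Engine size` = c(1.0, 1.4),
  `Year of manufacture` = c(2002, 2018)
)

# Lowercase every column name and replace spaces with underscores
clean_cars <- toy_cars %>%
  rename_with(~ gsub(" ", "_", tolower(.x)))

names(clean_cars)
```

After this step every column can be referenced without backticks, for example clean_cars$engine_size.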
1.2 Tibble vs Data Frame
Compare how tibbles and data frames display:
# Look at the tibble version (what we have)
head(car_data)
# A tibble: 6 × 7
Manufacturer Model `Engine size` `Fuel type` `Year of manufacture` Mileage
<chr> <chr> <dbl> <chr> <dbl> <dbl>
1 Ford Fiesta 1 Petrol 2002 127300
2 Porsche 718 Caym… 4 Petrol 2016 57850
3 Ford Mondeo 1.6 Diesel 2014 39190
4 Toyota RAV4 1.8 Hybrid 1988 210814
5 VW Polo 1 Petrol 2006 127869
6 Ford Focus 1.4 Petrol 2018 33603
# ℹ 1 more variable: Price <dbl>
# Convert to regular data frame and display
car_df <- as.data.frame(car_data)
head(car_df)
Manufacturer Model Engine size Fuel type Year of manufacture Mileage
1 Ford Fiesta 1.0 Petrol 2002 127300
2 Porsche 718 Cayman 4.0 Petrol 2016 57850
3 Ford Mondeo 1.6 Diesel 2014 39190
4 Toyota RAV4 1.8 Hybrid 1988 210814
5 VW Polo 1.0 Petrol 2006 127869
6 Ford Focus 1.4 Petrol 2018 33603
Price
1 3074
2 49704
3 24072
4 1705
5 4101
6 29204
Question: What differences do you notice in how they print?
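Your answer: The tibble prints its dimensions (6 × 7) and the type of each column, shows only the columns that fit on the screen (Price is listed in a footer instead), and abbreviates long values such as "718 Caym…". The data frame prints every column, wrapping Price onto a separate block, and shows neither dimensions nor column types. Without head(), a tibble would also print only its first 10 rows, while a data frame would print many more, which makes it harder to read.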
Exercise 2: Basic Column Operations
2.1 Selecting Columns
Practice selecting different combinations of columns:
# Select just Model and Mileage columns
head(select(car_df, Model, Mileage))
Model Mileage
1 Fiesta 127300
2 718 Cayman 57850
3 Mondeo 39190
4 RAV4 210814
5 Polo 127869
6 Focus 33603
# Select Manufacturer, Price, and Fuel type
head(select(car_df, Manufacturer, Price, `Fuel type`))
Manufacturer Price Fuel type
1 Ford 3074 Petrol
2 Porsche 49704 Petrol
3 Ford 24072 Diesel
4 Toyota 1705 Hybrid
5 VW 4101 Petrol
6 Ford 29204 Petrol
# Challenge: Select all columns EXCEPT Engine size
head(select(car_df, -`Engine size`))
Manufacturer Model Fuel type Year of manufacture Mileage Price
1 Ford Fiesta Petrol 2002 127300 3074
2 Porsche 718 Cayman Petrol 2016 57850 49704
3 Ford Mondeo Diesel 2014 39190 24072
4 Toyota RAV4 Hybrid 1988 210814 1705
5 VW Polo Petrol 2006 127869 4101
6 Ford Focus Petrol 2018 33603 29204
2.2 Renaming Columns
Let’s fix a problematic column name:
# Rename 'Year of manufacture' to year
car_df <- rename(car_df, year = `Year of manufacture`)
# Check that it worked
names(car_df)
[1] "Manufacturer" "Model" "Engine size" "Fuel type" "year"
[6] "Mileage" "Price"
Question: Why did we need backticks around Year of manufacture but not around year?
Your answer: Backticks are required for Year of manufacture because it contains spaces and is not a valid R identifier. The column name year follows R’s naming rules and can be referenced directly without backticks.
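A quick way to check which names need backticks is to compare each name with what base R's make.names() would turn it into. This is a rough sketch rather than an official test: the vector col_names is typed in by hand from the lab's column names, and a name counts as needing backticks whenever make.names() changes it.

```r
# Column names copied by hand from the lab data, plus the renamed `year`
col_names <- c("Year of manufacture", "Engine size", "year", "Mileage")

# make.names() rewrites a string into a syntactically valid R name;
# if the result differs from the input, the original needs backticks
needs_backticks <- make.names(col_names) != col_names

setNames(needs_backticks, col_names)
```

The two names containing spaces come back TRUE, while year and Mileage come back FALSE.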
Exercise 3: Creating New Columns
3.1 Calculate Car Age
# Create an 'age' column (2025 minus year of manufacture)
# Create a mileage_per_year column
car_df <- mutate(
  car_df,
  age = 2025 - year,
  mileage_per_year = Mileage / age
)
# Look at your new columns
head(select(car_df, Model, year, age, Mileage, mileage_per_year))
3.2 Categorize Cars
# Create a price_category column: a price below 15000 is coded as budget, 15000 to 30000 as midrange, and above 30000 as premium (use case_when)
car_df <- mutate(
  car_df,
  price_category = case_when(
    Price < 15000 ~ "budget",
    Price >= 15000 & Price <= 30000 ~ "midrange",
    Price > 30000 ~ "premium"
  )
)
# Check your categories: select the new column and show it
head(select(car_df, Model, Price, price_category))
Model Price price_category
1 Fiesta 3074 budget
2 718 Cayman 49704 premium
3 Mondeo 24072 midrange
4 RAV4 1705 budget
5 Polo 4101 budget
6 Focus 29204 midrange
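case_when() checks its conditions in order and uses the first one that is TRUE, so later conditions can lean on earlier ones. The sketch below shows that idea on a made-up tibble (toy_prices is hypothetical, chosen to hit the boundary values and a missing price). The TRUE catch-all is one common style, not the only one, and it is the reason the missing price is handled first.

```r
library(dplyr)

# Hypothetical prices, including both boundaries and a missing value
toy_prices <- tibble(Price = c(3074, 15000, 30000, 30001, NA))

toy_prices <- toy_prices %>%
  mutate(
    price_category = case_when(
      is.na(Price)   ~ NA_character_,  # keep missing prices missing
      Price < 15000  ~ "budget",
      Price <= 30000 ~ "midrange",     # only reached when Price >= 15000
      TRUE           ~ "premium"       # everything left is above 30000
    )
  )

toy_prices
```

Without the is.na() line, the TRUE catch-all would label a missing price as premium, which is probably not what you want.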
Exercise 4: Filtering Practice
4.1 Basic Filtering
# Find all Toyota cars
head(filter(car_df, Manufacturer == "Toyota"))
Manufacturer Model Engine size Fuel type year Mileage Price age
1 Toyota RAV4 1.8 Hybrid 1988 210814 1705 37
2 Toyota Prius 1.4 Hybrid 2015 30663 30297 10
3 Toyota RAV4 2.2 Petrol 2007 79393 16026 18
4 Toyota Yaris 1.4 Petrol 1998 97286 4046 27
5 Toyota RAV4 2.4 Hybrid 2003 117425 11667 22
6 Toyota Yaris 1.2 Petrol 1992 245990 720 33
mileage_per_year price_category
1 5697.676 budget
2 3066.300 premium
3 4410.722 midrange
4 3603.185 budget
5 5337.500 budget
6 7454.242 budget
# Find cars with mileage less than 30,000
head(filter(car_df, Mileage < 30000))
Manufacturer Model Engine size Fuel type year Mileage Price age
1 Toyota RAV4 2.0 Hybrid 2018 28381 52671 7
2 VW Golf 2.0 Petrol 2020 18985 36387 5
3 BMW M5 4.0 Petrol 2017 22759 97758 8
4 Toyota RAV4 2.4 Petrol 2018 24588 49125 7
5 VW Golf 2.0 Hybrid 2018 25017 36957 7
6 Porsche 718 Cayman 2.4 Petrol 2021 14070 69526 4
mileage_per_year price_category
1 4054.429 premium
2 3797.000 premium
3 2844.875 premium
4 3512.571 premium
5 3573.857 premium
6 3517.500 premium
# Find luxury cars (the premium price category) with low mileage
head(filter(car_df, price_category == "premium" & Mileage < 30000))
Manufacturer Model Engine size Fuel type year Mileage Price age
1 Toyota RAV4 2.0 Hybrid 2018 28381 52671 7
2 VW Golf 2.0 Petrol 2020 18985 36387 5
3 BMW M5 4.0 Petrol 2017 22759 97758 8
4 Toyota RAV4 2.4 Petrol 2018 24588 49125 7
5 VW Golf 2.0 Hybrid 2018 25017 36957 7
6 Porsche 718 Cayman 2.4 Petrol 2021 14070 69526 4
mileage_per_year price_category
1 4054.429 premium
2 3797.000 premium
3 2844.875 premium
4 3512.571 premium
5 3573.857 premium
6 3517.500 premium
4.2 Multiple Conditions
# Find cars that are EITHER Honda OR Nissan
car_df %>%
  filter(Manufacturer == "Honda" | Manufacturer == "Nissan") %>%
  head()
[1] Manufacturer Model Engine size Fuel type
[5] year Mileage Price age
[9] mileage_per_year price_category
<0 rows> (or 0-length row.names)
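A result with 0 rows is not an error: it means no row matched the condition. Before filtering on a category, it can help to list the values that actually occur. The sketch below uses a made-up tibble (toy_cars is hypothetical) holding the five manufacturers that appear in this lab's output, and shows %in% as a compact alternative to chaining == with |.

```r
library(dplyr)

# Hypothetical stand-in for car_df with the manufacturers seen in the lab output
toy_cars <- tibble(
  Manufacturer = c("Ford", "Porsche", "Toyota", "VW", "BMW", "Ford"),
  Price = c(3074, 49704, 1705, 4101, 97758, 29204)
)

# List the manufacturers that are actually present
toy_cars %>% distinct(Manufacturer)

# %in% keeps rows whose Manufacturer matches any value in the vector
honda_nissan <- toy_cars %>%
  filter(Manufacturer %in% c("Honda", "Nissan"))

nrow(honda_nissan)
```

Because neither Honda nor Nissan is in the toy data, the filter returns an empty data frame, just as it does above.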
# Find cars with price between $20,000 and $35,000
car_df %>%
  filter(Price >= 20000 & Price <= 35000) %>%
  head()
Manufacturer Model Engine size Fuel type year Mileage Price age
1 Ford Mondeo 1.6 Diesel 2014 39190 24072 11
2 Ford Focus 1.4 Petrol 2018 33603 29204 7
3 Toyota Prius 1.4 Hybrid 2015 30663 30297 10
4 Toyota Prius 1.4 Hybrid 2016 43893 29946 9
5 Toyota Prius 1.4 Hybrid 2016 43130 30085 9
6 VW Passat 1.6 Petrol 2016 64344 23641 9
mileage_per_year price_category
1 3562.727 midrange
2 4800.429 midrange
3 3066.300 premium
4 4877.000 midrange
5 4792.222 premium
6 7149.333 midrange
# Find diesel cars less than 10 years old
diesel_under_10 <- car_df %>%
  filter(`Fuel type` == "Diesel", age < 10)
Question: How many diesel cars are less than 10 years old?
Your answer: 2040
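To turn a filtered data frame into a count, ask for its number of rows. The sketch below shows the pattern on made-up rows (toy_cars is hypothetical), so the number it prints is not the lab answer. Running nrow(diesel_under_10) on the real data is what gives the answer above.

```r
library(dplyr)

# Hypothetical rows for illustration only
toy_cars <- tibble(
  `Fuel type` = c("Diesel", "Diesel", "Petrol", "Diesel"),
  age = c(4, 12, 3, 9)
)

# Comma-separated conditions in filter() are combined with AND
diesel_under_10 <- toy_cars %>%
  filter(`Fuel type` == "Diesel", age < 10)

# Number of matching rows
nrow(diesel_under_10)
```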
Exercise 5: Grouping and Summarizing
5.1 Basic Summaries
# Calculate average price by manufacturer
avg_price_by_brand <- car_data %>%
  group_by(Manufacturer) %>%
  summarize(avg_price = mean(Price, na.rm = TRUE))
avg_price_by_brand
# A tibble: 5 × 2
Manufacturer avg_price
<chr> <dbl>
1 BMW 24429.
2 Ford 10672.
3 Porsche 29104.
4 Toyota 14340.
5 VW 10363.
# Calculate average mileage by fuel type
avg_mileage_by_fuel <- car_df %>%
  group_by(`Fuel type`) %>%
  summarize(avg_mileage = mean(Mileage, na.rm = TRUE))
avg_mileage_by_fuel
# A tibble: 3 × 2
`Fuel type` avg_mileage
<chr> <dbl>
1 Diesel 112667.
2 Hybrid 111622.
3 Petrol 112795.
# Count cars by manufacturer
count_by_manufacturer <- car_df %>%
  group_by(Manufacturer) %>%
  summarize(car_count = n())
count_by_manufacturer
# A tibble: 5 × 2
Manufacturer car_count
<chr> <int>
1 BMW 4965
2 Ford 14959
3 Porsche 2609
4 Toyota 12554
5 VW 14913
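Counting rows per group is common enough that dplyr offers count() as a shortcut for group_by() followed by summarize(n()). The sketch below compares the two on a made-up tibble (toy_cars is hypothetical). The name and sort arguments are optional extras.

```r
library(dplyr)

# Hypothetical toy data
toy_cars <- tibble(Manufacturer = c("Ford", "VW", "Ford", "BMW", "Ford", "VW"))

# Long form: group, then count rows in each group
by_group <- toy_cars %>%
  group_by(Manufacturer) %>%
  summarize(car_count = n())

# Short form: count() does the grouping and counting in one step
by_count <- toy_cars %>%
  count(Manufacturer, name = "car_count")

# sort = TRUE puts the largest groups first
by_count_sorted <- toy_cars %>%
  count(Manufacturer, name = "car_count", sort = TRUE)

by_count_sorted
```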
5.2 Categorical Summaries
# Frequency table for price categories
car_df %>%
  count(price_category)
price_category n
1 budget 34040
2 midrange 9782
3 premium 6178
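Raw counts are often easier to interpret as shares of the total. The sketch below adds a share column to a frequency table built from made-up categories (toy_cars and the column name share are hypothetical). The same mutate() step could be chained onto the count above.

```r
library(dplyr)

# Hypothetical toy data
toy_cars <- tibble(
  price_category = c("budget", "budget", "budget", "midrange", "premium")
)

# Divide each count by the total to get the share of cars in each category
category_share <- toy_cars %>%
  count(price_category) %>%
  mutate(share = n / sum(n))

category_share
```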
Submission Notes
To submit this lab:
1. Make sure your code runs without errors
2. Fill in all the “[YOUR ANSWER]” sections and complete all of the empty code!
3. Save this file in your portfolio’s lab_0/ folder
4. Commit and push to GitHub
5. Check that it appears on your GitHub Pages portfolio site
Questions? Post on the Canvas discussion board or come to office hours!