This project is part of Google Data Analytics Course on Coursera.
In 2016, Cyclistic launched a successful bike-sharing offering. Since then, the program has grown to a fleet of 5,824 bikes that are geo-tracked and locked into a network of 692 stations across Chicago. Bikes can be accessed from one station and returned to another at any time in the system.
To date, Cyclistic’s marketing strategy has relied on efforts to build general awareness and attract a broad consumer segment. One approach that helps make this happen is price flexibility: one-way tickets, full-day passes, and annual memberships. Customers who purchase a one-way ticket or a full day ticket are referred to as casual riders. Customers who purchase an annual membership are Cyclistic members. Therefore, the marketing analyst team needs to better understand the differences between annual members and regular riders in this month.
For this project, we will use the data for September 2022 here.
This dataset is in the form of CSV (comma-separated value) with 701,339 rows of data. The field in data included:
library("readr")
library("dplyr")
#Importing data
divvy_202209 <- read_csv("~/divvy-tripdata/202209-divvy-publictripdata.csv")
divvy_202208 <- read_csv("~/divvy-tripdata/202208-divvy-tripdata.csv")
divvy_202207 <- read_csv("~/divvy-tripdata/202207-divvy-tripdata.csv")
divvy_202206 <- read_csv("~/divvy-tripdata/202206-divvy-tripdata.csv")
divvy_202205 <- read_csv("~/divvy-tripdata/202205-divvy-tripdata.csv")
divvy_202204 <- read_csv("~/divvy-tripdata/202204-divvy-tripdata.csv")
divvy_202203 <- read_csv("~/divvy-tripdata/202203-divvy-tripdata.csv")
divvy_202202 <- read_csv("~/divvy-tripdata/202202-divvy-tripdata.csv")
divvy_202201 <- read_csv("~/divvy-tripdata/202201-divvy-tripdata.csv")
divvy_202112 <- read_csv("~/divvy-tripdata/202112-divvy-tripdata.csv")
divvy_202111 <- read_csv("~/divvy-tripdata/202111-divvy-tripdata.csv")
divvy_202110 <- read_csv("~/divvy-tripdata/202110-divvy-tripdata.csv")
#Checking data dimension
dim(divvy_202209)
## [1] 701339 13
dim(divvy_202208)
## [1] 785932 13
dim(divvy_202207)
## [1] 823488 13
dim(divvy_202206)
## [1] 769204 13
dim(divvy_202205)
## [1] 634858 13
dim(divvy_202204)
## [1] 371249 13
dim(divvy_202203)
## [1] 284042 13
dim(divvy_202202)
## [1] 115609 13
dim(divvy_202201)
## [1] 103770 13
dim(divvy_202112)
## [1] 247540 13
dim(divvy_202111)
## [1] 359978 13
dim(divvy_202110)
## [1] 631226 13
#Checking columns names
colnames(divvy_202209)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202208)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202207)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202206)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202205)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202204)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202203)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202202)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202201)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202112)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202111)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202110)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
Before we start to clean the data, we need to merge all the data into one dataset to make it easier.
#Merging all data
all_data <- bind_rows(divvy_202209, divvy_202208, divvy_202207, divvy_202206, divvy_202205, divvy_202204, divvy_202203, divvy_202202, divvy_202201, divvy_202112, divvy_202111, divvy_202110)
#Checking data dimension
dim(all_data)
## [1] 5828235 13
After we merge all datasets, we get that the data has 5828235 rows and 13 columns.
We will analyze the behavior of member and casual users, so we only need ride_id, rideable_type, started_at, ended_at, and member_casual.
#Selecting columns
all_data <- all_data %>%
select(c(ride_id, rideable_type, started_at,ended_at,member_casual))
head(all_data)
## # A tibble: 6 × 5
## ride_id rideable_type started_at ended_at membe…¹
## <chr> <chr> <dttm> <dttm> <chr>
## 1 5156990AC19CA285 electric_bike 2022-09-01 08:36:22 2022-09-01 08:39:05 casual
## 2 E12D4A16BF51C274 electric_bike 2022-09-01 17:11:29 2022-09-01 17:14:45 casual
## 3 A02B53CD7DB72DD7 electric_bike 2022-09-01 17:15:50 2022-09-01 17:16:12 casual
## 4 C82E05FEE872DF11 electric_bike 2022-09-01 09:00:28 2022-09-01 09:10:32 casual
## 5 4DEEB4550A266AE1 electric_bike 2022-09-01 07:30:11 2022-09-01 07:32:36 casual
## 6 B1721F8C7C3AC6BF electric_bike 2022-09-01 12:04:25 2022-09-01 12:21:08 casual
## # … with abbreviated variable name ¹​member_casual
We will analyze the ride length of users, so we will make new column that shows the length of ride in minutes. We will count the difference between start and end time of ride.
#Make new column for ride length
all_data$ride_length <- difftime(all_data$ended_at, all_data$started_at)
all_data$ride_length <- as.double(all_data$ride_length)
Then, we remove the row with negative values of ride length.
#Remove the row with negative value of ride length
all_data_v2 <- all_data[!all_data$ride_length<0,]
We will analyze bike usage for each day, making new columns for the started day, ended day, and each day in a week.
#Make new column for bike usage in start_date, end_date
all_data_v2$started_date <- as.Date(all_data_v2$started_at)
all_data_v2$ended_date <- as.Date(all_data_v2$ended_at)
all_data_v2$Day_of_week <- format(as.Date(all_data_v2$started_date), "%A")
all_data_v2$Month <- format(as.Date(all_data_v2$started_date), "%B")
We will check data summary of ride length to see any outliers.
#Summary of ride length
summary(all_data_v2$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 356 629 1176 1131 2442301
We get that the minimum ride length is 0 sec, the 1st quantile is 356 secs, the median is 629 secs, the 3rd quantile is 1131 secs, the maximum ride length is 2442301 secs, and the mean is 1176 secs. From the summary, we get the possibilities that the usage of bikes can be more than one day. It can affect our analysis if we want to check the average ride length for each day.
Furthermore, we need to filter the data to analyze the ride that starts and ends on the same day.
#Filtering the data
all_data_v3 <- all_data_v2 %>%
filter(started_date == ended_date)
We check the summary from the new dataset.
#Summary of ride length in new dataset
summary(all_data_v3$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 355.0 626.0 961.3 1123.0 84509.0
We will compare the mean, minimum, maximum, and median of member and casual users in ride length for the same-day trip.
#Compare member and casual user in ride length
mean_data <- aggregate(all_data_v3$ride_length ~ all_data_v3$member_casual, FUN = mean)
min_data <- aggregate(all_data_v3$ride_length ~ all_data_v3$member_casual, FUN = min)
max_data <- aggregate(all_data_v3$ride_length ~ all_data_v3$member_casual, FUN = max)
median_data <- aggregate(all_data_v3$ride_length ~ all_data_v3$member_casual, FUN = median)
mean_data
## all_data_v3$member_casual all_data_v3$ride_length
## 1 casual 1285.6031
## 2 member 735.6604
min_data
## all_data_v3$member_casual all_data_v3$ride_length
## 1 casual 0
## 2 member 0
max_data
## all_data_v3$member_casual all_data_v3$ride_length
## 1 casual 84509
## 2 member 69947
median_data
## all_data_v3$member_casual all_data_v3$ride_length
## 1 casual 801
## 2 member 532
From the data, we can see that the mean ride lengths for the same-day trip are 1285.6031 secs for casual users and 735.6604 for member users.
The minimum ride length for both users is 0 seconds and the maximum ride lengths for casual and member users are 84509 secs and 69947 secs, consecutively.
The median ride length for casual and member users is 801 secs and 532 secs.
Then, we compare the total of members and casual users based on rideable type.
#Compare member and casual user with rideable type
library("janitor")
all_data_v3 %>%
tabyl(rideable_type,member_casual) %>%
adorn_totals()
## rideable_type casual member
## classic_bike 930711 1793134
## docked_bike 186999 0
## electric_bike 1258594 1622597
## Total 2376304 3415731
We can see that there are no member users who use docked bike. The number of member users who use classic bike and electric bike are larger than casual users. We also can see the total of member users is larger than casual users.
Then, we compare the total of member and casual user based on ride length each day and each month.
#Compare member and casual user in ride length each day
aggregate(all_data_v3$ride_length ~ all_data_v3$member_casual + all_data_v3$Day_of_week, FUN =mean)
## all_data_v3$member_casual all_data_v3$Day_of_week all_data_v3$ride_length
## 1 casual Friday 1197.4401
## 2 member Friday 721.5283
## 3 casual Monday 1313.9411
## 4 member Monday 712.7514
## 5 casual Saturday 1439.2714
## 6 member Saturday 819.5850
## 7 casual Sunday 1483.7539
## 8 member Sunday 818.5250
## 9 casual Thursday 1131.3061
## 10 member Thursday 708.0044
## 11 casual Tuesday 1147.4143
## 12 member Tuesday 698.5669
## 13 casual Wednesday 1114.3522
## 14 member Wednesday 701.5080
#Compare member and casual user in ride length each month
aggregate(all_data_v3$ride_length ~ all_data_v3$member_casual + all_data_v3$Month, FUN =mean)
## all_data_v3$member_casual all_data_v3$Month all_data_v3$ride_length
## 1 casual April 1318.0061
## 2 member April 668.7981
## 3 casual August 1240.4946
## 4 member August 774.3197
## 5 casual December 1024.2297
## 6 member December 638.9638
## 7 casual February 1109.5114
## 8 member February 641.4960
## 9 casual January 959.1474
## 10 member January 646.1426
## 11 casual July 1341.2215
## 12 member July 798.4347
## 13 casual June 1342.9415
## 14 member June 805.6161
## 15 casual March 1355.4222
## 16 member March 686.1239
## 17 casual May 1454.2127
## 18 member May 773.2102
## 19 casual November 1063.0526
## 20 member November 647.2846
## 21 casual October 1307.7080
## 22 member October 720.8491
## 23 casual September 1154.8565
## 24 member September 744.7380
The days and months in the output are not in order, so we need to rearrange it first.
#Ordering day
all_data_v3$Day_of_week <- ordered(all_data_v3$Day_of_week, levels=c("Monday", "Tuesday", "Wednesday", "Thursday","Friday","Saturday","Sunday"))
#Ordering month
all_data_v3$Month <- ordered(all_data_v3$Month, levels=c("January", "February", "March", "April","May","June","July", "August", "September", "October", "November", "December"))
Then we can check the data with new order.
#Compare member and casual user in ride length each day
aggregate(all_data_v3$ride_length ~ all_data_v3$member_casual + all_data_v3$Day_of_week, FUN =mean)
## all_data_v3$member_casual all_data_v3$Day_of_week all_data_v3$ride_length
## 1 casual Monday 1313.9411
## 2 member Monday 712.7514
## 3 casual Tuesday 1147.4143
## 4 member Tuesday 698.5669
## 5 casual Wednesday 1114.3522
## 6 member Wednesday 701.5080
## 7 casual Thursday 1131.3061
## 8 member Thursday 708.0044
## 9 casual Friday 1197.4401
## 10 member Friday 721.5283
## 11 casual Saturday 1439.2714
## 12 member Saturday 819.5850
## 13 casual Sunday 1483.7539
## 14 member Sunday 818.5250
#Compare member and casual user in ride length each month
aggregate(all_data_v3$ride_length ~ all_data_v3$member_casual + all_data_v3$Month, FUN =mean)
## all_data_v3$member_casual all_data_v3$Month all_data_v3$ride_length
## 1 casual January 959.1474
## 2 member January 646.1426
## 3 casual February 1109.5114
## 4 member February 641.4960
## 5 casual March 1355.4222
## 6 member March 686.1239
## 7 casual April 1318.0061
## 8 member April 668.7981
## 9 casual May 1454.2127
## 10 member May 773.2102
## 11 casual June 1342.9415
## 12 member June 805.6161
## 13 casual July 1341.2215
## 14 member July 798.4347
## 15 casual August 1240.4946
## 16 member August 774.3197
## 17 casual September 1154.8565
## 18 member September 744.7380
## 19 casual October 1307.7080
## 20 member October 720.8491
## 21 casual November 1063.0526
## 22 member November 647.2846
## 23 casual December 1024.2297
## 24 member December 638.9638
For make the data easier to see, we will make the visualization.
library("ggplot2")
library("lubridate")
#Analyze number of rides on weekday by user type
all_data_v3 %>%
mutate(weekday = lubridate::wday(started_date, label = TRUE)) %>% #creates weekday field using wday()
group_by(member_casual, weekday, rideable_type) %>% #groups by usertype and weekday'
summarise(number_of_rides = n(), #calculates the number of rides and average duration
average_duration = mean(ride_length)) %>% # calculates the average duration
arrange(member_casual, weekday) %>% # sorts
ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Number of Rides on Weekday", subtitle = "Based on same-day trip in October 2021 - September 2022", x="Day", y="Number of Rides", fill="User type") +
theme(plot.title =element_text(hjust = 0.5), plot.subtitle =element_text(hjust = 0.5) )
From the graphic, we can see that the number of rides for member users is higher than casual users.
The highest number of rides by member users is on Tuesday, and the highest number of rides by casual users is on Saturday.
#Analyze number of rides on weekday by user type and bike type
all_data_v3 %>%
mutate(weekday = lubridate::wday(started_date, label = TRUE)) %>% #creates weekday field using wday()
group_by(member_casual, weekday, rideable_type) %>% #groups by usertype and weekday'
summarise(number_of_rides = n(), #calculates the number of rides and average duration
average_duration = mean(ride_length)) %>% # calculates the average duration
arrange(member_casual, weekday) %>% # sorts
ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Number of Rides on Weekday", subtitle = "Based on same-day trip in October 2021 - September 2022", x="Day", y="Number of Rides", fill="User type") +
theme(plot.title =element_text(hjust = 0.5),plot.subtitle = element_text(hjust = 0.5)) +
facet_wrap("rideable_type",ncol=2,nrow =2)
From the graph, we can see the highest number of rides by casual users and member users with the classic bike are on Tuesday and Saturday, consecutively.
The highest number of rides by casual users and member users with electric bikes are on Saturday and Wednesday, consecutively.
The highest number of rides by casual users with the docked bike is on Saturday.
#Analyze average duration on weekday by user type
all_data_v3 %>%
mutate(weekday = lubridate::wday(started_date, label = TRUE)) %>% #creates weekday field using wday()
group_by(member_casual, weekday) %>% #groups by usertype and weekday'
summarise(number_of_rides = n(), #calculates the number of rides and average duration
average_duration = mean(ride_length)) %>% # calculates the average duration
arrange(member_casual, weekday) %>% # sorts
ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Average Duration on Weekday",subtitle = "Based on same-day trip in October 2021 - September 2022", x="Day", y="Average Duration (in secs)", fill="User type") +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
We can see from the graphic that casual users have a longer average trip duration per day than member users, with the highest being on Saturdays. The highest average trip duration for member users is on Saturday and Sunday.
#Analyze average duration on weekday by user type
all_data_v3 %>%
mutate(weekday = lubridate::wday(started_date, label = TRUE)) %>% #creates weekday field using wday()
group_by(member_casual, weekday, rideable_type) %>% #groups by usertype and weekday'
summarise(number_of_rides = n(), #calculates the number of rides and average duration
average_duration = mean(ride_length)) %>% # calculates the average duration
arrange(member_casual, weekday) %>% # sorts
ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Average Duration on Weekday",subtitle = "Based on same-day trip in October 2021 - September 2022", x="Day", y="Average Duration (in secs)", fill="User type") +
theme(plot.title =element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) +
facet_wrap("rideable_type",ncol=2,nrow =2)
We can see the average trip duration of the docked bike are higher than other types of bikes, with the highest being on Sunday.
For the classic bike, the highest trip average duration are by casual users on Saturday and Sunday. Then, the highest trip average duration for the classic bike by member users are also on Saturday and Sunday.
For the electric bike, the highest trip average duration are by casual users on Saturday and Sunday. Then, the highest trip average duration for the electric bike by member users are on Saturday and Sunday.
library("ggplot2")
library("lubridate")
library("zoo")
## Warning: package 'zoo' was built under R version 4.2.2
#Analyze number of rides on weekday by user type
all_data_v3 %>%
mutate(month_name = as.yearmon(started_date, "%m/%Y")) %>% #creates month field using yearmon
group_by(member_casual, month_name, rideable_type) %>% #groups by usertype and weekday'
summarise(number_of_rides = n(), #calculates the number of rides and average duration
average_duration = mean(ride_length)) %>% # calculates the average duration
arrange(member_casual, month_name) %>% # sorts
ggplot(aes(x = factor(month_name), y = number_of_rides, fill = member_casual)) +
geom_bar(position = "stack", stat="identity") +
labs(title = "Number of Rides per Month", subtitle = "Based on same-day trip in October 2021 - September 2022", x="Month", y="Number of Rides", fill="User type") +
theme(plot.title =element_text(hjust = 0.5), plot.subtitle =element_text(hjust = 0.5) )
For monthly number rides, we get that the number of bicycle usage increases towards summer with the highest number in August 2022, and decreases towards winter with the lowest number in January 2021.
From the analysis that we did, we get that:
From the conclusions, we can make some recommendations, such as: