Google Data Analyst Capstone

Case Study: Cyclistic bike-share analysis

Excel, RStudio, Tableau

Ask

In this phase I focused on understanding the needs of the stakeholders.

  • Understanding how casual riders and annual members use the service differently.
  • Helping the team designing a new marketing strategy to convert casual riders into annual members.
  • Producing data visualizations to identify trends.

Prepare

During the “prepare” phase I downloaded the dataset from June 2023 to March 2024. The dataset is divided in 12 .csv files containing a total amount of more than five millions of rows. Due to the big size of the files I decided to work on each file on excel before merging them together. I made sure that the headings were consistent across all of the files and with the same format. After this process I created for each file an additional column called “ride_length” subtracting the “start_at” column (start date and time of the rent of the bike) from the “end_at” (end date and time of the rent). The initial columns were in a “dd/mm/yyyy hh:mm:ss” format. The “ride_length” column resulting from this process has been formatted in  “time category” and then mutiplied by 86400 in order to get the seconds for each ride. =(D2-C2)*86400.

I made sure to copy and paste these value in values in order to eliminate the formula from the cells.

After this activity I decided to eliminate few columns not needed for this case study because not needed for the pourpose of the project. After the last step the columns of each file were: ride_id (unique code for each ride), rideable_type (stating if the bike was a classic or an electric one), started_at (start of the rent), ended_at (end of the rent), memeber_casual (stating the type of membership) and ride_length (stating how many seconds the ride lasted).

In the final step for the cleaning phase I finished to clean the data. For every excel sheet I managed to filter the ride lengths equal or minor of zero seconds, considered outliers because eighter wrong or produced by the  maintenance team during the works on the bikes.

Others columns for day and month contained in the “started_at” column will be calcultaed in a later activity during the visualization thanks to Tableau, keeping the single .csv files lighter in this phase.

 

Analyze

In this phase I used RStudio to collect and merge the spreadsheets of the separate months. You can follow verey steps marked by an ashtag to understand the whole process.

 

#Loading the library.

 

library(tidyverse)
library(conflicted)
conflict_prefer(“filter”, “dplyr”)
conflict_prefer(“lag”, “dplyr”)

 

 

 

 

#Retrieving the data. During this step I`m uploading the single datasets in RStudio.

 

 

a202306 <- read_csv(“202306-divvy-tripdata.csv”)
a202307 <- read_csv(“202307-divvy-tripdata.csv”)
a202308 <- read_csv(“202308-divvy-tripdata.csv”)
a202309 <- read_csv(“202309-divvy-tripdata.csv”)
a202310 <- read_csv(“202310-divvy-tripdata.csv”)
a202311 <- read_csv(“202311-divvy-tripdata.csv”)
a202312 <- read_csv(“202312-divvy-tripdata.csv”)
a202401 <- read_csv(“202401-divvy-tripdata.csv”)
a202402 <- read_csv(“202402-divvy-tripdata.csv”)
a202403 <- read_csv(“202403-divvy-tripdata.csv”)
a202404 <- read_csv(“202404-divvy-tripdata.csv”)
a202405 <- read_csv(“202405-divvy-tripdata.csv”)

 

 

 

 

#Creating a dataframe. With this command I`m creating a new dataset merging the spreadsheets all together. 

 

df_bikes <- bind_rows(a202306, a202307, a202308, a202309, a202310, a202311, a202312, a202401, a202402, a202403, a202404, a202405)

 

 

 

 

#Inspecting new dataset. These commands are very useful to understand the structure of the new .csv file.

 

 

glimpse(df_bikes)
str(df_bikes)
table(df_bikes$member_casual)

summary(df_bikes)

 

 

 

 

# Here I`m crating a small sample of the new dataset. Loading the whole document would take a very long amount of time.

 

 

subset_data <- head(df_bikes, 100)

view(subset_data)

 

 

 

#Creating a new .csv file in order to create visualizations on Tableau. This will be the big file that I will use on Tableau for visualizing the data.

 

 

write.csv(df_bikes, file = ‘bikes_file.csv’)

 

 

At the end of this process I merged the spreadsheets in a single csv. file. After that I can import this file in Tableau and evaluate the trends regarding member and casual users.

 

Share

In the pie chart above, we can see that over the past 12 months, the goal of the stakeholders has been to increase the percentage of members. Currently, this percentage is 64.32% of the total.

 Focusing on the ride length, we can observe the amount of time spent cycling each month for both categories. As anticipated, the ride length drops significantly during the cold months. Overall, we can see that our members tend to be more consistent than casual riders, especially during the period between May and August.

Analyzing the average ride lengths, we can see that although there are more members than casual riders, the casual riders tend to cycle longer throughout the entire year. Let’s focus on this trend.

Here we have the sum of the ride lengths for the two groups. The members appear very consistent with shorter ride lengths, while the casual riders seem to ride more during the weekends. We can appreciate this trend even more by analyzing the average in the next graph.

 Looking at the average, we can see that the casual group tends to ride longer, especially during the weekends. This might indicate that casual riders use our service for pleasure cycling during the weekends, while the members maintain the same average ride length across the week. This consistency suggests they predominantly use the service for commuting to work or school.

This graph shows the usage of classic bikes and electric bikes for both groups. Classic bikes are the most used by both groups. I need to highlight an inconsistency in the dataset. The “docked bike” group does not refer to any specific bike type. This issue must be resolved with the data entry team, as some of the bikes might be tagged in the wrong category.

Act

(In the case study, it’s not stated if this service has a phone app, but I’m considering this as a fact because it would be hard to manage this kind of activity without an in-app store to manage the subscriptions.)

Let’s explore a few strategic options that we might consider to increase our member riders population.

As we have seen before, casual members tend to ride for longer periods. A strategy we might want to consider is creating a system of cashback or coupons to be earned by members when they reach a certain number of kilometers every year. This might involve local shops in Chicago, such as coffee shops or food shops. Establishing these kinds of interactions means that the participating shops might want to display our service ads in their locations, increasing our visibility.

Another option is proposing discounts for the annual subscription, especially in the period between May and August when the casual riders population rises. This could be appealing considering the average ride length increase that we observed during this period.

My third option is a bit more complex. Recently, many sport apps track activity and provide feedbacks. This kind of service could highlight the benefits of cycling. Additionally, many of the apps I mentioned have digital prize schemes. The more you cycle, the more digital prizes you get in our app. Items like digital medals, clothes for an in-app avatar, and digital bike skins. This is the basic principle of gamification. Offering challenges and rewards could be a good approach to increase our member population.

A very similar approach is currently being used by an app called “Fantasy Hike,” which tracks steps. Fantasy Hike is a mobile application designed to motivate users to walk and exercise by turning their physical activity data into a virtual adventure. The app is themed around epic fantasy journeys inspired by the popular fantasy masterpiece “The Lord of the Rings”. Using geo-data collected by the phone, GPS, after a certain number of miles walked, it unlocks chapters of the story . You can find an article from “The Verge” discussing this app here.

Final considerations

At this point, I consider this case study finished. I just want to add a small section discussing the data used in this project. During the ‘prepare’ phase, I honestly struggled a bit with deciding which portion of the data should be considered outliers. Focusing on the ride_length column, we can see that the smallest values are under one second. This is a bit odd. Should a ride lasting less than one second be considered reliable?

On the maximum side of the values we have the same kind of issues where the highest value reaches 491 hours (1767958 seconds).

In the end, I considered these data for what they are: an exercise. They are fictional data, so I decided to include everything except values equal to zero and negative. In a real-life scenario, before starting the analysis, I would have questioned the reliability of the data and involved the data entry team to understand the true nature of these values.