Google Data Analyst Capstone
Case Study: Cyclistic bike-share analysis
Excel, RStudio, Tableau
You are a junior data analyst working on the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day.
Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals—as well as how you, as a junior data analyst, can help Cyclistic achieve them.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a solid opportunity to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs. Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
In this activity I will use a fictional dataset to conduct a data analysis project following the six steps of a good data analyst presentation: ask, prepare, process, analyze, share, and act. For the purpose of this project, I will treat the data as if it were real. However, at the end of this activity, I will include a small section to discuss the dataset from a realistic perspective.
Ask
In this phase I focused on understanding the needs of the stakeholders.
- Understanding how casual riders and annual members use the service differently.
- Helping the team designing a new marketing strategy to convert casual riders into annual members.
- Producing data visualizations to identify trends.
Prepare
During the “prepare” phase I downloaded the dataset from June 2023 to March 2024. The dataset is divided in 12 .csv files containing a total amount of more than five millions of rows. Due to the big size of the files I decided to work on each file on excel before merging them together. I made sure that the headings were consistent across all of the files and with the same format. After this process I created for each file an additional column called “ride_length” subtracting the “start_at” column (start date and time of the rent of the bike) from the “end_at” (end date and time of the rent). The initial columns were in a “dd/mm/yyyy hh:mm:ss” format. The “ride_length” column resulting from this process has been formatted in “time category” and then mutiplied by 86400 in order to get the seconds for each ride. =(D2-C2)*86400.
I made sure to copy and paste these value in values in order to eliminate the formula from the cells.
After this activity I decided to eliminate few columns not needed for this case study because not needed for the pourpose of the project. After the last step the columns of each file were: ride_id (unique code for each ride), rideable_type (stating if the bike was a classic or an electric one), started_at (start of the rent), ended_at (end of the rent), memeber_casual (stating the type of membership) and ride_length (stating how many seconds the ride lasted).
In the final step for the cleaning phase I finished to clean the data. For every excel sheet I managed to filter the ride lengths equal or minor of zero seconds, considered outliers because eighter wrong or produced by the maintenance team during the works on the bikes.
Others columns for day and month contained in the “started_at” column will be calcultaed in a later activity during the visualization thanks to Tableau, keeping the single .csv files lighter in this phase.
Analyze
In this phase I used RStudio to collect and merge the spreadsheets of the separate months. You can follow verey steps marked by an ashtag to understand the whole process.
#Loading the library.
library(tidyverse)
library(conflicted)
conflict_prefer(“filter”, “dplyr”)
conflict_prefer(“lag”, “dplyr”)
#Retrieving the data. During this step I`m uploading the single datasets in RStudio.
a202306 <- read_csv(“202306-divvy-tripdata.csv”)
a202307 <- read_csv(“202307-divvy-tripdata.csv”)
a202308 <- read_csv(“202308-divvy-tripdata.csv”)
a202309 <- read_csv(“202309-divvy-tripdata.csv”)
a202310 <- read_csv(“202310-divvy-tripdata.csv”)
a202311 <- read_csv(“202311-divvy-tripdata.csv”)
a202312 <- read_csv(“202312-divvy-tripdata.csv”)
a202401 <- read_csv(“202401-divvy-tripdata.csv”)
a202402 <- read_csv(“202402-divvy-tripdata.csv”)
a202403 <- read_csv(“202403-divvy-tripdata.csv”)
a202404 <- read_csv(“202404-divvy-tripdata.csv”)
a202405 <- read_csv(“202405-divvy-tripdata.csv”)
#Creating a dataframe. With this command I`m creating a new dataset merging the spreadsheets all together.
df_bikes <- bind_rows(a202306, a202307, a202308, a202309, a202310, a202311, a202312, a202401, a202402, a202403, a202404, a202405)
#Inspecting new dataset. These commands are very useful to understand the structure of the new .csv file.
glimpse(df_bikes)
str(df_bikes)
table(df_bikes$member_casual)
summary(df_bikes)
# Here I`m crating a small sample of the new dataset. Loading the whole document would take a very long amount of time.
subset_data <- head(df_bikes, 100)
view(subset_data)
#Creating a new .csv file in order to create visualizations on Tableau. This will be the big file that I will use on Tableau for visualizing the data.
write.csv(df_bikes, file = ‘bikes_file.csv’)
At the end of this process I merged the spreadsheets in a single csv. file. After that I can import this file in Tableau and evaluate the trends regarding member and casual users.
Share
In the pie chart above, we can see that over the past 12 months, the goal of the stakeholders has been to increase the percentage of members. Currently, this percentage is 64.32% of the total.
Analyzing the average ride lengths, we can see that although there are more members than casual riders, the casual riders tend to cycle longer throughout the entire year. Let’s focus on this trend.
Here we have the sum of the ride lengths for the two groups. The members appear very consistent with shorter ride lengths, while the casual riders seem to ride more during the weekends. We can appreciate this trend even more by analyzing the average in the next graph.
This graph shows the usage of classic bikes and electric bikes for both groups. Classic bikes are the most used by both groups. I need to highlight an inconsistency in the dataset. The “docked bike” group does not refer to any specific bike type. This issue must be resolved with the data entry team, as some of the bikes might be tagged in the wrong category.
Act
(In the case study, it’s not stated if this service has a phone app, but I’m considering this as a fact because it would be hard to manage this kind of activity without an in-app store to manage the subscriptions.)
Let’s explore a few strategic options that we might consider to increase our member riders population.
As we have seen before, casual members tend to ride for longer periods. A strategy we might want to consider is creating a system of cashback or coupons to be earned by members when they reach a certain number of kilometers every year. This might involve local shops in Chicago, such as coffee shops or food shops. Establishing these kinds of interactions means that the participating shops might want to display our service ads in their locations, increasing our visibility.
Another option is proposing discounts for the annual subscription, especially in the period between May and August when the casual riders population rises. This could be appealing considering the average ride length increase that we observed during this period.
My third option is a bit more complex. Recently, many sport apps track activity and provide feedbacks. This kind of service could highlight the benefits of cycling. Additionally, many of the apps I mentioned have digital prize schemes. The more you cycle, the more digital prizes you get in our app. Items like digital medals, clothes for an in-app avatar, and digital bike skins. This is the basic principle of gamification. Offering challenges and rewards could be a good approach to increase our member population.
A very similar approach is currently being used by an app called “Fantasy Hike,” which tracks steps. Fantasy Hike is a mobile application designed to motivate users to walk and exercise by turning their physical activity data into a virtual adventure. The app is themed around epic fantasy journeys inspired by the popular fantasy masterpiece “The Lord of the Rings”. Using geo-data collected by the phone, GPS, after a certain number of miles walked, it unlocks chapters of the story . You can find an article from “The Verge” discussing this app here.
Final considerations
At this point, I consider this case study finished. I just want to add a small section discussing the data used in this project. During the ‘prepare’ phase, I honestly struggled a bit with deciding which portion of the data should be considered outliers. Focusing on the ride_length column, we can see that the smallest values are under one second. This is a bit odd. Should a ride lasting less than one second be considered reliable?
On the maximum side of the values we have the same kind of issues where the highest value reaches 491 hours (1767958 seconds).
In the end, I considered these data for what they are: an exercise. They are fictional data, so I decided to include everything except values equal to zero and negative. In a real-life scenario, before starting the analysis, I would have questioned the reliability of the data and involved the data entry team to understand the true nature of these values.