This is the last course of the Google Data Analytics Certification. In this part we work through a case study based on a bike-share dataset.
The case study applies all the steps of the data analysis process covered during this journey: Ask, Prepare, Process, Analyze, Share, and Act.
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno, the director of marketing, believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
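Three questions will guide the future marketing program:
How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?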
Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently? You will produce a report with the following deliverables:
Guiding questions
What is the problem you are trying to solve?
How can your insights drive business decisions?
To prepare the data, we use Cyclistic's historical trip data, which you can download via this link. The data cover a period of one year, from 01-2022 to 12-2022. The extraction and manipulation steps are shown explicitly so that the process we follow stays clear.
NOTICE!
This data is provided according to the Divvy Data License Agreement and released on a monthly schedule.
Given the size of the Cyclistic bike-share dataset, we chose the R programming language and the RStudio IDE (Integrated Development Environment) to analyze it. The dataset is too large to handle comfortably in spreadsheets such as Google Sheets.
Overall, data processing is a critical step in turning raw data into meaningful insights that can inform decision-making and drive business success. By following a structured process for data processing, organizations can ensure that their data is accurate, complete, and actionable.
In this work we use the tidyverse, a collection of R packages (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats) for cleaning, transforming, and handling data.
First we install the packages needed for the exploratory data analysis (EDA), using the install.packages() function:
install.packages('tidyverse')
install.packages('lubridate')
install.packages('janitor')
install.packages('gt')
After installation, we load tidyverse, lubridate, janitor, and gt:
library(tidyverse)
library(lubridate)
library(janitor)
library(gt)
Then we import our data, which is stored locally in CSV (comma-separated values) format. Each file is read with the read_csv() function and assigned to its own variable.
# Read data
Jan2022 <- read_csv("../data/202201-divvy-tripdata.csv")
Feb2022 <- read_csv("../data/202202-divvy-tripdata.csv")
Mar2022 <- read_csv("../data/202203-divvy-tripdata.csv")
Apr2022 <- read_csv("../data/202204-divvy-tripdata.csv")
May2022 <- read_csv("../data/202205-divvy-tripdata.csv")
Jun2022 <- read_csv("../data/202206-divvy-tripdata.csv")
Jul2022 <- read_csv("../data/202207-divvy-tripdata.csv")
Aug2022 <- read_csv("../data/202208-divvy-tripdata.csv")
Sep2022 <- read_csv("../data/202209-divvy-publictripdata.csv")
Oct2022 <- read_csv("../data/202210-divvy-tripdata.csv")
Nov2022 <- read_csv("../data/202211-divvy-tripdata.csv")
Dec2022 <- read_csv("../data/202212-divvy-tripdata.csv")
Now it’s time to explore the data, using a few functions to see the column names and the structure of the data.
colnames(Jan2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(Feb2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(Apr2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
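Rather than comparing the printed column names by eye, the check can be automated. The sketch below is only an illustration: the helper same_columns() and the three toy data frames are made up for the example and are not part of the original analysis.

```r
# Toy stand-ins for the monthly data frames (hypothetical data).
jan <- data.frame(ride_id = "A1", member_casual = "casual")
feb <- data.frame(ride_id = "B2", member_casual = "member")
mar <- data.frame(ride_id = "C3", user_type = "member")  # deliberately different

# Hypothetical helper: TRUE only if every data frame has exactly the same
# column names, in the same order, as the first one.
same_columns <- function(dfs) {
  reference <- names(dfs[[1]])
  all(vapply(dfs, function(d) identical(names(d), reference), logical(1)))
}

same_columns(list(jan, feb))       # TRUE
same_columns(list(jan, feb, mar))  # FALSE
```

With the real data, the call would take the twelve monthly data frames, for example same_columns(list(Jan2022, Feb2022, Mar2022)). The janitor package loaded above also offers compare_df_cols() for a more detailed report.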
To see the column names together with the type of each attribute, the glimpse() function is a good choice.
glimpse(Jan2022)
## Rows: 103,770
## Columns: 13
## $ ride_id <chr> "C2F7DD78E82EC875", "A6CF8980A652D272", "BD0F91DFF7…
## $ rideable_type <chr> "electric_bike", "electric_bike", "classic_bike", "…
## $ started_at <dttm> 2022-01-13 11:59:47, 2022-01-10 08:41:56, 2022-01-…
## $ ended_at <dttm> 2022-01-13 12:02:44, 2022-01-10 08:46:17, 2022-01-…
## $ start_station_name <chr> "Glenwood Ave & Touhy Ave", "Glenwood Ave & Touhy A…
## $ start_station_id <chr> "525", "525", "TA1306000016", "KA1504000151", "TA13…
## $ end_station_name <chr> "Clark St & Touhy Ave", "Clark St & Touhy Ave", "Gr…
## $ end_station_id <chr> "RP-007", "RP-007", "TA1307000001", "TA1309000021",…
## $ start_lat <dbl> 42.01280, 42.01276, 41.92560, 41.98359, 41.87785, 4…
## $ start_lng <dbl> -87.66591, -87.66597, -87.65371, -87.66915, -87.624…
## $ end_lat <dbl> 42.01256, 42.01256, 41.92533, 41.96151, 41.88462, 4…
## $ end_lng <dbl> -87.67437, -87.67437, -87.66580, -87.67139, -87.627…
## $ member_casual <chr> "casual", "casual", "member", "casual", "member", "…
glimpse(Feb2022)
## Rows: 115,609
## Columns: 13
## $ ride_id <chr> "E1E065E7ED285C02", "1602DCDC5B30FFE3", "BE7DD2AF4B…
## $ rideable_type <chr> "classic_bike", "classic_bike", "classic_bike", "cl…
## $ started_at <dttm> 2022-02-19 18:08:41, 2022-02-20 17:41:30, 2022-02-…
## $ ended_at <dttm> 2022-02-19 18:23:56, 2022-02-20 17:45:56, 2022-02-…
## $ start_station_name <chr> "State St & Randolph St", "Halsted St & Wrightwood …
## $ start_station_id <chr> "TA1305000029", "TA1309000061", "TA1305000029", "13…
## $ end_station_name <chr> "Clark St & Lincoln Ave", "Southport Ave & Wrightwo…
## $ end_station_id <chr> "13179", "TA1307000113", "13011", "13323", "TA13070…
## $ start_lat <dbl> 41.88462, 41.92914, 41.88462, 41.94815, 41.88462, 4…
## $ start_lng <dbl> -87.62783, -87.64908, -87.62783, -87.66394, -87.627…
## $ end_lat <dbl> 41.91569, 41.92877, 41.87926, 41.95283, 41.88584, 4…
## $ end_lng <dbl> -87.63460, -87.66391, -87.63990, -87.64999, -87.635…
## $ member_casual <chr> "member", "member", "member", "member", "member", "…
NOTICE!
In the same way that we use glimpse(), it is also possible to use colnames() or str() to get an overview of a DataFrame.
These functions show the data together with their characteristics.
We can also use the head() function to see the first
rows of the data table. The head() function can be particularly useful
when working with large data sets, as it allows you to quickly get a
sense of the structure and content of the data without having to view
the entire data set.
print(head(Jan2022))
## # A tibble: 6 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 C2F7DD78E82EC… electr… 2022-01-13 11:59:47 2022-01-13 12:02:44 Glenwo… 525
## 2 A6CF8980A652D… electr… 2022-01-10 08:41:56 2022-01-10 08:46:17 Glenwo… 525
## 3 BD0F91DFF741C… classi… 2022-01-25 04:53:40 2022-01-25 04:58:01 Sheffi… TA1306…
## 4 CBB80ED419105… classi… 2022-01-04 00:18:04 2022-01-04 00:33:00 Clark … KA1504…
## 5 DDC963BFDDA51… classi… 2022-01-20 01:31:10 2022-01-20 01:37:12 Michig… TA1309…
## 6 A39C6F6CC0586… classi… 2022-01-11 18:48:09 2022-01-11 18:51:31 Wood S… 637
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## # start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## # member_casual <chr>, and abbreviated variable names ¹​rideable_type,
## # ²​start_station_name, ³​start_station_id
All of these DataFrames have the same columns, so we can now merge them. The bind_rows() function combines multiple DataFrames row-wise (i.e., it stacks them on top of each other). It is part of the dplyr package, which is a popular package for data manipulation and analysis in R.
# Use bind_rows() to combine all DataFrames into one
bike_df <- bind_rows(Jan2022, Feb2022, Mar2022, Apr2022, May2022, Jun2022, Jul2022, Aug2022, Sep2022, Oct2022, Nov2022, Dec2022)
We can again use the head() function, this time to view the first six rows of the merged data frame.
# Use head() to look at the merged data frame
print(head(bike_df))
## # A tibble: 6 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 C2F7DD78E82EC… electr… 2022-01-13 11:59:47 2022-01-13 12:02:44 Glenwo… 525
## 2 A6CF8980A652D… electr… 2022-01-10 08:41:56 2022-01-10 08:46:17 Glenwo… 525
## 3 BD0F91DFF741C… classi… 2022-01-25 04:53:40 2022-01-25 04:58:01 Sheffi… TA1306…
## 4 CBB80ED419105… classi… 2022-01-04 00:18:04 2022-01-04 00:33:00 Clark … KA1504…
## 5 DDC963BFDDA51… classi… 2022-01-20 01:31:10 2022-01-20 01:37:12 Michig… TA1309…
## 6 A39C6F6CC0586… classi… 2022-01-11 18:48:09 2022-01-11 18:51:31 Wood S… 637
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## # start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## # member_casual <chr>, and abbreviated variable names ¹​rideable_type,
## # ²​start_station_name, ³​start_station_id
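The twelve read_csv() calls and the bind_rows() step can also be collapsed into one helper. This is a minimal sketch, assuming all monthly files sit in one folder and follow the naming seen above; read_trip_files() is a hypothetical name introduced for the example.

```r
library(readr)
library(dplyr)

# Hypothetical helper: read every Divvy trip CSV in `data_dir` and stack them.
read_trip_files <- function(data_dir) {
  # Matches both "...-divvy-tripdata.csv" and "...-divvy-publictripdata.csv".
  files <- list.files(data_dir, pattern = "divvy.*tripdata\\.csv$",
                      full.names = TRUE)
  # File names start with YYYYMM, so sorting keeps the months in calendar order.
  files <- sort(files)
  monthly <- lapply(files, read_csv, show_col_types = FALSE)
  bind_rows(monthly)
}

# Usage with the folder from this case study:
# bike_df <- read_trip_files("../data")
```

The result should match the manual approach as long as every file has the same column types; if one month is parsed differently, bind_rows() stops with an error, which is a useful early warning.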
If you need to view the data in an interactive table, the DT package is available. DT is a popular R package that provides an interface to the JavaScript library DataTables. DataTables is a powerful library for creating interactive and customizable tables in web pages, and the DT package allows you to easily create and manipulate DataTables within R.
# Show an interactive table using the DT package; the pipe operator chains the functions.
library(DT)
bike_df %>%
  head(5) %>%
  datatable()
| ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C2F7DD78E82EC875 | electric_bike | 2022-01-13 11:59:47 | 2022-01-13 12:02:44 | Glenwood Ave & Touhy Ave | 525 | Clark St & Touhy Ave | RP-007 | 42.01280 | -87.66591 | 42.01256 | -87.67437 | casual |
| A6CF8980A652D272 | electric_bike | 2022-01-10 08:41:56 | 2022-01-10 08:46:17 | Glenwood Ave & Touhy Ave | 525 | Clark St & Touhy Ave | RP-007 | 42.01276 | -87.66597 | 42.01256 | -87.67437 | casual |
| BD0F91DFF741C66D | classic_bike | 2022-01-25 04:53:40 | 2022-01-25 04:58:01 | Sheffield Ave & Fullerton Ave | TA1306000016 | Greenview Ave & Fullerton Ave | TA1307000001 | 41.92560 | -87.65371 | 41.92533 | -87.66580 | member |
| CBB80ED419105406 | classic_bike | 2022-01-04 00:18:04 | 2022-01-04 00:33:00 | Clark St & Bryn Mawr Ave | KA1504000151 | Paulina St & Montrose Ave | TA1309000021 | 41.98359 | -87.66915 | 41.96151 | -87.67139 | casual |
| DDC963BFDDA51EEA | classic_bike | 2022-01-20 01:31:10 | 2022-01-20 01:37:12 | Michigan Ave & Jackson Blvd | TA1309000002 | State St & Randolph St | TA1305000029 | 41.87785 | -87.62408 | 41.88462 | -87.62783 | member |
The following code chunks are used for the ‘Process’ phase on bike_df.
# Check the merged data frame
colnames(bike_df) # List of column names
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
head(bike_df) # See the first 6 rows of the data frame. See also tail(bike_df)
## # A tibble: 6 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 C2F7DD78E82EC… electr… 2022-01-13 11:59:47 2022-01-13 12:02:44 Glenwo… 525
## 2 A6CF8980A652D… electr… 2022-01-10 08:41:56 2022-01-10 08:46:17 Glenwo… 525
## 3 BD0F91DFF741C… classi… 2022-01-25 04:53:40 2022-01-25 04:58:01 Sheffi… TA1306…
## 4 CBB80ED419105… classi… 2022-01-04 00:18:04 2022-01-04 00:33:00 Clark … KA1504…
## 5 DDC963BFDDA51… classi… 2022-01-20 01:31:10 2022-01-20 01:37:12 Michig… TA1309…
## 6 A39C6F6CC0586… classi… 2022-01-11 18:48:09 2022-01-11 18:51:31 Wood S… 637
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## # start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## # member_casual <chr>, and abbreviated variable names ¹​rideable_type,
## # ²​start_station_name, ³​start_station_id
str(bike_df) # See the list of columns and data types (numeric, character, etc.)
## spc_tbl_ [5,667,717 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5667717] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
## $ rideable_type : chr [1:5667717] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:5667717], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
## $ ended_at : POSIXct[1:5667717], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
## $ start_station_name: chr [1:5667717] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
## $ start_station_id : chr [1:5667717] "525" "525" "TA1306000016" "KA1504000151" ...
## $ end_station_name : chr [1:5667717] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
## $ end_station_id : chr [1:5667717] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
## $ start_lat : num [1:5667717] 42 42 41.9 42 41.9 ...
## $ start_lng : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num [1:5667717] 42 42 41.9 42 41.9 ...
## $ end_lng : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ member_casual : chr [1:5667717] "casual" "casual" "member" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(bike_df) # Statistical summary of the data, mainly for numeric columns
## ride_id rideable_type started_at
## Length:5667717 Length:5667717 Min. :2022-01-01 00:00:05.00
## Class :character Class :character 1st Qu.:2022-05-28 19:21:05.00
## Mode :character Mode :character Median :2022-07-22 15:03:59.00
## Mean :2022-07-20 07:21:18.74
## 3rd Qu.:2022-09-16 07:21:29.00
## Max. :2022-12-31 23:59:26.00
##
## ended_at start_station_name start_station_id
## Min. :2022-01-01 00:01:48.00 Length:5667717 Length:5667717
## 1st Qu.:2022-05-28 19:43:07.00 Class :character Class :character
## Median :2022-07-22 15:24:44.00 Mode :character Mode :character
## Mean :2022-07-20 07:40:45.33
## 3rd Qu.:2022-09-16 07:39:03.00
## Max. :2023-01-02 04:56:45.00
##
## end_station_name end_station_id start_lat start_lng
## Length:5667717 Length:5667717 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80
##
## end_lat end_lng member_casual
## Min. : 0.00 Min. :-88.14 Length:5667717
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :41.90 Median :-87.64 Mode :character
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.37 Max. : 0.00
## NA's :5858 NA's :5858
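The summary shows 5858 missing values in end_lat and end_lng, and also a minimum end_lat and a maximum end_lng of 0.00, which cannot be real Chicago coordinates. A quick way to count both kinds of problem is sketched below on a made-up data frame named trips; with the real data, bike_df would take its place.

```r
# Toy stand-in for bike_df with the same coordinate columns (values are made up).
trips <- data.frame(
  end_lat = c(41.90, NA, 0, 41.88),
  end_lng = c(-87.65, NA, 0, -87.63)
)

# Missing values per column.
na_counts <- colSums(is.na(trips))

# Rows whose end point is recorded as (0, 0). %in% treats NA as "no match",
# so rows with missing coordinates are not counted here.
zero_coords <- sum(trips$end_lat %in% 0 & trips$end_lng %in% 0)

na_counts    # one NA in each column of the toy data
zero_coords  # one (0, 0) row in the toy data
```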
# Add year, month, day, day-of-week, ride length, and start-hour columns
bike_df <- bike_df %>%
  mutate(year = format(as.Date(started_at), "%Y")) %>% # extract year
  mutate(month = format(as.Date(started_at), "%B")) %>% # extract month name (depends on the session locale)
  mutate(date = format(as.Date(started_at), "%d")) %>% # extract day of the month
  mutate(day_of_week = format(as.Date(started_at), "%A")) %>% # extract day of week (depends on the session locale)
  mutate(ride_length = difftime(ended_at, started_at, units = "secs")) %>% # ride length, explicitly in seconds
  mutate(start_time = strftime(started_at, "%H")) # extract start hour
# Convert 'ride_length' to numeric so it can be used in calculations
bike_df <- bike_df %>%
  mutate(ride_length = as.numeric(ride_length))
is.numeric(bike_df$ride_length) # check that the conversion worked
## [1] TRUE
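The unit of ride_length matters for every later average. As a small worked example, here is the first ride of the January file (started 11:59:47, ended 12:02:44) computed with explicit units. The UTC time zone is an assumption made only so the example runs the same way everywhere.

```r
start <- as.POSIXct("2022-01-13 11:59:47", tz = "UTC")
end   <- as.POSIXct("2022-01-13 12:02:44", tz = "UTC")

# Explicit units keep the result independent of the values involved;
# without `units`, difftime() picks a unit automatically.
ride_secs <- as.numeric(difftime(end, start, units = "secs"))
ride_mins <- as.numeric(difftime(end, start, units = "mins"))

ride_secs  # 177
ride_mins  # 2.95
```

So the ride_length values in the summaries that follow (for example a median of 617) are in seconds, a little over ten minutes.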
# adding ride distance in km
library(geosphere)
bike_df$ride_distance <- distGeo(matrix(c(bike_df$start_lng, bike_df$start_lat), ncol = 2), matrix(c(bike_df$end_lng, bike_df$end_lat), ncol = 2))
bike_df$ride_distance <- bike_df$ride_distance / 1000 # distGeo() returns metres; convert to km
# Clean data: remove rides where ride_length is zero or negative
# (for example, bikes taken out of docks and checked for quality by Divvy)
bike_df_clean <- bike_df[!(bike_df$ride_length <= 0), ]
Analyzing data involves the use of statistical and computational techniques to extract insights and knowledge from data. The goal of data analysis is to identify patterns, trends, relationships, and anomalies in the data that can inform decision-making and drive business outcomes.
# Show summary data
summary(bike_df_clean)
## ride_id rideable_type started_at
## Length:5667186 Length:5667186 Min. :2022-01-01 00:00:05.00
## Class :character Class :character 1st Qu.:2022-05-28 19:20:00.00
## Mode :character Mode :character Median :2022-07-22 15:01:59.50
## Mean :2022-07-20 07:19:14.76
## 3rd Qu.:2022-09-16 07:18:50.75
## Max. :2022-12-31 23:59:26.00
##
## ended_at start_station_name start_station_id
## Min. :2022-01-01 00:01:48.00 Length:5667186 Length:5667186
## 1st Qu.:2022-05-28 19:41:54.25 Class :character Class :character
## Median :2022-07-22 15:22:49.00 Mode :character Mode :character
## Mean :2022-07-20 07:38:41.61
## 3rd Qu.:2022-09-16 07:36:10.75
## Max. :2023-01-02 04:56:45.00
##
## end_station_name end_station_id start_lat start_lng
## Length:5667186 Length:5667186 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80
##
## end_lat end_lng member_casual year
## Min. : 0.00 Min. :-88.14 Length:5667186 Length:5667186
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character Class :character
## Median :41.90 Median :-87.64 Mode :character Mode :character
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.37 Max. : 0.00
## NA's :5858 NA's :5858
## month date day_of_week ride_length
## Length:5667186 Length:5667186 Length:5667186 Min. : 1
## Class :character Class :character Class :character 1st Qu.: 349
## Mode :character Mode :character Mode :character Median : 617
## Mean : 1167
## 3rd Qu.: 1108
## Max. :2483235
##
## start_time ride_distance
## Length:5667186 Min. : 0.000
## Class :character 1st Qu.: 0.873
## Mode :character Median : 1.575
## Mean : 2.140
## 3rd Qu.: 2.781
## Max. :9817.319
## NA's :5858
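Two values in this summary look implausible: a maximum ride_length of 2483235 seconds (almost 29 days) and a maximum ride_distance of 9817 km. The sketch below shows one way to flag such rows before trusting the averages. The thresholds max_secs and max_km are assumptions chosen only for illustration, and trips is a made-up stand-in for bike_df_clean.

```r
library(dplyr)

# Toy stand-in for bike_df_clean (made-up values, same column names).
trips <- tibble(
  ride_id       = c("r1", "r2", "r3", "r4"),
  ride_length   = c(617, 2483235, 900, 1200),  # seconds
  ride_distance = c(1.6, 2.1, 9817.3, NA)      # kilometres
)

# Assumed thresholds, for illustration only.
max_secs <- 24 * 60 * 60  # one day
max_km   <- 50

flagged <- trips %>%
  mutate(
    too_long = ride_length > max_secs,
    too_far  = !is.na(ride_distance) & ride_distance > max_km
  ) %>%
  filter(too_long | too_far)

flagged$ride_id  # "r2" "r3"
```

Whether flagged rides should be dropped or only reported is a judgment call; either way, the median is less sensitive to them than the mean.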
Next we conduct a descriptive analysis, running a few calculations to get a better sense of the data layout.
bike_df_clean %>%
  summarise(average_ride_length = mean(ride_length), median_length = median(ride_length),
            max_ride_length = max(ride_length)) %>%
  gt()
| average_ride_length | median_length | max_ride_length |
|---|---|---|
| 1166.846 | 617 | 2483235 |
bike_df_clean %>%
  group_by(member_casual) %>%
  summarise(rides = length(ride_id),
            ride_pct = (length(ride_id) / nrow(bike_df_clean)) * 100) %>%
  gt()
| member_casual | rides | ride_pct |
|---|---|---|
| casual | 2321769 | 40.96864 |
| member | 3345417 | 59.03136 |
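The percentages in this table can be reproduced directly from the two ride counts, which is a handy sanity check. A minimal sketch in base R:

```r
# Ride counts taken from the table above.
rides <- c(casual = 2321769, member = 3345417)

# Share of each group in the total, in percent, rounded to two decimals.
ride_pct <- round(100 * prop.table(rides), 2)

ride_pct  # casual 40.97, member 59.03
```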
bike_df_clean %>%
  group_by(rows = member_casual) %>%
  summarise(Values = mean(ride_length)) %>%
  ggplot(aes(x = rows, y = Values, fill = rows)) +
  geom_col() +
  scale_fill_viridis_d() +
  labs(title = "Average ride length",
       y = "Average ride length (seconds)",
       x = "Member type",
       subtitle = "Average ride length for casual riders and members") +
  theme_minimal()
In 2022, members made about 59% of all rides and casual riders about 41%, yet casual riders made the longest journeys on average.
bike_df_clean %>%
group_by(columns = day_of_week, Rows = member_casual) %>%
summarise(Values = mean(ride_length), .groups='drop') %>%
arrange(Rows, columns) %>%
ggplot(mapping = aes(x = columns, y = Values, fill = Rows)) +
geom_col(width=0.5, position = position_dodge(width=0.5)) +
scale_fill_viridis_d() +
labs(title = "Average of Ride Length",
subtitle = "the average ride length for users by day of week",
x = "Days of Week",
y = "Ride Length Average") +
theme_minimal()
On every day of the week, casual riders have the longest rides on average.
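Because day_of_week is stored as text, ggplot2 places the days in alphabetical order on the axis. One possible fix, sketched here with base R only, is to turn the column into an ordered factor whose levels run from Monday to Sunday. The variable names week_levels and started are introduced just for this example.

```r
# 2022-01-03 was a Monday, so these seven dates give Monday..Sunday,
# spelled in whatever locale the R session uses.
week_levels <- weekdays(as.Date("2022-01-03") + 0:6)

# Three start dates from January 2022: a Thursday, a Monday, and a Sunday.
started <- as.Date(c("2022-01-13", "2022-01-10", "2022-01-16"))
day_of_week <- factor(weekdays(started), levels = week_levels, ordered = TRUE)

levels(day_of_week)      # Monday first, Sunday last
as.integer(day_of_week)  # 4 1 7
```

Applied to the real data, the same factor() call on bike_df_clean$day_of_week with levels = week_levels would make the plots follow the calendar week.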
bike_df_clean %>%
group_by(columns = day_of_week) %>%
summarise(Values = length(ride_id)) %>%
ggplot(mapping = aes(x = reorder(columns, -Values), y = Values,
fill = columns)) +
scale_fill_viridis_d() +
geom_col(show.legend = FALSE) +
#theme(legend.position="none") +
labs(title = "Number of Rides",
subtitle = "number of rides for users by day of week",
x = "Day of Week",
y = "Number of Rides",
caption = "Cyclistic trip data") +
theme_minimal()
Bike use is highest at the weekend, and Saturday records the greatest number of rides.
bike_df_clean %>%
group_by(Days = day_of_week, members = member_casual) %>%
summarise(Values = length(rideable_type),.groups='drop') %>%
ggplot(mapping = aes(x = Days , y = Values,
fill = members)) +
scale_fill_viridis_d() +
geom_col(width=0.5, position = position_dodge(width=0.5)) +
#theme(legend.position="none") +
labs(title = "Rides by Day and User Type",
subtitle = "number of rides per user type per day",
x = "Days",
y = "Number of rides",
caption = "Cyclistic trip data") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
On the other hand, if we focus on the number of rides per day, we can clearly see that members use the bikes more than casual riders.
Calculate the type of bike used by the different users
bike_df_clean %>%
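# group only by user type and bike type, so each bar shows the yearly total
# (grouping by day as well would draw seven overlapping bars per position)
group_by(members = member_casual, type = rideable_type) %>%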
summarise(Values = length(rideable_type),.groups='drop') %>%
ggplot(mapping = aes(x = type , y = Values,
fill = members)) +
scale_fill_viridis_d() +
geom_col(width=0.5, position = position_dodge(width=0.5)) +
#theme(legend.position="none") +
labs(title = "Rideable Type",
subtitle = "number and type of bicycles used per user",
x = "Types of Bicycles",
y = "Number of Rides per Bicycle Type",
caption = "Cyclistic trip data") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
Casual riders use three types of bike, whereas members use only two.
NOTICE!
The type called Docked Bike is used only by casual riders.
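This observation can be verified with a simple cross-tabulation of bike type against user type. The sketch below uses a tiny made-up data frame named trips; with the real data, the two columns of bike_df_clean would be passed to table() instead.

```r
# Toy stand-in for bike_df_clean (made-up rows, same column names).
trips <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member"),
  rideable_type = c("docked_bike", "classic_bike", "electric_bike",
                    "classic_bike", "electric_bike")
)

# Rows are bike types, columns are user types, cells are ride counts.
type_by_user <- table(trips$rideable_type, trips$member_casual)
type_by_user

# Bike types that members never used in this toy data.
names(which(type_by_user[, "member"] == 0))  # "docked_bike"
```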
# Put the months in calendar order. The month names are generated in the session's locale,
# so they match the values produced earlier by format(..., "%B")
bike_df_clean$month <- ordered(bike_df_clean$month,
                               levels = format(ISOdate(2022, 1:12, 1), "%B"))
bike_df_clean %>%
group_by(month = month, members = member_casual) %>%
summarise(Values = mean(ride_length),.groups='drop') %>%
ggplot(mapping = aes(x = month , y = Values,
fill = members)) +
scale_fill_viridis_d() +
geom_col(width=0.5, position = position_dodge(width=0.5)) +
#theme(legend.position="none") +
labs(title = "Average Ride Length",
subtitle = "average ride length for members and casual riders by month",
x = "Months",
y = "Average ride length",
caption = "Cyclistic trip data") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
As with the days of the week, casual riders have longer average rides than members in every month of the year.
Create Maps for Geographic Visualization
Given the size of the dataset, we map only the casual riders and members for the month of March, comparing ride length between the two groups of users for trips that start at the DuSable Lake Shore Dr & Monroe St station (start_station_name).
For this analysis we use the mapview package.
Mapview is an R package that provides an interactive and
easy-to-use interface for visualizing spatial data on interactive maps.
The package is built on top of the leaflet JavaScript library, which
allows for the creation of interactive and customizable web maps.
# Load the two geospatial packages used to map the data
library(mapview)
library(leafsync)
Use the filter() function to extract the data for casual riders, restricted by month, type of bicycle, and start station name:
casual_electric_SDGA <- bike_df_clean %>%
  # select March by month number so the filter does not depend on the locale
  filter(member_casual == "casual" & format(as.Date(started_at), "%m") == "03" & rideable_type == "electric_bike" &
           start_station_name == "DuSable Lake Shore Dr & Monroe St") %>%
  select(member_casual, Longitude = start_lng, Latitude = start_lat, rideable_type, ride_length)
Use the filter() function to extract the same data for members:
member_electric_SDGA <- bike_df_clean %>%
  # select March by month number so the filter does not depend on the locale
  filter(member_casual == "member" & format(as.Date(started_at), "%m") == "03" & rideable_type == "electric_bike" &
           start_station_name == "DuSable Lake Shore Dr & Monroe St") %>%
  select(member_casual, Longitude = start_lng, Latitude = start_lat, rideable_type, ride_length)
casual <- mapview(casual_electric_SDGA, xcol = "Longitude",
ycol = "Latitude",
crs = 4326, grid = FALSE,cex = "ride_length",
zcol = "ride_length",
#col.regions = "tomato",
layer.name = "Casual",
zoom = 19,
map.types = "CartoDB.DarkMatter") # dark basemap
members <- mapview(member_electric_SDGA, xcol = "Longitude",
ycol = "Latitude",
crs = 4326, grid = FALSE,cex = "ride_length",
zcol = "ride_length",
#col.regions = "tomato",
layer.name = "Members",
zoom = 19,
map.types = "CartoDB.DarkMatter") # dark basemap
sync(casual, members)
The analysis for the month of March suggests that casual riders weigh more heavily than members at this station. The two maps are synchronized so that both groups can be compared side by side.
Now it’s time to share the results with stakeholders.
After the analysis and sharing phases, the findings of this work are:
Casual riders seem to enjoy adventurous rides more than members do, perhaps out of a simple desire to cycle.
Casual riders account for about 41% of all rides in 2022 and members for about 59%, yet in every month the average ride of a casual rider is longer than that of a member.
Overall, the trips made by casual riders are longer than those made by members.
For the recommendations:
Advertising campaigns could add value by helping to increase the number of annual members.
Run as many promotions as possible to encourage casual riders to join as annual members.
During the high-intensity months (March, April, May, June), make a good impression through customer service in order to attract more members.
This was an interesting case study based on the bicycle dataset. For this work we used tools such as the R programming language, spreadsheets, and Tableau Public to share our findings. Thanks a lot.
NOTICE!
The dashboard can be viewed using these links: Portfolio; Tableau_Public