0.1 Introduction:

This is the last course of the Google Data Analytics Certification. In this part we work through a case study based on a bike-share dataset, applying all the data analysis steps covered during the program.

The case study follows the phases of data analysis: Ask, Prepare, Process, Analyze, Share, and Act.

0.1.1 Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

0.1.2 About the company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno, the director of marketing, believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

0.1.3 Ask

Three questions will guide the future marketing program:

  • How do annual members and casual riders use Cyclistic bikes differently?
  • Why would casual riders buy Cyclistic annual memberships?
  • How can Cyclistic use digital media to influence casual riders to become members?

Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently? You will produce a report with the following deliverables:

  • A clear statement of the business task
  • A description of all data sources used
  • Documentation of any cleaning or manipulation of data
  • A summary of your analysis
  • Supporting visualizations and key findings
  • Your top three recommendations based on your analysis

Guiding questions

What is the problem you are trying to solve?

  • In this case, we try to determine how the bicycle dataset can help the company increase the number of annual members. The stakeholders want to know how annual members and casual riders differ, and how digital marketing can be used to promote the resulting marketing tactics.

How can your insights drive business decisions?

  • The historical bike trip data are a collection of facts that lets stakeholders base their decisions on evidence. Overall, data-driven decision-making can help businesses make more informed and objective decisions, reduce bias, and optimize business operations. By leveraging data analysis tools and techniques, businesses can gain insights that would otherwise be difficult or impossible to obtain and make decisions that are based on empirical evidence rather than gut instinct.

0.1.4 Prepare

To prepare the data, we use the Cyclistic historical trip data, which you can download via this link. The data cover a period of one year, from 01-2022 to 12-2022. The data are extracted and manipulated step by step so that the process we follow stays clear.

NOTICE!

This data is provided according to the Divvy Data License Agreement and released on a monthly schedule.

Given the size of the Cyclistic bike-share dataset, we analyze it with the R programming language and the RStudio IDE (Integrated Development Environment) rather than with a spreadsheet such as Google Sheets.

0.1.5 Process

Overall, data processing is a critical step in turning raw data into meaningful insights that can inform decision-making and drive business success. By following a structured process for data processing, organizations can ensure that their data is accurate, complete, and actionable.

In this work we use the tidyverse, a collection of R packages (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr and forcats) for cleaning, transforming, and handling data.

First we install the packages needed for the exploratory data analysis (EDA), using the install.packages() function:

  • install.packages('tidyverse')
  • install.packages('lubridate')
  • install.packages('janitor')
  • install.packages('gt')

After installation, we load tidyverse, lubridate, janitor, and gt:

library(tidyverse)
library(lubridate)
library(janitor)
library(gt)

Then we import the data, which are stored on our machine in CSV (Comma-Separated Values) format. Each file is read with the read_csv() function and assigned to a new variable.

# Read data
Jan2022 <- read_csv("../data/202201-divvy-tripdata.csv")
Feb2022 <- read_csv("../data/202202-divvy-tripdata.csv")
Mar2022 <- read_csv("../data/202203-divvy-tripdata.csv")
Apr2022 <- read_csv("../data/202204-divvy-tripdata.csv")
May2022 <- read_csv("../data/202205-divvy-tripdata.csv")
Jun2022 <- read_csv("../data/202206-divvy-tripdata.csv")
Jul2022 <- read_csv("../data/202207-divvy-tripdata.csv")
Aug2022 <- read_csv("../data/202208-divvy-tripdata.csv")
Sep2022 <- read_csv("../data/202209-divvy-publictripdata.csv")
Oct2022 <- read_csv("../data/202210-divvy-tripdata.csv")
Nov2022 <- read_csv("../data/202211-divvy-tripdata.csv")
Dec2022 <- read_csv("../data/202212-divvy-tripdata.csv")
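
The twelve calls above can also be written as a single step. The following is a minimal sketch, assuming all monthly files sit in one folder and share the same columns. To stay runnable on its own it writes two tiny stand-in files to a temporary folder; the folder name, file names, and contents are made up for illustration and are not part of the Divvy data.

```r
library(readr)

# Hypothetical stand-in files so the sketch is self-contained;
# in the case study these would be the twelve monthly CSV files in ../data/.
data_dir <- file.path(tempdir(), "divvy_demo")
dir.create(data_dir, showWarnings = FALSE)
write_csv(data.frame(ride_id = c("A1", "A2"), member_casual = c("casual", "member")),
          file.path(data_dir, "202201-demo-tripdata.csv"))
write_csv(data.frame(ride_id = "B1", member_casual = "member"),
          file.path(data_dir, "202202-demo-tripdata.csv"))

# List every CSV file in the folder, sorted by name (and therefore by month)
csv_files <- sort(list.files(data_dir, pattern = "\\.csv$", full.names = TRUE))

# read_csv() accepts a vector of paths and stacks the files row-wise;
# id = "source_file" records which file each row came from
all_trips <- read_csv(csv_files, id = "source_file", show_col_types = FALSE)

nrow(all_trips)
```

Reading a vector of paths requires readr 2.0 or later. The separate monthly variables used in the rest of this document remain convenient when each month needs to be inspected on its own.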

Now it is time to explore the data, using functions that show the column names and the structure of each table.

colnames(Jan2022)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(Feb2022)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(Apr2022)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
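
Comparing the printed column names by eye works for a few tables, but the check can also be automated. The sketch below is one possible way to do it in base R; the three small data frames are hypothetical stand-ins for the monthly tables, and the third one deliberately uses a different column name.

```r
# Toy data frames standing in for the monthly tables (hypothetical contents)
jan <- data.frame(ride_id = "A1", started_at = "2022-01-13 11:59:47", member_casual = "casual")
feb <- data.frame(ride_id = "B1", started_at = "2022-02-19 18:08:41", member_casual = "member")
mar <- data.frame(ride_id = "C1", start_time = "2022-03-01 08:00:00", member_casual = "member")

monthly <- list(Jan = jan, Feb = feb, Mar = mar)

# Compare every table's column names with those of the first table
reference_cols <- colnames(monthly[[1]])
same_cols <- vapply(monthly, function(df) identical(colnames(df), reference_cols), logical(1))
same_cols

# Names of the tables whose columns differ from the reference
names(same_cols)[!same_cols]
```

If the last expression returns an empty vector, all tables can be stacked safely.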

To see the column names together with the type of each attribute, the glimpse() function is a good choice.

glimpse(Jan2022)
## Rows: 103,770
## Columns: 13
## $ ride_id            <chr> "C2F7DD78E82EC875", "A6CF8980A652D272", "BD0F91DFF7…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "classic_bike", "…
## $ started_at         <dttm> 2022-01-13 11:59:47, 2022-01-10 08:41:56, 2022-01-…
## $ ended_at           <dttm> 2022-01-13 12:02:44, 2022-01-10 08:46:17, 2022-01-…
## $ start_station_name <chr> "Glenwood Ave & Touhy Ave", "Glenwood Ave & Touhy A…
## $ start_station_id   <chr> "525", "525", "TA1306000016", "KA1504000151", "TA13…
## $ end_station_name   <chr> "Clark St & Touhy Ave", "Clark St & Touhy Ave", "Gr…
## $ end_station_id     <chr> "RP-007", "RP-007", "TA1307000001", "TA1309000021",…
## $ start_lat          <dbl> 42.01280, 42.01276, 41.92560, 41.98359, 41.87785, 4…
## $ start_lng          <dbl> -87.66591, -87.66597, -87.65371, -87.66915, -87.624…
## $ end_lat            <dbl> 42.01256, 42.01256, 41.92533, 41.96151, 41.88462, 4…
## $ end_lng            <dbl> -87.67437, -87.67437, -87.66580, -87.67139, -87.627…
## $ member_casual      <chr> "casual", "casual", "member", "casual", "member", "…
glimpse(Feb2022)
## Rows: 115,609
## Columns: 13
## $ ride_id            <chr> "E1E065E7ED285C02", "1602DCDC5B30FFE3", "BE7DD2AF4B…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "classic_bike", "cl…
## $ started_at         <dttm> 2022-02-19 18:08:41, 2022-02-20 17:41:30, 2022-02-…
## $ ended_at           <dttm> 2022-02-19 18:23:56, 2022-02-20 17:45:56, 2022-02-…
## $ start_station_name <chr> "State St & Randolph St", "Halsted St & Wrightwood …
## $ start_station_id   <chr> "TA1305000029", "TA1309000061", "TA1305000029", "13…
## $ end_station_name   <chr> "Clark St & Lincoln Ave", "Southport Ave & Wrightwo…
## $ end_station_id     <chr> "13179", "TA1307000113", "13011", "13323", "TA13070…
## $ start_lat          <dbl> 41.88462, 41.92914, 41.88462, 41.94815, 41.88462, 4…
## $ start_lng          <dbl> -87.62783, -87.64908, -87.62783, -87.66394, -87.627…
## $ end_lat            <dbl> 41.91569, 41.92877, 41.87926, 41.95283, 41.88584, 4…
## $ end_lng            <dbl> -87.63460, -87.66391, -87.63990, -87.64999, -87.635…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…

NOTICE!

Just as we use the glimpse() function, we can also use colnames() or str() to get an overview of our DataFrame.

These functions show the columns of the data together with their characteristics.

We can also use the head() function to see the first rows of the data table. The head() function can be particularly useful when working with large data sets, as it allows you to quickly get a sense of the structure and content of the data without having to view the entire data set.

print(head(Jan2022))
## # A tibble: 6 × 13
##   ride_id        ridea…¹ started_at          ended_at            start…² start…³
##   <chr>          <chr>   <dttm>              <dttm>              <chr>   <chr>  
## 1 C2F7DD78E82EC… electr… 2022-01-13 11:59:47 2022-01-13 12:02:44 Glenwo… 525    
## 2 A6CF8980A652D… electr… 2022-01-10 08:41:56 2022-01-10 08:46:17 Glenwo… 525    
## 3 BD0F91DFF741C… classi… 2022-01-25 04:53:40 2022-01-25 04:58:01 Sheffi… TA1306…
## 4 CBB80ED419105… classi… 2022-01-04 00:18:04 2022-01-04 00:33:00 Clark … KA1504…
## 5 DDC963BFDDA51… classi… 2022-01-20 01:31:10 2022-01-20 01:37:12 Michig… TA1309…
## 6 A39C6F6CC0586… classi… 2022-01-11 18:48:09 2022-01-11 18:51:31 Wood S… 637    
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## #   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## #   member_casual <chr>, and abbreviated variable names ¹​rideable_type,
## #   ²​start_station_name, ³​start_station_id

The DataFrames inspected above have the same columns, so it is useful to merge them. The bind_rows() function combines multiple DataFrames row-wise (i.e., stacking them on top of each other). It is part of the dplyr package, which is a popular package for data manipulation and analysis in R.

# Use bind_rows() to combine all DataFrames into one
bike_df <- bind_rows(Jan2022, Feb2022, Mar2022, Apr2022, May2022, Jun2022, Jul2022, Aug2022, Sep2022, Oct2022, Nov2022, Dec2022)
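
As a small illustration of how bind_rows() stacks tables, the sketch below combines two toy tables (hypothetical contents) and checks that no rows were lost. The `.id` column and its name `source_month` are additions for this example only and are not used elsewhere in this document.

```r
library(dplyr)

# Toy monthly tables (hypothetical contents)
jan <- tibble(ride_id = c("A1", "A2"), member_casual = c("casual", "member"))
feb <- tibble(ride_id = "B1", member_casual = "member")

# Passing a named list with .id adds a column holding each table's name
combined <- bind_rows(list(Jan2022 = jan, Feb2022 = feb), .id = "source_month")

# The combined table should have exactly as many rows as its parts together
stopifnot(nrow(combined) == nrow(jan) + nrow(feb))
combined
```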

We can again use the head() function to see the first six rows. The head() function is a built-in function in R that is used to view the first few rows of a data frame or matrix.

# Use head() to look at the merged data frame
print(head(bike_df))
## # A tibble: 6 × 13
##   ride_id        ridea…¹ started_at          ended_at            start…² start…³
##   <chr>          <chr>   <dttm>              <dttm>              <chr>   <chr>  
## 1 C2F7DD78E82EC… electr… 2022-01-13 11:59:47 2022-01-13 12:02:44 Glenwo… 525    
## 2 A6CF8980A652D… electr… 2022-01-10 08:41:56 2022-01-10 08:46:17 Glenwo… 525    
## 3 BD0F91DFF741C… classi… 2022-01-25 04:53:40 2022-01-25 04:58:01 Sheffi… TA1306…
## 4 CBB80ED419105… classi… 2022-01-04 00:18:04 2022-01-04 00:33:00 Clark … KA1504…
## 5 DDC963BFDDA51… classi… 2022-01-20 01:31:10 2022-01-20 01:37:12 Michig… TA1309…
## 6 A39C6F6CC0586… classi… 2022-01-11 18:48:09 2022-01-11 18:51:31 Wood S… 637    
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## #   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## #   member_casual <chr>, and abbreviated variable names ¹​rideable_type,
## #   ²​start_station_name, ³​start_station_id

If you need to see the data in an interactive table, the DT package is available. DT is a popular R package that provides an interface to the JavaScript library DataTables. DataTables is a powerful library for creating interactive and customizable tables in web pages, and the DT package allows you to easily create and manipulate DataTables within R.

# Show the first five rows as a table, chaining the functions with the pipe operator.
# gt() renders a static table; DT::datatable() would give the interactive version.
library(DT)
bike_df %>% 
  head(5) %>% 
  gt()
ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual
C2F7DD78E82EC875 | electric_bike | 2022-01-13 11:59:47 | 2022-01-13 12:02:44 | Glenwood Ave & Touhy Ave | 525 | Clark St & Touhy Ave | RP-007 | 42.01280 | -87.66591 | 42.01256 | -87.67437 | casual
A6CF8980A652D272 | electric_bike | 2022-01-10 08:41:56 | 2022-01-10 08:46:17 | Glenwood Ave & Touhy Ave | 525 | Clark St & Touhy Ave | RP-007 | 42.01276 | -87.66597 | 42.01256 | -87.67437 | casual
BD0F91DFF741C66D | classic_bike | 2022-01-25 04:53:40 | 2022-01-25 04:58:01 | Sheffield Ave & Fullerton Ave | TA1306000016 | Greenview Ave & Fullerton Ave | TA1307000001 | 41.92560 | -87.65371 | 41.92533 | -87.66580 | member
CBB80ED419105406 | classic_bike | 2022-01-04 00:18:04 | 2022-01-04 00:33:00 | Clark St & Bryn Mawr Ave | KA1504000151 | Paulina St & Montrose Ave | TA1309000021 | 41.98359 | -87.66915 | 41.96151 | -87.67139 | casual
DDC963BFDDA51EEA | classic_bike | 2022-01-20 01:31:10 | 2022-01-20 01:37:12 | Michigan Ave & Jackson Blvd | TA1309000002 | State St & Randolph St | TA1305000029 | 41.87785 | -87.62408 | 41.88462 | -87.62783 | member

The following code chunks complete the ‘Process’ phase for bike_df.

# checking merged data frame
colnames(bike_df)  #List of column names
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
head(bike_df)  #See the first 6 rows of data frame.  Also tail(bike_df)
## # A tibble: 6 × 13
##   ride_id        ridea…¹ started_at          ended_at            start…² start…³
##   <chr>          <chr>   <dttm>              <dttm>              <chr>   <chr>  
## 1 C2F7DD78E82EC… electr… 2022-01-13 11:59:47 2022-01-13 12:02:44 Glenwo… 525    
## 2 A6CF8980A652D… electr… 2022-01-10 08:41:56 2022-01-10 08:46:17 Glenwo… 525    
## 3 BD0F91DFF741C… classi… 2022-01-25 04:53:40 2022-01-25 04:58:01 Sheffi… TA1306…
## 4 CBB80ED419105… classi… 2022-01-04 00:18:04 2022-01-04 00:33:00 Clark … KA1504…
## 5 DDC963BFDDA51… classi… 2022-01-20 01:31:10 2022-01-20 01:37:12 Michig… TA1309…
## 6 A39C6F6CC0586… classi… 2022-01-11 18:48:09 2022-01-11 18:51:31 Wood S… 637    
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## #   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## #   member_casual <chr>, and abbreviated variable names ¹​rideable_type,
## #   ²​start_station_name, ³​start_station_id
str(bike_df)  #See list of columns and data types (numeric, character, etc)
## spc_tbl_ [5,667,717 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5667717] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ rideable_type     : chr [1:5667717] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5667717], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
##  $ ended_at          : POSIXct[1:5667717], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
##  $ start_station_name: chr [1:5667717] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr [1:5667717] "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr [1:5667717] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr [1:5667717] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ start_lat         : num [1:5667717] 42 42 41.9 42 41.9 ...
##  $ start_lng         : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num [1:5667717] 42 42 41.9 42 41.9 ...
##  $ end_lng           : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr [1:5667717] "casual" "casual" "member" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(bike_df)  #Statistical summary of data. Mainly for numeric.
##    ride_id          rideable_type        started_at                    
##  Length:5667717     Length:5667717     Min.   :2022-01-01 00:00:05.00  
##  Class :character   Class :character   1st Qu.:2022-05-28 19:21:05.00  
##  Mode  :character   Mode  :character   Median :2022-07-22 15:03:59.00  
##                                        Mean   :2022-07-20 07:21:18.74  
##                                        3rd Qu.:2022-09-16 07:21:29.00  
##                                        Max.   :2022-12-31 23:59:26.00  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2022-01-01 00:01:48.00   Length:5667717     Length:5667717    
##  1st Qu.:2022-05-28 19:43:07.00   Class :character   Class :character  
##  Median :2022-07-22 15:24:44.00   Mode  :character   Mode  :character  
##  Mean   :2022-07-20 07:40:45.33                                        
##  3rd Qu.:2022-09-16 07:39:03.00                                        
##  Max.   :2023-01-02 04:56:45.00                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5667717     Length:5667717     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :45.64   Max.   :-73.80  
##                                                                        
##     end_lat         end_lng       member_casual     
##  Min.   : 0.00   Min.   :-88.14   Length:5667717    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character  
##  Mean   :41.90   Mean   :-87.65                     
##  3rd Qu.:41.93   3rd Qu.:-87.63                     
##  Max.   :42.37   Max.   :  0.00                     
##  NA's   :5858    NA's   :5858
# Add year, month, day of month, day of week, ride length and start hour columns
bike_df <- bike_df %>% 
  mutate(year = format(as.Date(started_at), "%Y")) %>% # extract year
  mutate(month = format(as.Date(started_at), "%B")) %>% # extract month name (depends on the session locale)
  mutate(date = format(as.Date(started_at), "%d")) %>% # extract day of month
  mutate(day_of_week = format(as.Date(started_at), "%A")) %>% # extract day of week (depends on the session locale)
  mutate(ride_length = difftime(ended_at, started_at, units = "secs")) %>% # ride duration in seconds
  mutate(start_time = strftime(started_at, "%H")) # extract start hour
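
Month and weekday names produced with "%B" and "%A" follow the language of the R session, which matters later when the names are used as factor levels or in filters. The sketch below shows one possible locale-independent alternative in base R: it extracts the month and weekday numbers and maps them to English names. The two timestamps are start times from the preview above; the variable names are chosen for this example only.

```r
# Two example start times taken from the data shown earlier, stored in UTC
started_at <- as.POSIXct(c("2022-01-13 11:59:47", "2022-02-19 18:08:41"), tz = "UTC")

# "%m" (month number) and "%u" (weekday number, Monday = 1) do not depend on the locale
month_number <- as.integer(format(started_at, "%m"))
weekday_number <- as.integer(format(started_at, "%u"))

# Map the numbers to English names and keep calendar order as ordered factors
weekday_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                   "Friday", "Saturday", "Sunday")
month <- factor(month.name[month_number], levels = month.name, ordered = TRUE)
day_of_week <- factor(weekday_names[weekday_number], levels = weekday_names, ordered = TRUE)

month
day_of_week
```

Because the factors are ordered, plots and tables built from them list months and days in calendar order rather than alphabetically.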

# converting 'ride_length' to numeric for calculation on data

bike_df <- bike_df %>% 
  mutate(ride_length = as.numeric(ride_length))
is.numeric(bike_df$ride_length) # check that the column is now numeric
## [1] TRUE
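
The unit of a difftime value deserves a moment of attention before it is converted to a plain number. By default difftime() chooses the unit itself from the size of the differences, so the same code can return seconds, minutes, hours, or days. The sketch below uses the first ride from the preview above to show the difference; fixing `units = "secs"` is a suggestion of this example.

```r
# First ride shown in the data preview above
started_at <- as.POSIXct("2022-01-13 11:59:47", tz = "UTC")
ended_at <- as.POSIXct("2022-01-13 12:02:44", tz = "UTC")

# With the default units = "auto", difftime() picks the unit itself
auto_length <- difftime(ended_at, started_at)
units(auto_length)

# Fixing the unit makes the later as.numeric() conversion unambiguous
ride_length_secs <- as.numeric(difftime(ended_at, started_at, units = "secs"))
ride_length_mins <- ride_length_secs / 60
ride_length_secs
```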
# adding ride distance in km
library(geosphere)
bike_df$ride_distance <- distGeo(matrix(c(bike_df$start_lng, bike_df$start_lat), ncol = 2), matrix(c(bike_df$end_lng, bike_df$end_lat), ncol = 2))

bike_df$ride_distance <- bike_df$ride_distance/1000 #distance in km
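
To get a feel for what the distance column contains, a straight-line distance can be approximated by hand with the haversine formula. This is not what distGeo() computes (distGeo() works on the WGS84 ellipsoid), so treat the sketch below as a rough cross-check only. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions of this example.

```r
# Great-circle (haversine) distance in kilometres between two points given in
# decimal degrees. The mean Earth radius of 6371 km is an assumption of this
# sketch; distGeo() works on the WGS84 ellipsoid, so results differ slightly.
haversine_km <- function(lng1, lat1, lng2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# First ride from the preview: Glenwood Ave & Touhy Ave to Clark St & Touhy Ave
d_km <- haversine_km(-87.66591, 42.01280, -87.67437, 42.01256)
d_km

# Identical start and end points give a distance of zero
d_zero <- haversine_km(-87.66591, 42.01280, -87.66591, 42.01280)
d_zero
```

Either way the result is the distance between the start and end points, not the route actually ridden, so a round trip that ends where it started has a distance of zero.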
# Clean data
# remove rides where ride_length is negative or zero
bike_df_clean <- bike_df[!(bike_df$ride_length <= 0),]
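
The summary of bike_df above also shows 5,858 rides without end coordinates and at least one ride whose end coordinates are 0. Whether such rides should be removed is a judgment call that this document does not make; the sketch below only illustrates how the filter could be extended, using a toy table with hypothetical values.

```r
library(dplyr)

# Toy rides (hypothetical values) covering the problem cases seen in the summary
toy_rides <- tibble(
  ride_id     = c("ok", "zero_length", "missing_end", "zero_coords"),
  ride_length = c(177, 0, 600, 900),
  end_lat     = c(42.01256, 41.90, NA, 0),
  end_lng     = c(-87.67437, -87.64, NA, 0)
)

toy_rides_clean <- toy_rides %>%
  filter(ride_length > 0) %>%                    # drop zero or negative durations
  filter(!is.na(end_lat), !is.na(end_lng)) %>%   # drop rides without an end position
  filter(end_lat != 0, end_lng != 0)             # drop placeholder (0, 0) coordinates

toy_rides_clean
```

Filtering like this would also remove the extreme values that these rows introduce into ride_distance.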

0.1.6 Analyze

Analyzing data involves the use of statistical and computational techniques to extract insights and knowledge from data. The goal of data analysis is to identify patterns, trends, relationships, and anomalies in the data that can inform decision-making and drive business outcomes.

# Show summary data
summary(bike_df_clean)
##    ride_id          rideable_type        started_at                    
##  Length:5667186     Length:5667186     Min.   :2022-01-01 00:00:05.00  
##  Class :character   Class :character   1st Qu.:2022-05-28 19:20:00.00  
##  Mode  :character   Mode  :character   Median :2022-07-22 15:01:59.50  
##                                        Mean   :2022-07-20 07:19:14.76  
##                                        3rd Qu.:2022-09-16 07:18:50.75  
##                                        Max.   :2022-12-31 23:59:26.00  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2022-01-01 00:01:48.00   Length:5667186     Length:5667186    
##  1st Qu.:2022-05-28 19:41:54.25   Class :character   Class :character  
##  Median :2022-07-22 15:22:49.00   Mode  :character   Mode  :character  
##  Mean   :2022-07-20 07:38:41.61                                        
##  3rd Qu.:2022-09-16 07:36:10.75                                        
##  Max.   :2023-01-02 04:56:45.00                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5667186     Length:5667186     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :45.64   Max.   :-73.80  
##                                                                        
##     end_lat         end_lng       member_casual          year          
##  Min.   : 0.00   Min.   :-88.14   Length:5667186     Length:5667186    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character   Mode  :character  
##  Mean   :41.90   Mean   :-87.65                                        
##  3rd Qu.:41.93   3rd Qu.:-87.63                                        
##  Max.   :42.37   Max.   :  0.00                                        
##  NA's   :5858    NA's   :5858                                          
##     month               date           day_of_week         ride_length     
##  Length:5667186     Length:5667186     Length:5667186     Min.   :      1  
##  Class :character   Class :character   Class :character   1st Qu.:    349  
##  Mode  :character   Mode  :character   Mode  :character   Median :    617  
##                                                           Mean   :   1167  
##                                                           3rd Qu.:   1108  
##                                                           Max.   :2483235  
##                                                                            
##   start_time        ride_distance     
##  Length:5667186     Min.   :   0.000  
##  Class :character   1st Qu.:   0.873  
##  Mode  :character   Median :   1.575  
##                     Mean   :   2.140  
##                     3rd Qu.:   2.781  
##                     Max.   :9817.319  
##                     NA's   :5858

Conduct descriptive analysis. Run a few calculations in one file to get a better sense of the data layout. Options:

  • Calculate the mean of ride_length
  • Calculate the max ride_length
  • Calculate the mode of day_of_week
bike_df_clean %>% 
  summarise(average_ride_length = mean(ride_length), median_length = median(ride_length), 
            max_ride_length = max(ride_length)) %>% gt()
average_ride_length  median_length  max_ride_length
1166.846             617            2483235
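
The list above also asks for the mode of day_of_week, which the chunk does not compute. One possible way to obtain it in base R is sketched below on a toy vector with hypothetical values; on the real data the vector would be bike_df_clean$day_of_week.

```r
# Toy vector standing in for bike_df_clean$day_of_week (hypothetical values)
day_of_week <- c("Saturday", "Monday", "Saturday", "Sunday", "Saturday", "Monday")

# Count how often each day occurs, then keep the most frequent one
day_counts <- table(day_of_week)
mode_day <- names(day_counts)[which.max(day_counts)]
mode_day
```

When two days are tied, which.max() returns the first of them in the table's order, so a tie should be checked separately if it matters.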
  • Calculate the average ride_length for members and casual riders. Try rows = member_casual; Values = Average of ride_length.
bike_df_clean %>% 
  group_by(member_casual) %>% 
  summarise(rides = length(ride_id),
            ride_pct = (length(ride_id) / nrow(bike_df_clean)) * 100) %>% 
  gt()
member_casual  rides    ride_pct
casual         2321769  40.96864
member         3345417  59.03136
bike_df_clean %>% 
  group_by(rows = member_casual) %>% 
  summarise(Values = mean(ride_length)) %>% 
  ggplot(aes(x = rows, y = Values, fill = rows)) +
  geom_col() +
  scale_fill_viridis_d() +
  labs(title = "Average ride length",
       y = "Average ride length (seconds)",
       x = "Member types",
       subtitle = "Average ride length of casual riders and members")+
  theme_minimal()

For the year 2022, members made 59.0% of the rides and casual riders 41.0%, but casual riders made longer rides on average than members.
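
The ride share and the average ride length can be read side by side when they are put in one table. The sketch below shows one way to build such a table with dplyr. It runs on a toy table with hypothetical values, and the column names `n_rides`, `mean_length_min`, and `median_length_min` are chosen for this example; ride_length is assumed to be in seconds, as in bike_df_clean.

```r
library(dplyr)

# Toy rides (hypothetical values); ride_length is in seconds
trips <- tibble(
  member_casual = c("casual", "casual", "member", "member", "member"),
  ride_length   = c(1200, 1800, 600, 300, 900)
)

ride_summary <- trips %>%
  group_by(member_casual) %>%
  summarise(
    n_rides           = n(),                        # number of rides per group
    ride_pct          = 100 * n() / nrow(trips),    # share of all rides
    mean_length_min   = mean(ride_length) / 60,     # average duration in minutes
    median_length_min = median(ride_length) / 60    # median duration in minutes
  )

ride_summary
```

Reporting the median next to the mean is useful here because a few very long rides can pull the mean upward.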

  • Calculate the average ride_length for users by day_of_week. Try columns = day_of_week; Rows = member_casual; Values = Average of ride_length.
bike_df_clean %>% 
    group_by(columns = day_of_week, Rows = member_casual) %>% 
    summarise(Values = mean(ride_length), .groups='drop') %>% 
  arrange(Rows, columns)  %>% 
  ggplot(mapping = aes(x = columns, y = Values, fill = Rows)) +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  scale_fill_viridis_d() +
  labs(title = "Average of Ride Length", 
       subtitle = "the average ride length for users by day of week",
       x = "Days of Week",
       y = "Ride Length Average") +
  theme_minimal()

Across all days of the week, casual riders have the longest rides on average.
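
The layout suggested in the exercise (columns = day_of_week, rows = member_casual, values = average ride_length) corresponds to a wide table. As a sketch, assuming tidyr is available, it could be produced with pivot_wider(); the toy table below uses hypothetical values and the column name `avg_ride_length` is chosen for this example.

```r
library(dplyr)
library(tidyr)

# Toy rides (hypothetical values); ride_length is in seconds
trips <- tibble(
  member_casual = c("casual", "casual", "member", "member", "casual", "member"),
  day_of_week   = c("Saturday", "Saturday", "Saturday", "Monday", "Monday", "Monday"),
  ride_length   = c(1200, 1800, 600, 300, 900, 500)
)

avg_by_day <- trips %>%
  group_by(member_casual, day_of_week) %>%
  summarise(avg_ride_length = mean(ride_length), .groups = "drop") %>%
  # one row per user type, one column per day of the week
  pivot_wider(names_from = day_of_week, values_from = avg_ride_length)

avg_by_day
```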

  • Calculate the number of rides for users by day_of_week by adding Count of ride_id to Values.
bike_df_clean %>% 
    group_by(columns = day_of_week) %>% 
    summarise(Values = length(ride_id)) %>% 
  ggplot(mapping = aes(x = reorder(columns, -Values), y = Values,
                       fill = columns)) +
  scale_fill_viridis_d() +
  geom_col(show.legend = FALSE) +
  #theme(legend.position="none") +
  labs(title = "Number of Rides",
       subtitle = "number of rides for users by day of week",
       x = "Day of Week",
       y = "Number of Rides",
       caption = "Cyclistic trip data") +
  theme_minimal()

Bicycle use is highest at the weekend; Saturday registers the greatest number of rides.

bike_df_clean %>% 
    group_by(Days = day_of_week, members = member_casual) %>% 
    summarise(Values = length(rideable_type),.groups='drop') %>% 
  ggplot(mapping = aes(x = Days , y = Values,
                       fill = members)) +
  scale_fill_viridis_d() +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  #theme(legend.position="none") +
  labs(title = "Rides per Day",
       subtitle = "number of rides per user type per day of week",
       x = "Days",
       y = "Number of Rides",
       caption = "Cyclistic trip data") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))

On the other hand, if we look at how many bicycles are used each day, we can clearly see that members use bicycles more than casual riders.

Calculate the type of bike used by the different users

bike_df_clean %>% 
  group_by(members = member_casual, type = rideable_type) %>% # one count per user type and bike type
  summarise(Values = length(rideable_type),.groups='drop') %>% 
  ggplot(mapping = aes(x = type , y = Values,
                       fill = members)) +
  scale_fill_viridis_d() +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  #theme(legend.position="none") +
  labs(title = "Rideable Type",
       subtitle = "number and type of bicycles used per user",
       x = "Types of Bicycles",
       y = "Number of Rides per Bicycle Type",
       caption = "Cyclistic trip data") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))

Casual riders use three types of bike, whereas members use only two.

NOTICE!

The type called docked_bike is used only by casual riders.
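
A cross-tabulation is a quick way to confirm this kind of observation without a plot. The sketch below uses base R's table() on toy vectors with hypothetical values; on the real data the two vectors would be the member_casual and rideable_type columns of bike_df_clean.

```r
# Toy rides (hypothetical values) standing in for bike_df_clean
member_casual <- c("casual", "casual", "casual", "member", "member", "member")
rideable_type <- c("classic_bike", "docked_bike", "electric_bike",
                   "classic_bike", "electric_bike", "electric_bike")

# Cross-tabulate user type against bike type; a zero marks an unused combination
type_table <- table(member_casual, rideable_type)
type_table

# Number of distinct bike types actually used by each group
types_used <- rowSums(type_table > 0)
types_used
```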

# Order the months chronologically. The names come from format(..., "%B") and
# depend on the session locale, so the levels are built the same way.

bike_df_clean$month <- ordered(bike_df_clean$month, 
                            levels = format(ISOdate(2022, 1:12, 1), "%B"))

bike_df_clean %>% 
    group_by(month = month, members = member_casual) %>% 
    summarise(Values = mean(ride_length),.groups='drop') %>% 
  ggplot(mapping = aes(x = month , y = Values,
                       fill = members)) +
  scale_fill_viridis_d() +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  #theme(legend.position="none") +
  labs(title = "Average Ride Length",
       subtitle = "average ride length for members and casual riders by month",
       x = "Months",
       y = "Average Ride Length",
       caption = "Cyclistic trip data") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))

As across the days of the week, casual riders also record the longer average rides in every month of the year.

Create Maps for Geographic Visualization

Given the size of the dataset, we map only the rides of casual riders and members in the month of March that start at the DuSable Lake Shore Dr & Monroe St station (start_station_name), to compare ride length between the two groups of users.

For this analysis we use the mapview package. Mapview is an R package that provides an interactive and easy-to-use interface for visualizing spatial data on interactive maps. The package is built on top of the leaflet JavaScript library, which allows for the creation of interactive and customizable web maps.

# Loading the two geospatial packages to map data
library(mapview)
library(leafsync)

Use the filter function to extract the rides made by casual riders, restricted by month, type of bicycle, and start station name:

casual_electric_SDGA <- bike_df_clean %>%
  filter(member_casual == "casual" & month == "mars" & rideable_type == "electric_bike" & 
           start_station_name == "DuSable Lake Shore Dr & Monroe St") %>% 
   select(member_casual, Longitude = start_lng, Latitude = start_lat, rideable_type, ride_length)

Use the filter function to extract the rides made by members, restricted by the same month, type of bicycle, and start station name:

member_electric_SDGA <- bike_df_clean %>%
  filter(member_casual == "member" & month == "mars" & rideable_type == "electric_bike" & 
           start_station_name == "DuSable Lake Shore Dr & Monroe St") %>% 
  select(member_casual, Longitude = start_lng, Latitude = start_lat, rideable_type, ride_length)

# Use the leaflet platform with a dark basemap for both maps
mapviewOptions(platform = "leaflet", basemaps = "CartoDB.DarkMatter")

# Map of casual riders: point size and color both encode ride_length
casual <- mapview(casual_electric_SDGA, xcol = "Longitude",
        ycol = "Latitude",
        crs = 4326, grid = FALSE, cex = "ride_length",
        zcol = "ride_length",
        #col.regions = "tomato",
        layer.name = "Casual",
        zoom = 19)

# Map of members, built with the same settings
members <- mapview(member_electric_SDGA, xcol = "Longitude",
        ycol = "Latitude",
        crs = 4326, grid = FALSE, cex = "ride_length",
        zcol = "ride_length",
        #col.regions = "tomato",
        layer.name = "Members",
        zoom = 19)

# Show the two maps side by side with synchronized pan and zoom
sync(casual, members)

The analysis for the month of March clearly shows that casual riders outnumber members at this station. The two maps are synchronized so that the two groups can be compared side by side.

Now it is time to share the results with stakeholders.
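The visual impression given by the maps can be cross-checked with a simple count. The sketch below uses a toy data frame with invented values; only the column names and the station name come from the analysis above. It counts the March electric bike rides per user group at the station and compares the median ride length.

```r
# Toy data: invented rides, same column names as bike_df_clean
toy <- data.frame(
  member_casual      = c("casual", "casual", "casual", "member", "member", "casual"),
  month              = c("mars", "mars", "mars", "mars", "mars", "avril"),
  rideable_type      = "electric_bike",
  start_station_name = "DuSable Lake Shore Dr & Monroe St",
  ride_length        = c(25, 40, 18, 9, 12, 30)
)

# Keep only the March electric bike rides starting at the station
march <- subset(toy, month == "mars" &
                  rideable_type == "electric_bike" &
                  start_station_name == "DuSable Lake Shore Dr & Monroe St")

# Number of rides and median ride length per user group
counts  <- table(march$member_casual)
medians <- tapply(march$ride_length, march$member_casual, median)

print(counts)
print(medians)
```

On the real data, the same two summaries would show whether the difference seen on the maps comes from the number of rides, from their length, or from both.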

0.1.7 Share

Now that you have performed your analysis and gained some insights into your data, create visualizations to share your findings. Moreno has reminded you that they should be sophisticated and polished in order to effectively communicate to the executive team. Use the following Case Study Roadmap as a guide:

Key findings:

Overall, casual riders behave differently from what we initially expected.

  • In 2022, casual riders made longer journeys than members, accounting for 59% against 40.9% respectively;
  • Looking across weekdays, casual riders have the longest journey lengths on average.
  • Bicycle use is highest at the weekend; Saturday records the greatest number of users.
  • On the other hand, looking at the type of bicycle used, members clearly use bicycles more than casual riders.
  • As with weekdays, casual riders also record longer average rides than members in every month of the year.
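The text does not say how the 59% and 40.9% figures were obtained. One plausible reading, stated here as an assumption, is that each figure is a group's share of the total ride length. A minimal sketch of that calculation on invented toy values:

```r
# Toy data: invented values, same column names as bike_df_clean
toy <- data.frame(
  member_casual = c("casual", "casual", "member", "member"),
  ride_length   = c(40, 20, 25, 15)
)

# Total ride length per user group
total_by_group <- tapply(toy$ride_length, toy$member_casual, sum)

# Share of each group in the overall ride length, in percent
share_pct <- round(100 * prop.table(total_by_group), 1)

print(share_pct)
```

If the percentages were instead shares of the number of rides, the sum would simply be replaced by a count (for example with table()).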

To share this work, we are going to use Tableau. Tableau is a powerful tool for data visualization and analysis, and it can be used in a wide variety of settings, from business and finance to healthcare and education. With its user-friendly interface and powerful features, Tableau has become a popular choice for data analysts, business intelligence professionals, and other users who need to work with data on a regular basis.

This dashboard presents several charts comparing figures and statistics for casual riders and members. It can help stakeholders make data-driven decisions and improve the conversion of casual riders into members.

Dashboard

The dashboard can be viewed at this URL: Tableau_Public

0.1.8 Act

After the analysis and sharing steps, the findings of this work are:

  • Casual riders seem to enjoy adventure more than members, perhaps out of a desire to cycle.

  • In every month, casual riders are noticeably more numerous than members. Casual riders even make up 2/3 of all users.

  • The rides made by casual riders are longer than those made by members.

For the recommendations:

  • Advertising campaigns could add value by increasing the number of members.

  • Run as many promotions as possible to encourage casual riders to become members.

  • During the high-activity months (March, April, May, June), deliver strong customer service to make a good impression and attract more members.

0.2 Conclusion

This is an interesting case study based on the bicycle dataset. This work used tools such as the R programming language, spreadsheets, and Tableau Public for sharing our findings. Thanks a lot.

NOTICE!

The dashboard can be viewed at these URLs: Portfolio; Tableau_Public