Google Data Analytics: Case Study 1

0.1 Introduction:

This is the last course of the Google Data Analytics Certificate. In this part we work through a case study based on a bike-share dataset, applying the steps of the data analysis process that we have covered during this journey.

The case study follows all the steps of that process: Ask, Prepare, Process, Analyze, Share, and Act.

0.1.1 Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

0.1.2 About the company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno, the director of marketing, believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

0.1.3 Ask

Three questions will guide the future marketing program:

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently? You will produce a report with the following deliverables:

A clear statement of the business task
A description of all data sources used
Documentation of any cleaning or manipulation of data
A summary of your analysis
Supporting visualizations and key findings
Your top three recommendations based on your analysis

Guiding questions

What is the problem you are trying to solve?

In this case, we want to find out how the bike trip dataset can help the company increase the number of annual members. The stakeholders want to know how annual members and casual riders differ, and how digital marketing could be used to promote membership.

How can your insights drive business decisions?

Data are collections of facts, and the historical bike trip data give stakeholders a factual basis for their decisions. Overall, data-driven decision-making can help businesses make more informed and objective decisions, reduce bias, and optimize business operations. By leveraging data analysis tools and techniques, businesses can gain insights that would otherwise be difficult or impossible to obtain and make decisions that are based on empirical evidence rather than gut instinct.

0.1.4 Prepare

For the preparation of the data, we use Cyclistic's historical trip data, which you can download via this link. The data cover a period of one year, from January 2022 to December 2022. The data are extracted and manipulated step by step so that the process we follow stays clear.

NOTICE!

This data is provided according to the Divvy Data License Agreement and released on a monthly schedule.

Given the size of the Cyclistic bike-share dataset, we chose the R programming language and the RStudio IDE (Integrated Development Environment) to analyze it. The dataset is too large to handle comfortably in spreadsheets such as Google Sheets.

0.1.5 Process

Overall, data processing is a critical step in turning raw data into meaningful insights that can inform decision-making and drive business success. By following a structured process for data processing, organizations can ensure that their data is accurate, complete, and actionable.

In this work we use the tidyverse, a collection of R packages (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr and forcats) for cleaning, transforming, and handling data.

First we install the packages needed for the exploratory data analysis (EDA), using the install.packages() function:

install.packages('tidyverse')
install.packages('lubridate')
install.packages('janitor')
install.packages('gt')

After installation, we load tidyverse, lubridate, janitor, and gt:

library(tidyverse)
library(lubridate)
library(janitor)
library(gt)

Then we import our data, which is stored on our machine in CSV (comma-separated values) format. Each file is read with the read_csv() function and assigned to a new variable.

# Read data
Jan2022 <- read_csv("../data/202201-divvy-tripdata.csv")
Feb2022 <- read_csv("../data/202202-divvy-tripdata.csv")
Mar2022 <- read_csv("../data/202203-divvy-tripdata.csv")
Apr2022 <- read_csv("../data/202204-divvy-tripdata.csv")
May2022 <- read_csv("../data/202205-divvy-tripdata.csv")
Jun2022 <- read_csv("../data/202206-divvy-tripdata.csv")
Jul2022 <- read_csv("../data/202207-divvy-tripdata.csv")
Aug2022 <- read_csv("../data/202208-divvy-tripdata.csv")
Sep2022 <- read_csv("../data/202209-divvy-publictripdata.csv")
Oct2022 <- read_csv("../data/202210-divvy-tripdata.csv")
Nov2022 <- read_csv("../data/202211-divvy-tripdata.csv")
Dec2022 <- read_csv("../data/202212-divvy-tripdata.csv")

Now it’s time to explore the data, using functions to see the column names and the structure of the data.

colnames(Jan2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(Feb2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(Apr2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
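Comparing the printed column names by eye works for a few months but becomes tedious for twelve. The check can be sketched in base R as follows, assuming the monthly data frames are collected in a list. The `same_columns()` helper and the `monthly` list are hypothetical names, and the toy data frames only stand in for the real files.

```r
# Hypothetical helper: TRUE when every data frame in `dfs` has the same
# column names, in the same order, as the first one.
same_columns <- function(dfs) {
  reference <- names(dfs[[1]])
  all(vapply(dfs, function(d) identical(names(d), reference), logical(1)))
}

# Toy stand-ins for the monthly data frames (not the real Divvy data).
monthly <- list(
  Jan = data.frame(ride_id = "a", member_casual = "casual"),
  Feb = data.frame(ride_id = "b", member_casual = "member")
)
same_columns(monthly)

# A data frame with a renamed column should be flagged.
broken <- c(monthly, list(Mar = data.frame(trip_id = "c", member_casual = "member")))
same_columns(broken)
```

With the real data, the list would be built from Jan2022 to Dec2022. The janitor package loaded earlier also provides compare_df_cols(), which reports column types as well as names.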

To see the column names together with the type of each attribute, the glimpse() function is a good choice.

glimpse(Jan2022)

## Rows: 103,770
## Columns: 13
## $ ride_id            <chr> "C2F7DD78E82EC875", "A6CF8980A652D272", "BD0F91DFF7…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "classic_bike", "…
## $ started_at         <dttm> 2022-01-13 11:59:47, 2022-01-10 08:41:56, 2022-01-…
## $ ended_at           <dttm> 2022-01-13 12:02:44, 2022-01-10 08:46:17, 2022-01-…
## $ start_station_name <chr> "Glenwood Ave & Touhy Ave", "Glenwood Ave & Touhy A…
## $ start_station_id   <chr> "525", "525", "TA1306000016", "KA1504000151", "TA13…
## $ end_station_name   <chr> "Clark St & Touhy Ave", "Clark St & Touhy Ave", "Gr…
## $ end_station_id     <chr> "RP-007", "RP-007", "TA1307000001", "TA1309000021",…
## $ start_lat          <dbl> 42.01280, 42.01276, 41.92560, 41.98359, 41.87785, 4…
## $ start_lng          <dbl> -87.66591, -87.66597, -87.65371, -87.66915, -87.624…
## $ end_lat            <dbl> 42.01256, 42.01256, 41.92533, 41.96151, 41.88462, 4…
## $ end_lng            <dbl> -87.67437, -87.67437, -87.66580, -87.67139, -87.627…
## $ member_casual      <chr> "casual", "casual", "member", "casual", "member", "…

glimpse(Feb2022)

## Rows: 115,609
## Columns: 13
## $ ride_id            <chr> "E1E065E7ED285C02", "1602DCDC5B30FFE3", "BE7DD2AF4B…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "classic_bike", "cl…
## $ started_at         <dttm> 2022-02-19 18:08:41, 2022-02-20 17:41:30, 2022-02-…
## $ ended_at           <dttm> 2022-02-19 18:23:56, 2022-02-20 17:45:56, 2022-02-…
## $ start_station_name <chr> "State St & Randolph St", "Halsted St & Wrightwood …
## $ start_station_id   <chr> "TA1305000029", "TA1309000061", "TA1305000029", "13…
## $ end_station_name   <chr> "Clark St & Lincoln Ave", "Southport Ave & Wrightwo…
## $ end_station_id     <chr> "13179", "TA1307000113", "13011", "13323", "TA13070…
## $ start_lat          <dbl> 41.88462, 41.92914, 41.88462, 41.94815, 41.88462, 4…
## $ start_lng          <dbl> -87.62783, -87.64908, -87.62783, -87.66394, -87.627…
## $ end_lat            <dbl> 41.91569, 41.92877, 41.87926, 41.95283, 41.88584, 4…
## $ end_lng            <dbl> -87.63460, -87.66391, -87.63990, -87.64999, -87.635…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…

NOTICE!

Besides glimpse(), it is also possible to use the colnames() or str() functions to get an overview of our data frame.

These functions let us view all columns with their characteristics.

We can also use the head() function to see the first rows of the data table. The head() function can be particularly useful when working with large data sets, as it allows you to quickly get a sense of the structure and content of the data without having to view the entire data set.

print(head(Jan2022))

## # A tibble: 6 × 13
##   ride_id        ridea…¹ started_at          ended_at            start…² start…³
##   <chr>          <chr>   <dttm>              <dttm>              <chr>   <chr>  
## 1 C2F7DD78E82EC… electr… 2022-01-13 11:59:47 2022-01-13 12:02:44 Glenwo… 525    
## 2 A6CF8980A652D… electr… 2022-01-10 08:41:56 2022-01-10 08:46:17 Glenwo… 525    
## 3 BD0F91DFF741C… classi… 2022-01-25 04:53:40 2022-01-25 04:58:01 Sheffi… TA1306…
## 4 CBB80ED419105… classi… 2022-01-04 00:18:04 2022-01-04 00:33:00 Clark … KA1504…
## 5 DDC963BFDDA51… classi… 2022-01-20 01:31:10 2022-01-20 01:37:12 Michig… TA1309…
## 6 A39C6F6CC0586… classi… 2022-01-11 18:48:09 2022-01-11 18:51:31 Wood S… 637    
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## #   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## #   member_casual <chr>, and abbreviated variable names ¹rideable_type,
## #   ²start_station_name, ³start_station_id

Since all these data frames have the same columns, it is now useful to merge them. The bind_rows() function combines multiple data frames row-wise (i.e., stacking them on top of each other). It is part of the dplyr package, which is a popular package for data manipulation and analysis in R.

# Use bind_rows() to combine all data frames into one
bike_df <- bind_rows(Jan2022, Feb2022, Mar2022, Apr2022, May2022, Jun2022, Jul2022, Aug2022, Sep2022, Oct2022, Nov2022, Dec2022)
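Typing twelve read_csv() calls and twelve arguments to bind_rows() works, but the same result can be reached with a loop over the files in a folder. The sketch below is one possible way to do it, not the method used in this report: `read_trip_files()` is a hypothetical helper, and the two tiny CSV files are written to a temporary folder only for demonstration.

```r
library(readr)
library(dplyr)

# Hypothetical helper: read every CSV in `dir` whose name matches `pattern`
# and stack the results into one tibble.
read_trip_files <- function(dir, pattern = "tripdata\\.csv$") {
  files <- sort(list.files(dir, pattern = pattern, full.names = TRUE))
  tables <- lapply(files, function(f) suppressMessages(read_csv(f)))
  bind_rows(tables)
}

# Demonstration on two tiny files written to a temporary folder.
demo_dir <- file.path(tempdir(), "divvy_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(tibble(ride_id = c("a", "b"), member_casual = c("casual", "member")),
          file.path(demo_dir, "202201-divvy-tripdata.csv"))
write_csv(tibble(ride_id = "c", member_casual = "member"),
          file.path(demo_dir, "202202-divvy-tripdata.csv"))

demo_df <- read_trip_files(demo_dir)
nrow(demo_df)
```

The default pattern also matches the September file, whose name ends in publictripdata.csv.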

We can use the head() function again to see the first six rows of the merged data frame. head() is a built-in R function for viewing the first few rows of a data frame or matrix.

# Use head() to inspect the merged data frame
print(head(bike_df))

## # A tibble: 6 × 13
##   ride_id        ridea…¹ started_at          ended_at            start…² start…³
##   <chr>          <chr>   <dttm>              <dttm>              <chr>   <chr>  
## 1 C2F7DD78E82EC… electr… 2022-01-13 11:59:47 2022-01-13 12:02:44 Glenwo… 525    
## 2 A6CF8980A652D… electr… 2022-01-10 08:41:56 2022-01-10 08:46:17 Glenwo… 525    
## 3 BD0F91DFF741C… classi… 2022-01-25 04:53:40 2022-01-25 04:58:01 Sheffi… TA1306…
## 4 CBB80ED419105… classi… 2022-01-04 00:18:04 2022-01-04 00:33:00 Clark … KA1504…
## 5 DDC963BFDDA51… classi… 2022-01-20 01:31:10 2022-01-20 01:37:12 Michig… TA1309…
## 6 A39C6F6CC0586… classi… 2022-01-11 18:48:09 2022-01-11 18:51:31 Wood S… 637    
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## #   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## #   member_casual <chr>, and abbreviated variable names ¹rideable_type,
## #   ²start_station_name, ³start_station_id

If you need to see the data in an interactive table, the DT package is available. The DT package is a popular package in R that provides an interface to the JavaScript library DataTables. DataTables is a powerful library for creating interactive and customizable tables in web pages, and the DT package allows you to easily create and manipulate DataTables within R.

# Show an interactive table using the DT package and the pipe operator to chain functions.
library(DT)
bike_df %>% 
  head(5) %>% 
  datatable()

ride_id	rideable_type	started_at	ended_at	start_station_name	start_station_id	end_station_name	end_station_id	start_lat	start_lng	end_lat	end_lng	member_casual
C2F7DD78E82EC875	electric_bike	2022-01-13 11:59:47	2022-01-13 12:02:44	Glenwood Ave & Touhy Ave	525	Clark St & Touhy Ave	RP-007	42.01280	-87.66591	42.01256	-87.67437	casual
A6CF8980A652D272	electric_bike	2022-01-10 08:41:56	2022-01-10 08:46:17	Glenwood Ave & Touhy Ave	525	Clark St & Touhy Ave	RP-007	42.01276	-87.66597	42.01256	-87.67437	casual
BD0F91DFF741C66D	classic_bike	2022-01-25 04:53:40	2022-01-25 04:58:01	Sheffield Ave & Fullerton Ave	TA1306000016	Greenview Ave & Fullerton Ave	TA1307000001	41.92560	-87.65371	41.92533	-87.66580	member
CBB80ED419105406	classic_bike	2022-01-04 00:18:04	2022-01-04 00:33:00	Clark St & Bryn Mawr Ave	KA1504000151	Paulina St & Montrose Ave	TA1309000021	41.98359	-87.66915	41.96151	-87.67139	casual
DDC963BFDDA51EEA	classic_bike	2022-01-20 01:31:10	2022-01-20 01:37:12	Michigan Ave & Jackson Blvd	TA1309000002	State St & Randolph St	TA1305000029	41.87785	-87.62408	41.88462	-87.62783	member

The following code chunks are used in this ‘Process’ phase for bike_df.

# checking merged data frame
colnames(bike_df)  #List of column names

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

head(bike_df)  #See the first 6 rows of data frame.  Also tail(bike_df)

## # A tibble: 6 × 13
##   ride_id        ridea…¹ started_at          ended_at            start…² start…³
##   <chr>          <chr>   <dttm>              <dttm>              <chr>   <chr>  
## 1 C2F7DD78E82EC… electr… 2022-01-13 11:59:47 2022-01-13 12:02:44 Glenwo… 525    
## 2 A6CF8980A652D… electr… 2022-01-10 08:41:56 2022-01-10 08:46:17 Glenwo… 525    
## 3 BD0F91DFF741C… classi… 2022-01-25 04:53:40 2022-01-25 04:58:01 Sheffi… TA1306…
## 4 CBB80ED419105… classi… 2022-01-04 00:18:04 2022-01-04 00:33:00 Clark … KA1504…
## 5 DDC963BFDDA51… classi… 2022-01-20 01:31:10 2022-01-20 01:37:12 Michig… TA1309…
## 6 A39C6F6CC0586… classi… 2022-01-11 18:48:09 2022-01-11 18:51:31 Wood S… 637    
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## #   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## #   member_casual <chr>, and abbreviated variable names ¹rideable_type,
## #   ²start_station_name, ³start_station_id

str(bike_df)  #See list of columns and data types (numeric, character, etc)

## spc_tbl_ [5,667,717 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5667717] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ rideable_type     : chr [1:5667717] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5667717], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
##  $ ended_at          : POSIXct[1:5667717], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
##  $ start_station_name: chr [1:5667717] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr [1:5667717] "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr [1:5667717] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr [1:5667717] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ start_lat         : num [1:5667717] 42 42 41.9 42 41.9 ...
##  $ start_lng         : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num [1:5667717] 42 42 41.9 42 41.9 ...
##  $ end_lng           : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr [1:5667717] "casual" "casual" "member" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

summary(bike_df)  #Statistical summary of data. Mainly for numeric.

##    ride_id          rideable_type        started_at                    
##  Length:5667717     Length:5667717     Min.   :2022-01-01 00:00:05.00  
##  Class :character   Class :character   1st Qu.:2022-05-28 19:21:05.00  
##  Mode  :character   Mode  :character   Median :2022-07-22 15:03:59.00  
##                                        Mean   :2022-07-20 07:21:18.74  
##                                        3rd Qu.:2022-09-16 07:21:29.00  
##                                        Max.   :2022-12-31 23:59:26.00  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2022-01-01 00:01:48.00   Length:5667717     Length:5667717    
##  1st Qu.:2022-05-28 19:43:07.00   Class :character   Class :character  
##  Median :2022-07-22 15:24:44.00   Mode  :character   Mode  :character  
##  Mean   :2022-07-20 07:40:45.33                                        
##  3rd Qu.:2022-09-16 07:39:03.00                                        
##  Max.   :2023-01-02 04:56:45.00                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5667717     Length:5667717     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :45.64   Max.   :-73.80  
##                                                                        
##     end_lat         end_lng       member_casual     
##  Min.   : 0.00   Min.   :-88.14   Length:5667717    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character  
##  Mean   :41.90   Mean   :-87.65                     
##  3rd Qu.:41.93   3rd Qu.:-87.63                     
##  Max.   :42.37   Max.   :  0.00                     
##  NA's   :5858    NA's   :5858

## Adding year, month, day, day of week, ride length and start hour columns
bike_df <- bike_df %>% 
  mutate(year = format(as.Date(started_at), "%Y")) %>% # extract year
  mutate(month = format(as.Date(started_at), "%B")) %>% # extract month name (depends on the session locale)
  mutate(date = format(as.Date(started_at), "%d")) %>% # extract day of the month
  mutate(day_of_week = format(as.Date(started_at), "%A")) %>% # extract day of week (depends on the session locale)
  mutate(ride_length = difftime(ended_at, started_at, units = "secs")) %>% # ride duration, always in seconds
  mutate(start_time = strftime(started_at, "%H")) # extract start hour

# converting 'ride_length' to numeric for calculation on data

bike_df <- bike_df %>% 
  mutate(ride_length = as.numeric(ride_length))
is.numeric(bike_df$ride_length) # to check it is right format

## [1] TRUE
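To see what the derived columns contain, the same kind of steps can be run on a couple of made-up rides. This is a sketch, not the code used in this report: it relies on lubridate accessors that return numbers, so the result does not depend on the language of the R session, and the column names `month_num`, `weekday_num` and `start_hour` are hypothetical.

```r
library(dplyr)
library(lubridate)

# Two made-up rides (not taken from the Divvy files).
toy <- tibble(
  started_at = ymd_hms(c("2022-01-13 11:59:47", "2022-07-02 23:50:00")),
  ended_at   = ymd_hms(c("2022-01-13 12:02:44", "2022-07-03 00:20:00"))
)

toy <- toy %>%
  mutate(
    year        = year(started_at),
    month_num   = month(started_at),                 # 1-12, locale-independent
    day         = day(started_at),                   # day of the month
    weekday_num = wday(started_at, week_start = 1),  # 1 = Monday ... 7 = Sunday
    start_hour  = hour(started_at),
    # Fixing the unit keeps difftime() from choosing minutes or hours on its own.
    ride_length = as.numeric(difftime(ended_at, started_at, units = "secs"))
  )
toy
```

Numeric months and weekdays also sort in calendar order, which is convenient when plotting.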

# adding ride distance in km
library(geosphere)
bike_df$ride_distance <- distGeo(matrix(c(bike_df$start_lng, bike_df$start_lat), ncol = 2), matrix(c(bike_df$end_lng, bike_df$end_lat), ncol = 2))

bike_df$ride_distance <- bike_df$ride_distance/1000 #distance in km
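distGeo() computes distances on an ellipsoid. As a rough cross-check that needs no extra package, the haversine formula gives the great-circle distance on a sphere. The sketch below is a swapped-in approximation, not what geosphere does internally: `haversine_km()` is a hypothetical helper and the mean Earth radius of 6371 km is an assumption.

```r
# Great-circle distance in kilometres using the haversine formula.
# Spherical approximation (mean Earth radius assumed to be 6371 km),
# not the ellipsoidal method that geosphere::distGeo() uses.
haversine_km <- function(lng1, lat1, lng2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Start and end coordinates of the first ride shown in the output above.
d <- haversine_km(-87.66591, 42.01280, -87.67437, 42.01256)
round(d, 2)
```

The function is vectorized, so it can be applied to whole columns. Both methods measure the straight-line distance between the start and end points, not the distance actually ridden.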

# Clean data
# Remove rides where ride_length is negative or zero (e.g. bikes taken out for quality checks by Divvy)
bike_df_clean <- bike_df[!(bike_df$ride_length <= 0),]
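The summary() output above also shows end coordinates equal to 0 and 5,858 missing values in end_lat and end_lng. The following sketch shows how such rows could be filtered with dplyr alongside the ride_length rule. The extra filters are an assumption for illustration and are not part of the cleaning applied in this report; the rows are made up.

```r
library(dplyr)

# Made-up rows covering the problem cases (not real Divvy records).
toy <- tibble(
  ride_id     = c("ok", "zero_len", "neg_len", "no_end", "zero_coord"),
  ride_length = c(600, 0, -30, 900, 1200),
  end_lat     = c(41.90, 41.90, 41.90, NA, 0),
  end_lng     = c(-87.64, -87.64, -87.64, NA, 0)
)

toy_clean <- toy %>%
  filter(
    ride_length > 0,                   # same rule as the base R subset above
    !is.na(end_lat), !is.na(end_lng),  # rides with no recorded end point
    end_lat != 0, end_lng != 0         # placeholder (0, 0) coordinates
  )
toy_clean$ride_id
```

Whether rides without an end point should be dropped depends on the question being asked, since they still count as trips.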

0.1.6 Analyze

Analyzing data involves the use of statistical and computational techniques to extract insights and knowledge from data. The goal of data analysis is to identify patterns, trends, relationships, and anomalies in the data that can inform decision-making and drive business outcomes.

# Show summary data
summary(bike_df_clean)

##    ride_id          rideable_type        started_at                    
##  Length:5667186     Length:5667186     Min.   :2022-01-01 00:00:05.00  
##  Class :character   Class :character   1st Qu.:2022-05-28 19:20:00.00  
##  Mode  :character   Mode  :character   Median :2022-07-22 15:01:59.50  
##                                        Mean   :2022-07-20 07:19:14.76  
##                                        3rd Qu.:2022-09-16 07:18:50.75  
##                                        Max.   :2022-12-31 23:59:26.00  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2022-01-01 00:01:48.00   Length:5667186     Length:5667186    
##  1st Qu.:2022-05-28 19:41:54.25   Class :character   Class :character  
##  Median :2022-07-22 15:22:49.00   Mode  :character   Mode  :character  
##  Mean   :2022-07-20 07:38:41.61                                        
##  3rd Qu.:2022-09-16 07:36:10.75                                        
##  Max.   :2023-01-02 04:56:45.00                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5667186     Length:5667186     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :45.64   Max.   :-73.80  
##                                                                        
##     end_lat         end_lng       member_casual          year          
##  Min.   : 0.00   Min.   :-88.14   Length:5667186     Length:5667186    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character   Mode  :character  
##  Mean   :41.90   Mean   :-87.65                                        
##  3rd Qu.:41.93   3rd Qu.:-87.63                                        
##  Max.   :42.37   Max.   :  0.00                                        
##  NA's   :5858    NA's   :5858                                          
##     month               date           day_of_week         ride_length     
##  Length:5667186     Length:5667186     Length:5667186     Min.   :      1  
##  Class :character   Class :character   Class :character   1st Qu.:    349  
##  Mode  :character   Mode  :character   Mode  :character   Median :    617  
##                                                           Mean   :   1167  
##                                                           3rd Qu.:   1108  
##                                                           Max.   :2483235  
##                                                                            
##   start_time        ride_distance     
##  Length:5667186     Min.   :   0.000  
##  Class :character   1st Qu.:   0.873  
##  Mode  :character   Median :   1.575  
##                     Mean   :   2.140  
##                     3rd Qu.:   2.781  
##                     Max.   :9817.319  
##                     NA's   :5858

Conduct descriptive analysis. Run a few calculations in one file to get a better sense of the data layout. Options:

Calculate the mean of ride_length
Calculate the max ride_length
Calculate the mode of day_of_week

bike_df_clean %>% 
  summarise(average_ride_length = mean(ride_length), median_length = median(ride_length), 
            max_ride_length = max(ride_length)) %>% gt()

average_ride_length	median_length	max_ride_length
1166.846	617	2483235
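The third option in the list, the mode of day_of_week, is not covered by the summary above, and base R has no function for the statistical mode. A minimal sketch, assuming a hypothetical helper named `stat_mode()` and made-up weekday values:

```r
# Hypothetical helper: most frequent value of a vector.
# On ties, the value that comes first in sorted order is returned.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Made-up weekday values standing in for bike_df_clean$day_of_week.
days <- c("Saturday", "Monday", "Saturday", "Sunday", "Saturday", "Monday")
stat_mode(days)
```

On the real data the call would be stat_mode(bike_df_clean$day_of_week).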

Calculate the average ride_length for members and casual riders. Try rows = member_casual; Values = Average of ride_length.

bike_df_clean %>% 
  group_by(member_casual) %>% 
  summarise(rides = length(ride_id),
            ride_pct = (length(ride_id) / nrow(bike_df_clean)) * 100) %>% 
  gt()

member_casual	rides	ride_pct
casual	2321769	40.96864
member	3345417	59.03136

bike_df_clean %>% 
  group_by(rows = member_casual) %>% 
  summarise(Values = mean(ride_length)) %>% 
  ggplot(aes(x = rows, y = Values, fill = rows)) +
  geom_col() +
  scale_fill_viridis_d() +
  labs(title = "Average ride length",
       y = "Average ride length (seconds)",
       x = "Member type",
       subtitle = "Average ride length for casual riders and members")+
  theme_minimal()

For the year 2022, members made about 59% of the rides and casual riders about 41%, but casual riders made the longest journeys on average.
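Because ride_length is stored in seconds, averages are easier to read once converted to minutes. The sketch below shows the grouping and the conversion on made-up values; the column names `mean_min` and `median_min` are hypothetical.

```r
library(dplyr)

# Made-up ride lengths in seconds (not real Divvy records).
toy <- tibble(
  member_casual = c("casual", "casual", "casual", "member", "member", "member"),
  ride_length   = c(600, 1800, 3000, 300, 600, 900)
)

by_type <- toy %>%
  group_by(member_casual) %>%
  summarise(
    rides      = n(),
    mean_min   = mean(ride_length) / 60,    # seconds -> minutes
    median_min = median(ride_length) / 60,
    .groups = "drop"
  )
by_type
```

Reporting the median next to the mean is useful here, because a few very long rides can pull the mean upward.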

Calculate the average ride_length for users by day_of_week. Try columns = day_of_week; Rows = member_casual; Values = Average of ride_length.

bike_df_clean %>% 
    group_by(columns = day_of_week, Rows = member_casual) %>% 
    summarise(Values = mean(ride_length), .groups='drop') %>% 
  arrange(Rows, columns)  %>% 
  ggplot(mapping = aes(x = columns, y = Values, fill = Rows)) +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  scale_fill_viridis_d() +
  labs(title = "Average of Ride Length", 
       subtitle = "the average ride length for users by day of week",
       x = "Days of Week",
       y = "Ride Length Average") +
  theme_minimal()

On every day of the week, casual riders have the longest journey lengths on average.
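The pivot-table layout described in the prompt (rows = member_casual, columns = day_of_week, values = average ride_length) can also be produced as a plain matrix with base R. This is a sketch on made-up rides; converting day_of_week to a factor puts the days in calendar order rather than alphabetical order.

```r
# Made-up rides; weekday names are written in English here, whereas the
# real day_of_week column follows the locale of the R session.
toy <- data.frame(
  member_casual = c("casual", "casual", "member", "member", "casual", "member"),
  day_of_week   = c("Saturday", "Monday", "Monday", "Saturday", "Saturday", "Monday"),
  ride_length   = c(1800, 600, 300, 900, 2400, 500)
)

# Put the days in calendar order instead of alphabetical order.
week_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
toy$day_of_week <- factor(toy$day_of_week, levels = week_order)

# Rows = member_casual, columns = day_of_week, values = mean ride_length.
pivot <- tapply(toy$ride_length,
                list(toy$member_casual, toy$day_of_week),
                mean)
pivot[, c("Monday", "Saturday")]
```

Days with no rides in the toy data appear as NA columns in the full matrix.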

Calculate the number of rides for users by day_of_week by adding Count of ride_id to Values.

bike_df_clean %>% 
    group_by(columns = day_of_week) %>% 
    summarise(Values = length(ride_id)) %>% 
  ggplot(mapping = aes(x = reorder(columns, -Values), y = Values,
                       fill = columns)) +
  scale_fill_viridis_d() +
  geom_col(show.legend = FALSE) +
  #theme(legend.position="none") +
  labs(title = "Number of Rides",
       subtitle = "number of rides for users by day of week",
       x = "Day of Week",
       y = "Number of Rides",
       caption = "Cyclistic trip data") +
  theme_minimal()

Bicycle use is highest at the weekend: Saturday registers the greatest number of rides.

bike_df_clean %>% 
    group_by(Days = day_of_week, members = member_casual) %>% 
    summarise(Values = length(rideable_type),.groups='drop') %>% 
  ggplot(mapping = aes(x = Days , y = Values,
                       fill = members)) +
  scale_fill_viridis_d() +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  #theme(legend.position="none") +
  labs(title = "Rideable Type",
       subtitle = "number and type of bicycles used per user per day",
       x = "Days",
       y = "Number of bicycles used",
       caption = "Cyclistic trip data") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))

On the other hand, if we look at the number of rides per day by user type, we can clearly see that members use bicycles more than casual riders.

Calculate the type of bike used by the different users

bike_df_clean %>% 
  # group only by user type and bike type, so each bar is the total count
  group_by(members = member_casual, type = rideable_type) %>% 
  summarise(Values = length(rideable_type),.groups='drop') %>% 
  ggplot(mapping = aes(x = type , y = Values,
                       fill = members)) +
  scale_fill_viridis_d() +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  #theme(legend.position="none") +
  labs(title = "Rideable Type",
       subtitle = "number and type of bicycles used per user",
       x = "Types of Bicycles",
       y = "Number of Rides per Bicycle Type",
       caption = "Cyclistic trip data") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))

Casual riders use three types of bike, whereas members use only two.

NOTICE!

The docked_bike type is used only by casual riders.

# Reorder the months chronologically
# The month names are generated in the session locale so that they match the
# values produced by format(..., "%B") earlier (e.g. "janvier" in a French locale)
bike_df_clean$month <- ordered(bike_df_clean$month, 
                            levels = format(ISOdate(2022, 1:12, 1), "%B"))

bike_df_clean %>% 
    group_by(month = month, members = member_casual) %>% 
    summarise(Values = mean(ride_length),.groups='drop') %>% 
  ggplot(mapping = aes(x = month , y = Values,
                       fill = members)) +
  scale_fill_viridis_d() +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  #theme(legend.position="none") +
  labs(title = "Average Ride Length",
       subtitle = "average ride length for members and casual riders by month",
       x = "Months",
       y = "Average Ride Length",
       caption = "Cyclistic trip data") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))

As with the days of the week, casual riders also have longer average rides than members in every month of the year.

Create Maps for Geographic Visualization

Because of the size of the dataset, we map only the electric-bike rides made in March by casual riders and members that start at the DuSable Lake Shore Dr & Monroe St station (start_station_name), and we compare ride length between the two groups of users.

For this analysis we use the mapview package. Mapview is an R package that provides an interactive and easy-to-use interface for visualizing spatial data on interactive maps. The package is built on top of the leaflet JavaScript library, which allows for the creation of interactive and customizable web maps.

# Loading the two geospatial packages to map data
library(mapview)
library(leafsync)

Use the filter() function to extract the rides made by casual riders, based on the month, the type of bicycle, and the start station name:

casual_electric_SDGA <- bike_df_clean %>%
  # the month number (3 = March) avoids depending on locale-specific month names
  filter(member_casual == "casual" & lubridate::month(started_at) == 3 & rideable_type == "electric_bike" & 
           start_station_name == "DuSable Lake Shore Dr & Monroe St") %>% 
   select(member_casual, Longitude = start_lng, Latitude = start_lat, rideable_type, ride_length)

Use the filter() function to extract the rides made by members, based on the month, the type of bicycle, and the start station name:

member_electric_SDGA <- bike_df_clean %>%
  # the month number (3 = March) avoids depending on locale-specific month names
  filter(member_casual == "member" & lubridate::month(started_at) == 3 & rideable_type == "electric_bike" & 
           start_station_name == "DuSable Lake Shore Dr & Monroe St") %>% 
  select(member_casual, Longitude = start_lng, Latitude = start_lat, rideable_type, ride_length)

casual <- mapview(casual_electric_SDGA, xcol = "Longitude",
        ycol = "Latitude",
        crs = 4326, grid = FALSE,cex = "ride_length",
        zcol = "ride_length",
        #col.regions = "tomato",
        layer.name = "Casual",
        zoom = 19,
        use.layer.names = mapviewOptions(platform = "leaflet","CartoDB.DarkMatter"))


members <- mapview(member_electric_SDGA, xcol = "Longitude",
        ycol = "Latitude",
        crs = 4326, grid = FALSE,cex = "ride_length",
        zcol = "ride_length",
        #col.regions = "tomato",
        layer.name = "Members",
        zoom = 19,
        use.layer.names = mapviewOptions(platform = "leaflet","CartoDB.DarkMatter"))

sync(casual, members)

The analysis for the month of March clearly shows that casual riders are more prominent than members at this station. The two maps are synchronized to make this comparison easier.

Now it’s time to share with stakeholders.

0.1.8 Act

After the analysis and sharing, the findings of this work are:

Casual riders seem to enjoy adventures more than members, perhaps due to a desire to cycle.
Casual riders are a large part of the ridership: they account for about 41% of all rides in 2022, against about 59% for members.
The rides made by casual riders are longer than those made by members.

For the recommendations:

Run advertising campaigns, which could add value by increasing the number of members.
Do as much promotion as possible to encourage casual riders to join as members.
During the high-intensity months (March, April, May, June), make a good impression in terms of customer management in order to attract more members.

0.2 Conclusion

This is an interesting case study based on the Cyclistic bike-share dataset. For this work, we used tools such as the R programming language, spreadsheets, and Tableau Public to analyze the data and share our findings. Thanks a lot.

NOTICE!

The dashboard can be viewed using these links: Portfolio; Tableau_Public

Abdoulaye Leye (GiS, Data Analyst Specialist)

2023-02-28
