Bellabeat Case Study
For Google Data Analytics Certification
The Bellabeat case study is the final project in the Google Data Analytics Professional Certification course.
How can a Wellness Technology Company Play it Smart?
About the Company: Bellabeat
Bellabeat is a wellness brand based in San Francisco, California, that has created a number of smart wearable products designed specifically for women. Its ecosystem of smart wearable devices monitors biometric and lifestyle data that women can use as a guide to learn more about the outcomes of their activities, physical and otherwise. The product line includes:
- Bellabeat app – provides users with health data related to their activity
- Leaf Urban – a wellness tracker that can be worn as a bracelet, necklace, or clip
- Time – a smart technology tracker for activity, sleep, stress, etc.
- Spring – a water bottle that tracks daily water intake
- Bellabeat membership – a 24/7 subscription-based program for users
Bellabeat is looking for undiscovered opportunities in the global smart device market. Its cofounder, Urška Sršen, would like me, a new junior data analyst at Bellabeat, to analyze a public dataset, FitBit Fitness Tracker Data (CC0: Public Domain, made available through Mobius). My job is to conduct a market analysis of current trends and use the insights discovered in the FitBit data to help guide the marketing strategy for one of Bellabeat’s products.
I have chosen to focus my analysis on Leaf Urban, a piece of jewelry worn on the wrist or as a necklace. It is one of Bellabeat’s most popular wellness trackers. Leaf Urban doubles as a wellness and lifestyle (meditation) tracker that focuses on a user’s wellness goals, specifically sleeping habits, daily activities, menstrual cycle, stress, sedentary behaviors and more.
Analyze the public FitBit Fitness Tracker dataset to identify potential new growth opportunities and present recommendations for Bellabeat’s marketing strategy for Leaf Urban (bracelet or necklace).
Questions for the analysis
- What are some of the trends discovered in smart device usage?
- How could these trends be applied to Bellabeat customers?
- How could these trends help influence Bellabeat’s marketing strategy?
Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.
Prepare and Load packages
The dataset contains personal fitness tracking data from 30 FitBit users. It is a public dataset, FitBit Fitness Tracker Data (CC0: Public Domain), made available through Mobius on Kaggle. It includes 18 data sets with information about daily activity, steps, and heart rate that can be used to explore users’ habits.
About the Dataset
The dataset for this case study was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between outputs reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences.
The data set can be downloaded from the Kaggle website.
Hypothesis: I selected only eight of the 18 datasets for this study because I am interested in discovering potential opportunities in the relationship between a user’s daily activities, including steps, and sleep, and in how wearing Bellabeat Leaf Urban can help users make better lifestyle choices (food, meditation, decreased stress). I will look at the public FitBit data for trends showing how getting little sleep and having a sedentary lifestyle can affect users’ daily activity output.
The dataset tables were renamed as follows:
- dailyActivity_merged.csv -> Daily_Activity_df
- dailyCalories_merged.csv -> Daily_Calories_df
- dailyIntensities_merged.csv -> Daily_Intensities_df
- dailySteps_merged.csv -> Daily_Steps_df
- hourlySteps_merged.csv -> Hourly_Steps_df
- minuteSleep_merged.csv -> Minute_Sleep_df
- sleepDay_merged.csv -> Sleep_Day_df
- weightLogInfo_merged.csv -> Weight_Log_Info_df
The limitations for this public dataset are:
- old data (March 14, 2016 – May 12, 2016)
- small sample size
- the data is most likely not representative of all eligible FitBit users with only 30 participants
- missing user characteristics (gender, age, health, lifestyle, location, employment status, etc.)
- missing useful data, as many users did not record any sleep data, possibly because they removed their FitBit device when sleeping
- users were not asked to track when they were using their FitBit or to use it for the entire 30-day testing period.
Cleaning the data: What I am looking for
Google Sheets and RStudio (desktop) were used for this case study.
Because of the small sample size, I used both Google Sheets and RStudio to complete this analysis and for data visualization.
My Data Cleaning Approach
- identifying variable types
- using the select() and filter() functions to find missing data
- finding and dealing with missing data and duplicates
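The three checks above can be sketched on a small made-up table. This is only an illustration: `toy_df` and its values are hypothetical stand-ins for one of the FitBit tables, and base R functions (`is.na()`, `duplicated()`) are used here in place of select() and filter() so that the sketch runs without extra packages.

```r
# Hypothetical toy table standing in for one of the FitBit data frames.
toy_df <- data.frame(
  Id = c(1, 1, 2, 2, 3),
  ActivityDate = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(5000, 5000, 7200, NA, 10400),
  stringsAsFactors = FALSE
)

# 1. Identify the variable type of each column.
col_types <- sapply(toy_df, class)

# 2. Count missing values per column.
missing_per_col <- colSums(is.na(toy_df))

# 3. Count exact duplicate rows, then keep only the first copy of each row.
n_duplicates <- sum(duplicated(toy_df))
clean_df <- toy_df[!duplicated(toy_df), ]

print(col_types)
print(missing_per_col)
print(n_duplicates)
```

Note that the date column comes in as text (`character`), which is why the dates need to be converted before any date-based analysis.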
Dates and times needed to be reformatted from ‘text’ to ‘date’
Inconsistent data formats were corrected
Duplicate data was removed
Some data values were rounded to three decimal places
Note: Variable names were not manually changed to lowercase because the names are long. The clean_names() function (janitor package) was used instead to automatically make sure that the column names are unique and consistent.
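In this case study the date reformatting was done in Google Sheets. As a rough sketch of how the same two steps (text to date, and rounding to three decimal places) could be done in R instead, assuming dates stored as month/day/year text; the table `activity_toy`, the column `HoursActive` and all values are made up for illustration:

```r
# Hypothetical toy table with dates stored as text in month/day/year order.
activity_toy <- data.frame(
  Id = c(1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016"),
  HoursActive = c(6.991667, 7.643333),
  stringsAsFactors = FALSE
)

# Convert the text column to the Date type by stating the input format.
activity_toy$ActivityDate <- as.Date(activity_toy$ActivityDate, format = "%m/%d/%Y")

# Round a numeric column to three decimal places.
activity_toy$HoursActive <- round(activity_toy$HoursActive, 3)

print(class(activity_toy$ActivityDate))
print(activity_toy)
```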
RStudio – Desktop
Installing and loading of packages:
(To use glimpse(), the pillar package must first be installed.)
I first viewed the csv files in Google Sheets to make formatting changes to the dates and times. I then imported the files into RStudio (desktop) and created the data frames.
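A minimal sketch of the package setup follows. The package list is an assumption based on the functions named in this write-up (dplyr for `%>%`, rename() and n_distinct(), ggplot2 for plotting, janitor for clean_names(), pillar for glimpse()); the exact set used in the original session is not recorded here. The sketch only reports what is missing rather than installing anything automatically.

```r
# Assumed package list, inferred from the functions used in this case study.
required_pkgs <- c("dplyr", "ggplot2", "janitor", "pillar")

# Check which packages are already installed (TRUE/FALSE per package).
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  # Tell the user what to install instead of installing silently.
  message(
    "Install first: install.packages(c(",
    paste0('"', missing_pkgs, '"', collapse = ", "),
    "))"
  )
} else {
  # Everything is available, so attach each package.
  for (pkg in required_pkgs) {
    library(pkg, character.only = TRUE)
  }
}
```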
Because I am working with RStudio desktop, I imported the csv files directly from my computer directory by first ‘setting the working directory’.
Read csv files
> setwd("~/Desktop/Google Data Analyst Cerfs/Fitbit for Bellabeat-July1-2022")
> Daily_Activity_df <- read.csv("dailyActivity.csv")
> Daily_Calories_df <- read.csv("dailyCalories.csv")
> Daily_Intensities_df <- read.csv("dailyIntensities.csv")
> Daily_Steps_df <- read.csv("dailySteps.csv")
> Hourly_Steps_df <- read.csv("hourlySteps.csv")
> Minute_Sleep_df <- read.csv("minuteSleep.csv")
> Sleep_Day_df <- read.csv("sleepDay.csv")
> Weight_Log_Info_df <- read.csv("weightLogInfo.csv")
I previewed the datasets to determine how to proceed with cleaning and merging for analysis.
During this process I used the head(), colnames() and view() functions to look at the data frames in more detail and find commonalities that could be used for further exploration.
Removing Data frames
Each of the selected data frames contains the same user ‘Id’ column, which allows me to merge the different frames. Also, the Daily_Activity_df data frame includes the same data as Daily_Calories_df, Daily_Intensities_df and Daily_Steps_df; all four have the same number of observations (940). Those three data frames will therefore be removed from the rest of the analysis, and I will use the data within Daily_Activity_df instead.
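As a rough illustration of both ideas (checking that a smaller table is already contained in a larger one, and merging tables on a shared key), here is a sketch on made-up data. The table names, the values, and the use of a `Date` column as a second join key are assumptions for the example, not the actual FitBit data.

```r
# Hypothetical toy tables; values are made up.
daily_activity <- data.frame(
  Id = c(1, 1, 2),
  Date = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(5000, 6100, 7200),
  Calories = c(1800, 1900, 2100),
  stringsAsFactors = FALSE
)
daily_calories <- data.frame(
  Id = c(1, 1, 2),
  Date = c("4/12/2016", "4/13/2016", "4/12/2016"),
  Calories = c(1800, 1900, 2100),
  stringsAsFactors = FALSE
)
sleep_day <- data.frame(
  Id = c(1, 2),
  Date = c("4/12/2016", "4/12/2016"),
  TotalMinutesAsleep = c(410, 430),
  stringsAsFactors = FALSE
)

# Is every column of the smaller table also present in the larger one?
is_redundant <- all(colnames(daily_calories) %in% colnames(daily_activity))

# Do those shared columns hold the same values, row for row?
same_values <- isTRUE(all.equal(daily_activity[colnames(daily_calories)], daily_calories))

# Merge activity and sleep on the shared keys (inner join: only matching rows are kept).
activity_sleep <- merge(daily_activity, sleep_day, by = c("Id", "Date"))

print(is_redundant)
print(same_values)
print(activity_sleep)
```

When both checks are TRUE, the smaller table adds nothing new and can be dropped. The merged table only keeps user-days that appear in both inputs, so days without sleep records disappear from it.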
Cleaning/formatting the dataset
During this process I checked for duplicates, looked for unique users and inconsistent formatting. The “date” entry will need to be formatted into the correct data type.
#Remove duplicated entries from Sleep_Day_df and Minute_Sleep_df
#Verify that the data was removed
#No duplicates found in the data frames Daily_Activity_df, Hourly_Steps_df and Weight_Log_Info_df
> Minute_Sleep_df %>% duplicated() %>% sum()
> Minute_Sleep_df <- Minute_Sleep_df[!duplicated(Minute_Sleep_df),]
> Minute_Sleep_df %>% duplicated() %>% sum()
> Sleep_Day_df %>% duplicated() %>% sum()
> Sleep_Day_df <- Sleep_Day_df[!duplicated(Sleep_Day_df),]
> Sleep_Day_df %>% duplicated() %>% sum()
The glimpse() function was used to view key statistics about the data frames.
Summarizing the data frames
I used the functions n_distinct() and nrow() to determine the number of unique values and the number of rows in each data frame.
#n_distinct() to count unique user Ids
#nrow() to count rows
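A minimal sketch of these two counts on a made-up table. To keep it runnable without extra packages, the base R expression `length(unique(x))` is used; for a single column it should give the same count as dplyr's `n_distinct(x)`. The table `toy_steps` and its values are hypothetical.

```r
# Hypothetical toy table: three users with a different number of logged days each.
toy_steps <- data.frame(
  Id = c(101, 101, 102, 103, 103, 103),
  TotalSteps = c(4000, 5200, 8100, 300, 9500, 7000)
)

# Number of unique users (base R counterpart of n_distinct(toy_steps$Id)).
n_users <- length(unique(toy_steps$Id))

# Number of rows, i.e. logged user-days.
n_rows <- nrow(toy_steps)

# Rows per user, useful for spotting users who logged very little.
rows_per_user <- table(toy_steps$Id)

print(n_users)
print(n_rows)
print(rows_per_user)
```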
#Rename some columns in the data frames from ActivityDate/SleepDay to Date or ActivityHour to Hour
> Daily_Activity_df <- Daily_Activity_df %>% rename(Date = ActivityDate)
> Hourly_Steps_df <- Hourly_Steps_df %>% rename(Hour = ActivityHour)
> Sleep_Day_df <- Sleep_Day_df %>% rename(Date = SleepDay)
Organizing the data frames
I used the summarize() function to get a higher-level view of the data.
Taking a closer look at the output of the summary() and glimpse() functions, you can see that:
- the average FitBit user took 7638 steps a day.
- because the data is missing the age of the FitBit users, I cannot tell whether the average daily steps are within the recommended range for the users’ ages. I can only say that the users are active walkers but not highly active walkers.
- an average of 21.16 ‘Very Active Minutes’
- an average of 991.2 ‘Sedentary Minutes’, equivalent to 16.52 hours of sitting or sleeping
- this is an important data point to consider because it is large in comparison to the number of ‘Very Active Minutes’
- the average calories (burned) is 230.4, with the maximum for one user reaching 490.0 in one day.
- the average ‘Total Minutes Asleep’ is 419.5 (6.99 hours)
- the average ‘Total Time in Bed’ is 458.6 minutes (7.64 hours)
- the average weight is 72.04 kg (158.8 pounds)
- the average fat percent is 23.50, but this data is from only 2 users
- the average BMI is 25.19
- because this data comes from a FitBit wearable and not from a diagnostic tool, it should not be considered reliable data for analysis
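The unit conversions behind these figures are simple arithmetic and can be checked with a short sketch. The averages are taken from the list above; the helper names `minutes_to_hours()` and `kg_to_lb()` are hypothetical, and the factor 2.20462 pounds per kilogram is the usual rounded conversion factor.

```r
# Averages reported in the summary above.
avg_sedentary_min <- 991.2
avg_asleep_min <- 419.5
avg_in_bed_min <- 458.6
avg_weight_kg <- 72.04

# Hypothetical helper functions for the two conversions.
minutes_to_hours <- function(minutes) minutes / 60
kg_to_lb <- function(kg) kg * 2.20462

sedentary_hours <- round(minutes_to_hours(avg_sedentary_min), 2)  # 16.52
asleep_hours <- round(minutes_to_hours(avg_asleep_min), 2)        # 6.99
in_bed_hours <- round(minutes_to_hours(avg_in_bed_min), 2)        # 7.64
weight_lb <- round(kg_to_lb(avg_weight_kg), 1)                    # 158.8

print(c(sedentary_hours, asleep_hours, in_bed_hours, weight_lb))
```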
To visualize the analysis, the ggplot2 package was used to further explain patterns and trends.
#Calories against VeryActiveMinutes: smoothed trend line with the individual points on top
ggplot(data = Daily_Activity_df, mapping = aes(x = VeryActiveMinutes, y = Calories)) +
  geom_smooth() +
  geom_point()