Google Analytics Certificate - Capstone Project

Project Repository: www.github.com/ahmedmostafa23/bellabeat

By: Ahmed Mostafa, Data Analyst.

My GitHub: www.github.com/ahmedmostafa23
My Kaggle
My LinkedIn

1-Ask and Plan

Who are BellaBeat?

Bellabeat is a manufactuer of smart health products for women. It is a global company since 2016. They own several physical stores, their own e-commerce channel, website and online retailers (amazon and such).
BellaBeat manufactures several products. The products are ordinary looking jewelry, you won’t realize they’re (too) smart!

Owners of BellaBeat devices enjoy the BellaBeat app: an app that receives data about vitals from smart devices about activity, sleep, stress, menstruation and the users’ habits.
BellaBeat membership subscription: 24/7 access to personalized guidance on the above problems, beauty and goals.

BellaBeat Marketing Strategy: - traditional (radio, billboards, print e.g. magazines, television), but they also extensively focus on digital

To learn more about BellaBeat, you can visit their website.

The Scope of Work

My Role

Analysis task: I will be analyzing data from customer’s use of the products.

-i.e. data collected by the device itself from customers, and sent to us.

Stakeholders: - My team and the Executive Team

Deliverables: presentation and recommendations


The Analysis Problem

Problem: explore the use data of one of our health products and app to find insights, patterns and trends about the users’ usage to be used in making high-level marketing recommendations.
The dataset is data collected by the app/device.

Potential Questions that need to be answered:

Hypothesis: they use the product the most for running and tracking their menstrual period or pregnancy.

Required characteristics of the dataset:

  1. Population: there have been 100000 sold Bellabeat Leaf units from 2013 to 2017. I’m going to assume 100000 unique subjects.
    with a margin of error of 10%, a confidence level of 10% I should have a sample size of at least 68 users.
  2. Detailed non aggregate daily data of the usage, with intervals.
  3. Rating about features of the product, or the product as a whole.
  4. Also includes male users if possible.
  5. Women of different ages, race, nationality, and physically or mentally challenged women, as well as women of several income brackets
  6. preferably the data needs to be up to date as possible, with COVID-19 happening and all.

2-Prepare

1. Data Collection

2.Meta-Data

3. Software used

4. Checking if dataset can still answer original questions.

5. Data Ethics, Bias and Privacy


3-Process

The data has been backed up to GitHub in the latest commit before I start the process stage and start cleaning the dataset.

1.Data Integrity:

2. Cleaning:

The tables will now be explored. the following will be done to each table:

  1. Constraints will be added to the tables and columns (enumeration, set membership, cross column, range, not NULL etc.)
  2. The tables will be checked for duplicate rows. if any are found they will be removed
  3. The tables will be checked for NULL values. if any are found they will be dealt with accordingly.
  4. Some columns maybe combined into a single column, other columns maybe processed or divided into several other columns
  5. String types will be checked for excessive whitespace, and removed if any.

Let’s get started with each table!

3. Backup


4-Analyze

In this phase, 3 types of analysis will be done:

  1. Exploratory analysis:
    • Which categories/users/date/time have very low count compared to others, what am I going to do with them?
    • Summary of each column (aggregate: min, max, sum, count, unique count, percentiles and outliers)
    • Which categories/users have very low count compared to others, what am I going to do with them?
    • Investigate any anomalies/outliers found.
    • do I need to divide the dataset into subsets? e.g. active vs. inactive? anomalies and outliers?
  2. Checking for correlations
    • sleep vs calories or met or intensity
    • calories vs steps or intensity or met
  3. Answering the questions using data

1. Exploratory Analysis

  1. Investigating correlations between MET, intensity, steps and calories

    • calories/min depend on the level of activity, weight, height, age, pregnancy and etc. and thus varies greatly from person to the other, so calories as a number is not good metric for this investigation. however, we maybe able to find the baseline calorie for each user (at MET=1, but MET IS the ratio ratio, so MET will be used instead)
    • One thing should be for sure: at MET =1, steps should be =0, intensity =0 and calories = baseline. records where that is not satisfied (below <500 rows) have been removed from the analysis.
    • The relation between met and percentage baseline is linear, but unit stepped with a few “intersections”. i.e. each MET range covers a range of percentage baseline calories. which seems to tell that MET is by definition, the baseline calories ratio. and that MET is approximated to be whole numbers by the device.
    • The relation between cadence and baseline calories seems to be linear but is clustered heavily around the trendline. indicating that the percentage baseline of calories depends on factors other than just walking, but perhaps from user to user or other kinds of activity. i.e. the device users don’t just walk or run.
    • For intensity, there seems to be something quite strange. people having baseline calories of 10+ have an intensity of just 1! what should happen is that MET >3 should be at Intensity > 1! this maybe an indication that the device is measuring incorrectly.
      MET y-axis vs blc x-axis
  2. DailyActivity

    • Only records with active minutes > 0 and active distance > 0 will be studied. a new column called average speed will be created which is distance/time in km/h
    • light ones: 855 records are not zero. no time but no distance. in fact, 99th percentile is 1.86 km/h. thus I will let moderate be anything > 1.86km/h
    • moderate ones: 453 records have speed of > 1.86. with the 99th percentile at 4.2 km/h. Thus active will be defined as anything > 4.2 km/h
    • The interesting finding here is that different users and minutes can have the same level of activity but vary greatly in speed. which means that either the device is wrong, or that not all users have running or walking as their activity. Unfortunately, there is no other way to know what their activities are.
  3. Answering the Questions!

Pie chart of those percentages

stat very active minutes moderately active minutes
avg 35.1 22.7
25th 9.3 9
50th 27 16
75th 50.8 29

5-Act

Now that the analysis is complete, it is time to make recommendations that will drive data driven decisions for BellaBeat’s marketting strategy

  1. Investigate why users do not sleep with the device, or add some features that encourage sleeping with it. A quick google search shows that smart devices emit radiation, and is why users od not sleep with it. that could be the cause!

  2. Because we don’t know when users exactly fall asleep, Add a feature to the app or device e.g. similar to Netflix’s “are you still watching?”, or prompt them for their usual sleep schedules and concentrate on that time.

  3. Ask the users on the device or app to enter what the activity they are doing is, or what is their job or the nature of their job. if the users do not respond to these, a point system could be set up to entice them to input their information.

  4. Have the device push them to their goals on the weekend, or provide a fun alternative that they can enjoy while resting. provide them with a weekly summary so they can feel good about themselves. The company can also add features to the device that sends a message on a weekend e.g. “it’s the weekend! enjoy getting uninterrupted sleep” or something similar.

  5. Sedentary jobs are known to have adverse health effects because of sitting for too long or staring at the screen. it could be nice to add some features to remind the user to take a break or stare away from the screen or get up and stretch or etc.

  6. Add features or messages to encourage them to use the device on the weekend, i.e. to help them relax.

  7. Do not let the device approximate MET. it produces misleading data.

  8. Have HRV be the main focus of the device/app, and make that the main marketting feature. because HRV is truly the only metric that the device can use to predict whether a user is acutally healthy.


Appendix

Here you will find all of the PostgreSQL queries that were written to answer the questions or retrieve particular data

SQL syntax for removing whole duplicate records from any table “table_name”

 WITH everything AS 
    DELETE FROM table_name
    RETURNING * 
 ) 
 INSERT INTO table_name
 SELECT DISTINCT * 
 FROM everything;

SQL syntax for checking for any cells in any table "table_name" that have NULL values
 SELECT *
 FROM table_name
 WHERE
	 NOT (table_name IS NOT NULL);

SQL Syntax for summarizing a numerical column "col_name" in a table "table_name":
SELECT
    MIN(col_name),
    MAX(col_name),
    AVG(col_name),
    COUNT(*),
    PERCENTILE_DISC(0.01) WITHIN GROUP (ORDER BY col_name),
    PERCENTILE_DISC(0.05) WITHIN GROUP (ORDER BY col_name),
    PERCENTILE_DISC(0.1) WITHIN GROUP (ORDER BY col_name),
    PERCENTILE_DISC(0.25) WITHIN GROUP (ORDER BY col_name),
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY col_name),
    PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY col_name),
    PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY col_name),
    PERCENTILE_DISC(0.95) WITHIN GROUP (ORDER BY col_name),
    PERCENTILE_DISC(0.99) WITHIN GROUP (ORDER BY col_name)
    FROM table_name
    WHERE
  	  col_name IS NOT NULL 

SQL Syntax for summarizing a date/time column:
SELECT
	 MIN(col_name),
	 MAX(col_name),
	 COUNT(*)
FROM table_name
WHERE
	 col_name IS NOT NULL;

SQL Syntax for summarizing a categorical column:
SELECT
	col_name,
	COUNT(*)
FROM table_name
GROUP BY
    col_name

Question 1:
--Question 1 
 SELECT
     CAST(AVG(very_active_minutes) AS INT) AS "very",  
     CAST(AVG(moderately_active_minutes) AS INT) AS "moderate", 
     CAST(AVG(lightly_active_minutes) AS INT) AS "light",        CAST(AVG(sedentary_active_minutes) AS INT) AS "inactive"  
 FROM DailyActivity;

Question 2 and 4, count of number of records per user:

WITH Minute AS (
     SELECT *
     FROM MinuteCalories
     INNER JOIN MinuteMET USING(device_id, date, time)
     INNER JOIN MinuteIntensity USING(device_id, date, time)
     INNER JOIN MinuteSteps USING(device_id, date, time)
     LEFT JOIN MinuteSleep USING(device_id, date, time)
)
SELECT
	device_id,
	COUNT(*)
FROM Minute
GROUP BY
	device_id 

Question 2 and 4, count of number of records by day

WITH Minute AS (
    SELECT *
    FROM MinuteCalories
    INNER JOIN MinuteMET USING(device_id, date, time)
    INNER JOIN MinuteIntensity USING(device_id, date, time)
    INNER JOIN MinuteSteps USING(device_id, date, time)
    LEFT JOIN MinuteSleep USING(device_id, date, time)
)
SELECT
	date,
	COUNT(*)
FROM Minute
GROUP BY
	date
ORDER BY
	date ASC; 

Question 5:
/*This is a file that fetches records for users with both active minutes > 0, and active distance > 0 for all days of the study, and calculates their speed. it should be filtered so that it only produces runners. for moderately active speed is (1.86-4.2)km/h, for very active it is > 4.2km/h*/
SELECT
    lightly_active_distance,
    lightly_active_minutes,
    CAST((lightly_active_distance*60/lightly_active_minutes) AS NUMERIC(3,1)) AS "lightly_active_speed"
FROM DailyActivity
WHERE
	lightly_active_distance > 0
	AND lightly_active_minutes > 0
ORDER BY
	1 ASC; 

Question 6:
WITH day_avg AS (
    SELECT 
	    device_id,
	    CAST(AVG(HRV) AS INT) AS "day_avg_hrv"
	FROM SecondHRV
	WHERE
		time BETWEEN '07:00:00' AND '22:00:00'
	GROUP BY
		device_id
), night_avg AS (
	SELECT
		device_id, 
		CAST(AVG(HRV) AS INT) AS "night_avg_hrv"
	 FROM SecondHRV
	 WHERE
		 time > '22:00:00' OR time < '07:00:00'        
	GROUP BY  
	     device_id 
) 
SELECT
	d.device_id,
	day_avg_hrv,
	night_avg_hrv
FROM day_avg AS "d" 
FULL OUTER JOIN night_avg AS "n" USING(device_id) 

Questions 7&10 (the weekend), Helping function:
/*The function EXTRACT(DOW FROM date) returns the days of the week as a double precision numerical type such that Sunday = 0, and Saturday = 6. I have created the following function to take that number, and return the name of the day of the week for me, so that my life becomes easier.*/
CREATE OR REPLACE FUNCTION fn_downumber2name(dow double precision) RETURNS TEXT AS
$$
    SELECT (
	         CASE
	         WHEN dow = 0 THEN 'Sun'
	         WHEN dow = 1 THEN 'Mon'
	         WHEN dow = 2 THEN 'Tue'
	         WHEN dow = 3 THEN 'Wed'
	         WHEN dow = 4 THEN 'Thu'
	         WHEN dow = 5 THEN 'Fri'
	         ELSE 'Sat'
	         END
	        )
 $$ LANGUAGE SQL

Questions 7&10:
 /*This SQL code will group by days of the week using the above function.*/
 SELECT
     fn_downumber2Name(EXTRACT(dOW FROM date)) AS "day_of_week",
     COUNT(*) 
 FROM DailyACtivity
 GROUP BY
	 fn_downumber2name(EXTRACT(dOW FROM date))

Question 8:
 /*This file will compare the MET at work hours against the avg MET outside work hours (9:00 am to 5:00pm)*/
  WITH Minute AS (
	  SELECT *
	  FROM MinuteCalories
	  INNER JOIN MinuteMET USING(device_id, date, time)
	  INNER JOIN MinuteIntensity USING(device_id, date, time)
	  INNER JOIN MinuteSteps USING(device_id, date, time)
	  LEFT JOIN MinuteSleep USING(device_id, date, time)
 )
 SELECT
	 time,
	 CAST(AVG(MET) AS NUMERIC(3,1)) AS "work_met"
 FROM Minute
 WHERE
	 time BETWEEN '09:00:00' AND '17:00:00'
 GROUP BY
	 time
 ORDER BY
	 1 ASC; 

Question 11:
 /*This file will give me a count of sleep records by time of day for all users in all days*/
 SELECT
     time,
     COUNT(*)
 FROM
	 MinuteSleep
 WHERE
	 EXTRACT(SECOND FROM time) = 0
 GROUP BY
	 time
 ORDER BY
	 time ASC;