Questioning My Comprehension about Hypothesis Testing

Thoughts

Today, I wanted to advance my learning toward becoming a Data Scientist through Datacamp’s Associate Data Science Career Track, but I was distracted by other things. One of them was the question of whether I actually understand Hypothesis Testing.

At first, I continued my progress on Intermediate R and focused on how to write custom functions and use lapply() and sapply() to execute them. A function that executes other function(s). Funct-ception! But then, in the middle of my learning session, I remembered that I hadn’t practiced what I learned in the Hypothesis Testing part of the Data Analyst in R Career Track. So, I paused Intermediate R and switched to a practice exercise.
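The lapply()/sapply() idea from that lesson can be sketched in a few lines (a toy example, not taken from the course):

```r
# A custom function that squares a number
square <- function(x) x^2

nums <- list(1, 2, 3)

# lapply() applies the function to each element and returns a list
lapply(nums, square)

# sapply() does the same but simplifies the result to a vector when it can
sapply(nums, square)

# "Funct-ception": a function that takes another function as an argument
apply_twice <- function(f, x) f(f(x))
apply_twice(square, 3)  # square(square(3)) = 81
```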

When I opened my project, I was petrified. I couldn’t figure out what to do to perform hypothesis testing. My mind was blocked as if a mist filled my thoughts. “What do I do?”

I reviewed my project. I reread what I had written.

Little by little, I regained a sense of what I was trying to achieve with this project: I wanted to find out whether Group A has a higher mean score than Group B. But then I stumbled upon another question:

“So, how do I find that out? Surely, I could just aggregate the variable over the whole dataset, since I have data for the full population, but that won’t always be available in the future. So I have to use hypothesis testing on a sample of the population.”
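That distinction can be sketched with made-up data (the group/score columns here are hypothetical, not from my project): with the full population on hand, a group-mean comparison answers the question directly; with only a sample, a t-test estimates the same answer with uncertainty attached.

```r
set.seed(42)

# Hypothetical full "population": scores for two groups
population <- data.frame(
  group = rep(c("A", "B"), each = 500),
  score = c(rnorm(500, mean = 75), rnorm(500, mean = 73))
)

# With the whole population, just aggregate the means and compare
aggregate(score ~ group, data = population, FUN = mean)

# With only a sample, use a hypothesis test instead
sample_rows <- population[sample(nrow(population), 100), ]
t.test(score ~ group, data = sample_rows)
```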

I spent an hour figuring this out. I avoided using AI like Gemini or ChatGPT because I wanted to figure it out myself. But then I gave up. In the end, I asked Gemini for assistance. However, instead of making progress on the project, I ended up asking Gemini to train me in performing hypothesis tests. Guess what? My head was overloaded. The exercises were simple enough (see the excerpt below) but they took a toll on my head! Even a very basic hypothesis test took me 15 minutes to complete! What a shame!

Chat excerpt (from Gemini):

Don't be too hard on yourself! Hypothesis testing is one of those concepts that feels intuitive until you're staring at a p-value and a messy dataset. It’s a bit like learning to drive—you can read the manual all day, but you don't really "get it" until you're behind the wheel.

In R, the syntax is often straightforward, but the setup (knowing which test to use and why) is where the real skill lies. Here are three challenges ranging from "warm-up" to "real-world" to help you sharpen those skills.


Challenge 1: The "Classic" Two-Sample T-Test

Scenario: You are analyzing the fuel efficiency of two types of cars: 4-cylinder and 6-cylinder. You want to know if there is a statistically significant difference in their miles per gallon (mpg).

  • Dataset: Use the built-in mtcars dataset.
  • Tasks:
    1. Filter the data to include only 4 and 6 cylinder cars (cyl).
    2. Check the assumption of normality for both groups using a Shapiro-Wilk test (shapiro.test).
    3. Perform an Independent Two-Sample T-Test to see if mpg differs by cyl.
  • Goal: State your null hypothesis ($H_0$) and whether you reject it at $\alpha = 0.05$.

Challenge 2: Proportions and the A/B Test

Scenario: An e-commerce site is testing two different "Buy Now" button colors.

  • Group A (Red Button): 1,000 visitors, 50 clicks.
  • Group B (Green Button): 1,100 visitors, 72 clicks.
  • Tasks:
    1. Use the prop.test() function in R.
    2. Calculate the 95% confidence interval for the difference in proportions.
  • Goal: Determine if the Green button actually performs better or if the result is likely due to random chance.

Challenge 3: Analysis of Variance (ANOVA)

Scenario: You are testing three different fertilizers to see if they result in different plant heights.

  • Dataset: Create a dummy data frame or use PlantGrowth.
  • Tasks:
    1. Visualize the data using a boxplot to see the distribution.
    2. Run a One-Way ANOVA using aov().
    3. If the ANOVA is significant, perform a Tukey HSD post-hoc test (TukeyHSD()) to find out which specific fertilizers differ from one another.
  • Goal: Explain why we use ANOVA instead of just doing three separate t-tests.

Pro-Tips for R Testing

  • The P-Value Logic: If $p < 0.05$, the result is "surprising" enough under the null hypothesis that we ditch the null.
  • Formula Notation: R loves the y ~ x (Response ~ Predictor) syntax. For the t-test, it looks like t.test(mpg ~ cyl, data = my_data).
  • Assumptions Matter: Always check if your data is normal and if the variances are equal (var.equal = TRUE or FALSE) before trusting your t-test.
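That last tip can be made concrete with var.test(), which runs an F-test comparing the two group variances; its result guides the var.equal choice. A quick sketch on mtcars (the 0.05 threshold here is just a rule of thumb):

```r
library(dplyr)

# Keep only the 4- and 6-cylinder cars
cyl46 <- mtcars %>% filter(cyl %in% c(4, 6))

# F-test for equality of variances between the two groups
vt <- var.test(mpg ~ cyl, data = cyl46)
vt

# Pooled t-test only if the variances look equal;
# otherwise fall back to Welch's test (R's default, var.equal = FALSE)
t.test(mpg ~ cyl, data = cyl46, var.equal = vt$p.value > 0.05)
```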


From this experience, I question whether I truly deserve the Data Analyst in R completion certificate. I need to improve. I need to learn more. I need to get my hands dirty with hypothesis testing. I need more practice. I need to overcome this shortcoming. I want to be a Data Scientist.

Today’s Exercise

# Task 1: Is there a difference in mpg between cars with 4 and 6 cylinders?
library(dplyr)
library(ggplot2)
library(infer)

alpha <- .05

## filter only cyl 4 and 6
cyl46 <- mtcars %>% 
  filter(cyl %in% c(4, 6)) # keep only 4- and 6-cylinder cars

## normality test of cyl = 6
cyl46 %>% filter(cyl == 6) %>% 
  pull(mpg) %>% 
  shapiro.test()

## normality test of cyl = 4
cyl46 %>% filter(cyl == 4) %>% 
  pull(mpg) %>% # pull(mpg) explicitly; a bare pull() grabs the last column (carb)
  shapiro.test()

## Independent Two-Sample T-Test
### base package T-Test
### var.equal = TRUE assumes equal variances; Welch's test (the default) is safer when unsure
t.test(mpg ~ cyl, data = cyl46, alternative = "two.sided", var.equal = TRUE)

### infer package T-Test (cyl converted to factor; infer expects a two-level explanatory variable)
cyl46 %>% 
  mutate(cyl = factor(cyl)) %>% 
  t_test(mpg ~ cyl, order = c("4", "6"), alternative = "two-sided")

# Verdict: there is a statistically significant difference in mpg between cyl = 4 and cyl = 6

# Task 2: Is the Green button better than the Red button?
# declare variables
visitor <- c(1000, 1100) # Red, Green
clicks <- c(50, 72)
groups <- c("Red Button", "Green Button")
ex2 <- data.frame(groups, visitor, clicks)

# proportion test
ex2 <- ex2 %>% mutate(prop = clicks / visitor)

# alternative = "less": H1 is p_Red < p_Green, i.e. the Green button performs better
prop.test(clicks, visitor, alternative = "less", conf.level = .95)

# Verdict: not enough evidence to conclude the Green button performs better than the Red button

# Task 3: How does each fertilizer perform?

## Boxplot: weight distribution per group
PlantGrowth %>% 
  ggplot(aes(x = group, y = weight)) +
  geom_boxplot()

# One-way ANOVA
PGaov <- aov(weight ~ group, data = PlantGrowth)
summary(PGaov) # check that the ANOVA is significant before running the post-hoc test

# TukeyHSD post-hoc test
TukeyHSD(PGaov)

# Verdict: only trt2 vs trt1 differs significantly (TukeyHSD); ctrl is not significantly different from either treatment
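For the record, the challenge also asked for a 95% confidence interval for the difference in proportions, and that comes from the two-sided version of the test. A sketch reusing the same numbers:

```r
clicks  <- c(50, 72)     # Red, Green
visitor <- c(1000, 1100)

# Two-sided test reports the 95% CI for p_Red - p_Green
res <- prop.test(clicks, visitor, conf.level = 0.95)
res$conf.int
res$p.value
```

Since that interval contains zero, the two-sided test agrees with the one-sided verdict above: no convincing difference.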

Reflection

When I wrote this diary, I also reviewed my code from the exercise. I realized that if I had read the scenarios more meticulously, perhaps I could have done better. But, oh well.

I learned that I have to be more careful and thorough when doing hypothesis testing. It is a good thing to fully understand what problem I’m facing.

Update (19:45 GMT +7):

I found a table that summarizes which hypothesis test to use for which data types! It’s really good.



