Interaction Model + Model Selection - Suggested Answers

STA 199

Packages

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ broom        1.0.5          ✔ rsample      1.2.0     
✔ dials        1.2.0          ✔ tune         1.1.2     
✔ infer        1.0.5          ✔ workflows    1.1.3.9000
✔ modeldata    1.2.0          ✔ workflowsets 1.0.1     
✔ parsnip      1.1.1.9000     ✔ yardstick    1.2.0     
✔ recipes      1.0.8          
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ recipes::check()  masks devtools::check()
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/

Attaching package: 'palmerpenguins'

The following object is masked from 'package:modeldata':

    penguins

As a reminder, here is the code on how to write indicator functions:

\[ level = \begin{cases} 1 & \text{if condition is met}\\ 0 & \text{otherwise} \end{cases} \]

Penguins - Interaction

Before we fit the model, let’s visualize it. Recreate the plot from the slides that shows the interaction model between body mass (y), flipper length (x1) and island (x2).

penguins |> 
  ggplot(
    aes(x = flipper_length_mm, y = body_mass_g, color = island)
  ) +
  geom_point() +
  geom_smooth(method = "lm" , se = F)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

What is an interaction?

When the relationship between x and y depends on values or levels of z

Is there evidence of an interaction? Why or why not?

Yes! The relationship between body mass and flipper length is different across levels of island.

Fit the interaction model that models body mass by island and flipper length. Name this model peng_int. Display the summary output. Next, write out the estimated model using proper notation.

peng_int <- linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ flipper_length_mm * island , data = penguins)

tidy(peng_int)
# A tibble: 6 × 5
  term                              estimate std.error statistic  p.value
  <chr>                                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                        -5464.     431.      -12.7  2.51e-30
2 flipper_length_mm                     48.5      2.05     23.7  1.66e-73
3 islandDream                         3551.     969.        3.66 2.89e- 4
4 islandTorgersen                     3218.    1680.        1.92 5.62e- 2
5 flipper_length_mm:islandDream        -19.4      4.94     -3.93 1.03e- 4
6 flipper_length_mm:islandTorgersen    -17.4      8.73     -1.99 4.69e- 2

$$

= -5464 + 48.5flipper + 3551Dream \ + 3218Torgersen - 19.4flipperDream \ - 17.4flipper*Torgersen \

Dream = \[\begin{cases} 1 & \text{if Dream level}\\ 0 & \text{else} \end{cases}\]

\

  Tor =
\begin{cases}
  1 & \text{if Torgersen}\\
  0 & \text{otherwise}
\end{cases}

$$

Now, let’s simplify the equation to only look at penguins on the Bisoce island. What about the Dream island?

$$

= -5464 + 48.5*flipper

$$

$$

= -5464 + 48.5flipper + 35511 - 19.4flipper1 \

= (-5464 + 3551) + (48.5-19.4)*flipper

$$

You then can compare intercepts and slope coefficients across levels. What do these two lines tell you about the relationship between flipper length and body mass across these two levels?

When flipper length is 0, we estimate the mean body mass of penguins on the Bisoce island to be smaller than those on the Dream island. However, we estimate the mean body mass of Biscoe penguins to increase more quickly (at a higher rate) than those on the Dream penguins as flipper length increases

Note: We don’t interpret the interaction terms from the full model. We often simplify and compare across like above!

Now, suppose you want to change the baseline of this model. Let’s practice this by changing the baseline to Dream. You can do this a few ways, including with fct_relevel or factor.

penguins2 <- penguins |>
  mutate(island = factor(island, levels = c("Dream" , "Biscoe", "Torgersen")))

linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ flipper_length_mm*island, data = penguins2) |> tidy()
# A tibble: 6 × 5
  term                              estimate std.error statistic  p.value
  <chr>                                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                       -1913.      868.      -2.20  2.82e- 2
2 flipper_length_mm                    29.1       4.49     6.49  3.12e-10
3 islandBiscoe                      -3551.      969.      -3.66  2.89e- 4
4 islandTorgersen                    -333.     1841.      -0.181 8.57e- 1
5 flipper_length_mm:islandBiscoe       19.4       4.94     3.93  1.03e- 4
6 flipper_length_mm:islandTorgersen     1.99      9.60     0.208 8.36e- 1

Model Selection

See slides

R-squared demonstration

model_1 <- linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ island, data = penguins) 

set.seed(33)
penguins <- penguins |>
  mutate(random_numbers = rnorm(nrow(penguins), 1, 10))

model_2 <- linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ island * random_numbers, data = penguins) 


glance(model_1)$r.squared
[1] 0.3935772
glance(model_2)$r.squared
[1] 0.3952714

Did r-squared increase or decrease from model 1 to model 2? Why?

It went up! R-squared will always go up when we add variables to our model, even if the variables do not make sense