Lab 5 - Predicting a numerical outcome
This lab is due Wednesday, November 1 at 11:59pm.
Learning Goals
Use tidymodels framework to build a linear model and estimate regression parameters
Visualize your linear model
Intro
Parasites can cause infectious disease – but not all animals are affected by the same parasites. Some parasites are present in a multitude of species and others are confined to a single host. It is hypothesized that closely related hosts are more likely to share the same parasites. More specifically, it is thought that closely related hosts will live in similar environments and have similar genetic makeup that coincides with optimal conditions for the same parasite to flourish.
In this lab we will see how much evolutionary history predicts parasite similarity.
The Data
Today’s dataset comes from an Ecology Letters paper by Cooper at al. (2012) “Phylogenetic host specificity and understanding parasite sharing in primates” which can be found here. The goal of the paper was to identify the ability of evolutionary history and ecological traits to characterize parasite host specificity.
Each row of the data contains two species, species1
and species2
.
Subsequent columns describe metrics that compare the species:
divergence_time
: how many (millions) of years ago the two species diverged. i.e. how many million years ago they were the same species.distance
: geodesic distance between species geographic range centroids (in kilometers)BMdiff
: difference in body mass between the two species (in grams)precdiff
: difference in mean annual precipitation across the two species geographic ranges (mm)parsim
: a measure of parasite similarity (proportion of parasites shared between species, ranges from 0 to 1.)
The data are available in parasites.csv
in the data folder.
Packages
We’ll use the tidyverse package for much of the data wrangling and visualization.
Exercises
Pick one member of the team to write the answer to Exercise 1. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.
To get started, load the data and save the data frame as parasites
.
- Let’s start by examining the relationship between
divergence_time
andparsim
.Based on the goals of the analysis, what is the response variable?
Visualize the relationship between the two variables.
Use the visualization to describe the relationship between the two variables.
After the team member working on Exercise 1 renders, commits, and pushes, another team member should pull their changes and render the document. Then, they should write the answer to Exercise 2. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.
-
Next, we’ll model this relationship.
Fit the model and write the estimated regression equation.
Interpret the slope and the intercept in the context of the data.
Recreate the visualization from Exercise 1, this time adding a regression line to the visualization.
What do you notice about the prediction (regression) line that may be strange, particularly for very large divergence times?
After the team member working on Exercise 2 renders, commits, and pushes, another team member should pull their changes and render the document. Then, they should write the answer to Exercise 3. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.
-
Since
parsim
takes values between 0 and 1, we want to transform this variable so that it can range between (−∞,+∞). This will be better suited for fitting a regression model (and interpreting predicted values!)- Using mutate, create a new variable
transformed_parsim
that is calculated aslog(parsim/(1-parsim))
. Add this variable to your data frame. Note:log()
in R represents taking the nautral log.
- Using mutate, create a new variable
The function \(\log\left(\frac{p}{1-p}\right)\) is called the “logit” function. As its name suggests, we commonly think of it in the sense of logistic regression. In logistic regression, we take a linear combination of our \(x\) and our \(\beta\) to be equal to the logit of the probability \(\log\left(\frac{\hat{p}}{1-\hat{p}}\right)\) , since logit can take the values \((-\infty, \infty)\), and then we transform back to \(\hat{p}\), which takes values between 0 and 1. Here, since we are trying to build a linear model and our outcome variable takes values between 0 and 1, we simply use the logit function directly \(\log\left(\frac{parsim}{1-parsim}\right)\), which gives an outcome variable transformed_parsim
that can take values between \((-\infty, \infty)\), which is more appropriate to fit a linear model.
Then, visualize the relationship between divergence_time and
transformed_parsim
. Add a regression line to your visualization.Write a 1-2 sentence description of what you observe in the visualization.
After the team member working on Exercise 3 renders, commits, and pushes, another team member should pull their changes and render the document. Then, they should write the answer to Exercise 4. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.
-
Which variable is the strongest individual predictor of parasite similarity between species? To answer this question, begin by fitting separate linear regression models predicting
transformed_parsim
with each of the following predictor variables:divergence_time
distance
BMdiff
precdiff
Do not report the model outputs in a tidy format but save each one as dt_model
, dist_model
, BM_model
, and prec_model
, respectively. Then,
Report the slopes for each of these models. Use proper notation.
To answer the question of interest, would it be useful to compare the slopes in each model to choose the variable that is the strongest predictor of parasite similarity? Why or why not?
After the team member working on Exercise 4 renders, commits, and pushes, another team member should pull their changes and render the document. Then, they should write the answer to Exercise 5. All others should contribute to the discussion but only one person should type up the answer, render the document, commit, and push to GitHub. All others should not touch the document.
- Regardless of your answer to exercise 4b, we will also calculate the \(R^2\) of each model to help us identify the strongest individual linear predictor of
transformed_parsim
. \(R^2\) measures the percent of the variability in the response that is explained by the model.
As you may have guessed from the name, \(R^2\) can be calculated by squaring the correlation when we have a simple linear regression model. The correlation, r, takes values between -1 and 1, so \(R^2\) takes a value between 0 and 1. Intuitively, if r=1 or −1, then \(R^2\)=1, indicating the model perfectly fits the data. If r≈0 then \(R^2\)≈0, indicating the model is a very bad fit for the data.
You can calculate \(R^2\) using the glance function. For example, you can calculate \(R^2\) for dt_model using the code glance(dt_model)$r.squared
.
Calculate and report \(R^2\) for each model fit in the previous exercise.
To answer our question of interest, would it be useful to compare the \(R^2\) in each model to choose the variable that is the strongest predictor of parasite similarity? Why or why not?
After the team member working on Exercise 5 renders, commits, and pushes, all other team members should pull the changes and render the document. Finally, a team member different than the one responsible for typing up responses to Exercise 5 should do the last task outlined below.
Submission
Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to render and push changes.
You must turn in a PDF file to the Gradescope page by the submission deadline to be considered “on time”.
Make sure your data are tidy! That is, your code should not be running off the pages and spaced properly. See: https://style.tidyverse.org/ggplot2.html.
To submit your assignment:
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials \(\rightarrow\) Duke NetID and log in using your NetID credentials.
- Click on your STA 199 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”). If you do not do this, you will be subject to lose points on the assignment.
- Do not select any pages of your .pdf submission to be associated with the “Workflow & formatting” question.
- Please make only one submission per team, and add the rest of the team members to the submission.
Grading
Component | Points |
---|---|
Ex 1 | 8 |
Ex 2 | 14 |
Ex 3 | 8 |
Ex 4 | 8 |
Ex 5 | 7 |
Workflow & formatting | 5 |
Total | 50 |
The “Workflow & formatting” grade is to assess the reproducible workflow. This includes: - linking all pages appropriately on Gradescope - putting your team and member names in the YAML at the top of the document - committing the submitted version of your .qmd
to GitHub - Are you under the 80 character code limit? (You shouldn’t have to scroll to see all your code). Pipes %>%
, |>
and ggplot layers +
should be followed by a new line - You should be consistent with stylistic choices, e.g. only use 1 of =
vs <-
and %>%
vs |>
- All binary operators should be surrounded by space. For example x + y
is appropriate. x+y
is not.