Lab 6 - Logistic regression

Important

This lab is due Wednesday, November 8 at 11:59pm.

Packages

You’ll need the following packages for today’s lab.
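The package-loading chunk from the original lab is not reproduced here, but based on the functions used below (drop_na(), str_detect(), initial_split(), logistic_reg(), conf_mat()), you will most likely need the tidyverse and tidymodels meta-packages (the latter attaches broom, rsample, and yardstick) plus dsbox for the data. A minimal sketch:

# Load the packages this lab relies on
library(tidyverse)   # dplyr, forcats, stringr, ggplot2, ...
library(tidymodels)  # parsnip, rsample, broom, yardstick, ...
library(dsbox)       # contains the gss16 dataset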

Data

The data can be found in the dsbox package, and it’s called gss16. Since the dataset is distributed with the package, we don’t need to load it separately; it becomes available to us when we load the package.

If you would like to explicitly load the data into your environment so you can view it, you can do so by running this code.

gss16 <- gss16

You can find out more about the dataset by inspecting its documentation, which you can access by running ?gss16 in the Console or using the Help menu in RStudio to search for gss16.

Exercises

Exercise 1 - Data wrangling

Important

Remember: For each exercise, you should choose one person to type. Everyone else should contribute to the discussion, but only the typist should write up the answer, render the document, commit, and push to GitHub. No one else should touch the document.

  1. Create a new data frame called gss16_advfront that includes the variables advfront, educ, polviews, and wrkstat. Then, use the drop_na() function to remove rows that contain NAs from this new data frame. Sample code is provided below.
gss16_advfront <- gss16 |>
  select(___, ___, ___, ___) |>
  drop_na()
  2. Re-level the advfront variable such that it has two levels: "Strongly agree" and "Agree" combined into a new level called "Agree" and the remaining levels combined into "Not agree". Then, re-order the levels in the following order: "Agree" and "Not agree". Finally, count() how many times each new level appears in the advfront variable. (A sketch of one approach to this part and the next is provided after this list.)

Hint: You can do this in various ways, but you’ll likely need mutate() along with case_when() to re-level the variable and fct_relevel() to re-order the levels. One way to specify the levels you want to combine is to use str_detect() to detect key words. If you do this, note that some level names have inconsistent capitalization (e.g., “Agree” vs. “Strongly agree”), which you can handle by using “[Aa]gree” within str_detect() to identify levels that contain either “Agree” or “agree”. Be careful, though: “agree” is also a substring of “disagree”, so double-check your count() output to make sure each level ended up where you intended. (That said, solve the problem however you like; this is just one option!)

  3. Combine the levels of the polviews variable such that levels that have the word “liberal” in them are lumped into a level called "Liberal" and those that have the word “conservative” in them are lumped into a level called "Conservative". Then, re-order the levels in the following order: "Conservative", "Moderate", and "Liberal". Finally, count() how many times each new level appears in the polviews variable.
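If you get stuck on parts 2 and 3, here is a minimal sketch of one approach. The level names below are assumptions based on the hints above; check them against your own count() output and adjust as needed. Using %in% with the exact level names sidesteps the disagree/agree substring issue mentioned in the hint, and the sketch assumes that, after drop_na(), the only polviews level without “liberal” or “conservative” in it is "Moderate".

gss16_advfront <- gss16_advfront |>
  mutate(
    # Part 2: combine "Strongly agree" and "Agree"; everything else
    # becomes "Not agree", then put "Agree" first
    advfront = case_when(
      advfront %in% c("Strongly agree", "Agree") ~ "Agree",
      TRUE ~ "Not agree"
    ),
    advfront = fct_relevel(advfront, "Agree", "Not agree"),
    # Part 3: lump polviews by keyword; everything else is "Moderate"
    polviews = case_when(
      str_detect(polviews, "[Ll]iberal") ~ "Liberal",
      str_detect(polviews, "[Cc]onservative") ~ "Conservative",
      TRUE ~ "Moderate"
    ),
    polviews = fct_relevel(polviews, "Conservative", "Moderate", "Liberal")
  )

gss16_advfront |>
  count(advfront)

gss16_advfront |>
  count(polviews)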
Important

After the team member working on Exercise 1 renders, commits, and pushes, all other team members should pull. Then, choose a new team member to write the answer to Exercise 2. (And so on for the remaining exercises.)

Exercise 2 - Train and test sets

Now, let’s split the data into training and test sets so that we can evaluate the models we’re going to fit by how well they predict outcomes on data that wasn’t used to fit the models.

Specify a random seed of 1234 (i.e., include set.seed(1234) at the beginning of your code chunk), and then split gss16_advfront randomly into a training set train_data and a test set test_data. Do this so that the training set contains 80% of the rows of the original data.

Note: The slides from 11/2 give an example of how to split data into training and testing sets.
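A minimal sketch using the rsample functions that load with tidymodels (the object name gss16_split is our choice; train_data and test_data come from the exercise):

# Set a seed so the split is reproducible
set.seed(1234)

# Put 80% of the rows in the training set, 20% in the test set
gss16_split <- initial_split(gss16_advfront, prop = 0.80)
train_data <- training(gss16_split)
test_data <- testing(gss16_split)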

Exercise 3 - Logistic regression

  1. Using the training data, specify a logistic regression model that predicts advfront from educ. In particular, the model should predict the probability that advfront has value "Not agree". Name this model model1. Report the tidy output below. (A sketch of one way to fit and use this model is provided after this list.)

  2. Write out the estimated model in proper notation. State the meaning of any variables in the context of the data.

  3. Using your estimated model, predict the probability of agreeing with the following statement: Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government (Agree in advfront) if you have an education of 7 years. Hint: Use type = "prob".
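A minimal sketch of one way to do this with parsnip’s default glm engine. Assuming the level order set in Exercise 1 ("Agree" first, "Not agree" second), glm models the log-odds of the second level, "Not agree", which is what the exercise asks for.

# Part 1: fit the logistic regression model on the training data
model1 <- logistic_reg() |>
  fit(advfront ~ educ, data = train_data)

tidy(model1)

# Part 3: predicted probabilities at educ = 7; the .pred_Agree
# column gives the probability of agreeing with the statement
predict(model1, new_data = tibble(educ = 7), type = "prob")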

Exercise 4 - Another model

  1. Again using the training data, fit a new model that adds polviews as an additional explanatory variable. Name this model model2. Report the tidy output below. (A sketch is provided after this list.)

  2. Now, predict the probability of agreeing with the following statement: Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government (Agree in advfront) if you have an education of 7 years and are Conservative.
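A sketch mirroring model1, assuming the re-leveled polviews from Exercise 1:

# Part 1: add polviews as a second predictor
model2 <- logistic_reg() |>
  fit(advfront ~ educ + polviews, data = train_data)

tidy(model2)

# Part 2: predicted probabilities for educ = 7 and polviews = "Conservative"
predict(model2,
        new_data = tibble(educ = 7, polviews = "Conservative"),
        type = "prob")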

Exercise 5 - Evaluating models with AIC

  1. Report the AIC values for each of model1 and model2. (One way to extract these is sketched after this list.)

  2. Based on your results in part a, does it appear that including political views in addition to years of education is useful for modeling whether respondents agree with the statement “Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government”? Explain.
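Because both models were fit with parsnip, glance() from broom reports the AIC of the underlying glm fits; a minimal sketch:

# Lower AIC indicates a better fit, penalized for model complexity
glance(model1)$AIC
glance(model2)$AIC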

Exercise 6 - Extension: Confusion matrices

You calculated different types of probabilities in class on Thursday. We are now going to extend those calculations to look at more measures of model performance. A confusion matrix is a table commonly used in classification tasks to evaluate the results of using a model to predict on data it has not seen before. The confusion matrix displays the number of test samples that land in each combination of predicted and true values. For more information, read this quick introduction. We will use the augment() function in the broom package to create predictions for the entire test dataset, the conf_mat() function in the yardstick package to create confusion matrices, and then autoplot() to visualize them.

  1. For each of model1 and model2, create a heatmap version of a confusion matrix using the test data from Exercise 2.

    You can use the code below to make the plot for model 1. Then adapt it as needed for model 2.

model1_pred <- broom::augment(model1, test_data)

model1_pred |>
  conf_mat(truth = advfront,       # This column has our true values
           estimate = .pred_class  # Predictions are in this column
           ) |>
  autoplot(type = "heatmap") +
  labs(title = "Confusion matrix for model 1")

There are several metrics of model performance that we can read from the confusion matrix. A few of them are (these are also included with visual explanation in the link above):

  • Accuracy: The proportion of correct predictions to all predictions

  • Precision: The proportion of correct positive predictions to all positive predictions (true positives + false positives)

  • Recall: The proportion of correct positive predictions to all real positives (so true positives + false negatives)

  • F1-Score: \(2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\) (the harmonic mean of precision and recall)

Accuracy, specifically, is a flawed metric when the data are imbalanced, that is, when the proportions of positive and negative outcomes in the data are not roughly even.
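Reusing the model1_pred data frame created above, the yardstick functions accuracy() and f_meas() (the F1-score when its beta argument is left at its default of 1) compute these metrics directly; adapt for model 2. Note that yardstick treats the first factor level, here "Agree", as the positive class by default.

# Proportion of correct predictions among all test predictions
model1_pred |>
  accuracy(truth = advfront, estimate = .pred_class)

# F1-score: harmonic mean of precision and recall
model1_pred |>
  f_meas(truth = advfront, estimate = .pred_class)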

  2. Report the accuracy and F1-score for each of your models.

  3. Is accuracy or F1-score a better measure to use in assessing model performance here? Explain.

  4. Are the values from 6b consistent with your conclusion in Exercise 5b? Explain.

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to render and push changes.

You must turn in a PDF file to the Gradescope page by the submission deadline to be considered “on time”. Only one team member should submit to Gradescope, but they should add all other team members to the submission.

Make sure your code is tidy! That is, your code should not run off the page and should be spaced properly. See: https://style.tidyverse.org/ggplot2.html.

To submit your assignment:

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials \(\rightarrow\) Duke NetID and log in using your NetID credentials.
  • Click on your STA 199 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark all the pages associated with each exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”). If you do not do this, you may lose points on the assignment.
  • Do not select any pages of your .pdf submission to be associated with the “Workflow & formatting” question.
  • Only submit one submission per team on Gradescope.

Grading

Component                Points
Ex 1                     8
Ex 2                     3
Ex 3                     10
Ex 4                     5
Ex 5                     5
Ex 6                     14
Workflow & formatting    5
Total                    50
Note

The “Workflow & formatting” grade is to assess the reproducible workflow. This includes:

  • linking all pages appropriately on Gradescope

  • putting your team and member names in the YAML at the top of the document

  • committing the submitted version of your .qmd to GitHub

  • keeping all lines of code under 80 characters (you shouldn’t have to scroll to see all your code); pipes (%>% or |>) and ggplot layers (+) should be followed by a line break

  • being consistent with stylistic choices, e.g., using only one of = vs. <- for assignment and only one of %>% vs. |> for the pipe

  • surrounding all binary operators with spaces: for example, x + y is appropriate; x+y is not