Introduction

Welcome to dissertationData! This package provides easy access to data from the Youth Risk Behavior Survey (YRBS), a valuable resource for studying adolescent health trends.

Whether you’re an experienced analyst or just starting out, this guide will walk you through how to load and work with the data in R efficiently.

Installation

If you haven’t installed the package yet, you can install it from GitHub:

# Install from GitHub
install.packages("devtools")
devtools::install_github("yourusername/dissertationData")

Once installed, load the package:

library(dissertationData)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: stringr
#> Loading required package: tidyr
#> Loading required package: ggplot2

Working with the Data

This package provides different versions of the YRBS data. The simplest option is the raw datasets, available year by year from 2015 to 2023.

You can access the data using two different functions:

  1. The base R function data()

  2. A package-specific function load_yrbs()

Let’s explore both approaches and their differences.

Example: Examining Suicide Attempts in 2023 by Grade

To demonstrate how to work with the data, let’s look at the percentage of students who reported a suicide attempt in 2023, broken down by grade.

Option 1: Using the data() function

data("raw2023")

This function places the dataset in your environment as a promise: raw2023 appears right away, but its contents are only read into memory the first time you use the object, for example by passing it to another function.
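To make the idea of a promise concrete, here is a minimal base R sketch. It does not use this package: toy_data and evaluated are made-up names, and delayedAssign() stands in for what lazy loading does behind the scenes.

```r
# Illustration only: mimic a lazily loaded dataset with a base R promise.
evaluated <- FALSE

delayedAssign("toy_data", {
  evaluated <- TRUE                      # runs only when toy_data is first used
  data.frame(grade = c(9, 10, 11, 12))
})

before <- evaluated        # FALSE: the promise has not been forced yet
n_rows <- nrow(toy_data)   # first use forces the promise and builds the data
after  <- evaluated        # TRUE: the expression has now been evaluated
```

The same thing happens with raw2023: nothing is actually read until a function touches the object.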

Option 2: Using load_yrbs()

For more control over the object name and a direct approach to loading data, use the load_yrbs() function from this package:

yrbs_df <- load_yrbs("raw2023")

This function immediately assigns the dataset to the yrbs_df object, making it easier to work with.
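The internals of load_yrbs() are not shown in this guide. As a rough sketch of the general idea (load a packaged dataset and return it so you can name it yourself), the hypothetical helper below uses base R's data() with the built-in mtcars dataset as a stand-in. The name load_dataset() is an assumption made for illustration only.

```r
# Hypothetical helper, not the package's implementation: load a packaged
# dataset into a throwaway environment and return it as a regular object.
load_dataset <- function(name, package) {
  env <- new.env()
  utils::data(list = name, package = package, envir = env)
  get(name, envir = env)
}

# mtcars (from the built-in datasets package) stands in for "raw2023".
cars_df <- load_dataset("mtcars", "datasets")
dim(cars_df)
```

Returning the data instead of placing it in the global environment is what lets you pick the object name at the call site.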

Let’s continue interacting with the data to demonstrate how to answer questions using this dataset in the R environment.

Next, I will select the variables I care most about: grade and the suicide attempt items.

Considering we lost the data dictionaries due to recent federal actions, you can access the ones I was able to recover here. However, the package also includes dictionaries for each of the datasets.

Let’s use one of the dictionaries.

dict_2023 <- load_dictionary("2023")
dict_2023
#> # A tibble: 250 × 7
#>    position variable description             `column type` missing levels origin
#>       <int> <chr>    <chr>                   <chr>           <int> <name> <chr> 
#>  1        1 Q1       How old are you         fct                98 <chr>  2023  
#>  2        2 Q2       What is your sex        fct               158 <chr>  2023  
#>  3        3 Q3       In what grade are you   fct               193 <chr>  2023  
#>  4        4 Q4       Are you Hispanic/Latino fct               254 <chr>  2023  
#>  5        5 Q5       What is your race       chr              1389 <NULL> 2023  
#>  6        6 Q6       How tall are you        dbl              2289 <NULL> 2023  
#>  7        7 Q7       How much do you weigh   dbl              2289 <NULL> 2023  
#>  8        8 Q8       Seat belt use           fct              5032 <chr>  2023  
#>  9        9 Q9       Riding with a drinking… fct               505 <chr>  2023  
#> 10       10 Q10      Drinking and driving    fct              1127 <chr>  2023  
#> # ℹ 240 more rows

Now that we have the dictionary loaded, let’s search for the variables of interest: suicide attempts and grade. We can do this by querying the dictionary using dplyr.

library(dplyr)
library(stringr)

# Filter for variables related to suicide attempts and grade
suicide_grade_vars <- dict_2023 %>%
  filter(str_detect(description, regex("suicide|grade", ignore_case = TRUE)))

# View the results
print(suicide_grade_vars)
#> # A tibble: 11 × 7
#>    position variable description             `column type` missing levels origin
#>       <int> <chr>    <chr>                   <chr>           <int> <name> <chr> 
#>  1        3 Q3       In what grade are you   fct               193 <chr>  2023  
#>  2       27 Q27      Considered suicide      fct               436 <chr>  2023  
#>  3       28 Q28      Made a suicide plan pa… fct              1737 <chr>  2023  
#>  4       29 Q29      Actually attempted sui… fct               767 <chr>  2023  
#>  5       30 Q30      Injurious suicide atte… fct              5468 <chr>  2023  
#>  6       87 Q87      Grades in school        fct              3568 <chr>  2023  
#>  7      127 QN27     Seriously considered a… dbl               436 <NULL> 2023  
#>  8      128 QN28     Made a plan about how … dbl              1737 <NULL> 2023  
#>  9      129 QN29     Attempted suicide       dbl               767 <NULL> 2023  
#> 10      130 QN30     Had a suicide attempt … dbl              5468 <NULL> 2023  
#> 11      184 QN87     Described their grades… dbl              3568 <NULL> 2023

This query will return all variables in the 2023 dataset that mention suicide or grade in their descriptions. This helps ensure we select the correct variables for our analysis.
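Note that the pattern also picks up Q87 and QN87, which are about school grades, not grade level. If that is unwanted, a word-boundary pattern is one option. The sketch below uses a small hand-built stand-in for the dictionary (toy_dict, a made-up name, with a few untruncated descriptions copied from the output above), so it runs without the package.

```r
library(dplyr)
library(stringr)

# Toy stand-in for dict_2023 with a few rows from the output above.
toy_dict <- tibble(
  variable    = c("Q1", "Q3", "Q27", "Q87", "QN29"),
  description = c("How old are you", "In what grade are you",
                  "Considered suicide", "Grades in school",
                  "Attempted suicide")
)

# \\b marks a word boundary, so "grade" matches but "Grades" does not.
matched_vars <- toy_dict %>%
  filter(str_detect(description,
                    regex("suicide|\\bgrade\\b", ignore_case = TRUE))) %>%
  pull(variable)

matched_vars
```

Here pull() returns the matching variable names as a character vector, which can be handy for passing to select() later.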

Now that we have identified the variables of interest, Q3 and QN29 (all QN variables are dichotomized), let’s create a summary table using these variables.

library(gtsummary)
# Create a summary table for suicide attempts by grade

yrbs_df %>%
  select(Q3, QN29) %>%
  tbl_summary(
    by = QN29, 
    statistic = list(all_categorical() ~ "{p}%"),
    label = list(Q3 ~ "Grade")
  ) %>%
  modify_header(label ~ "Suicide Attempts") %>%
  modify_caption("**Table 1: Percentage of Suicide Attempts by Grade**")
#> 767 missing rows in the "QN29" column have been removed.
#> ! Column(s) "Q3" are class "haven_labelled".
#>  This is an intermediate data structure not meant for analysis.
#>  Convert columns with `haven::as_factor()`, `labelled::to_factor()`,
#>   `labelled::unlabelled()`, and `unclass()`. Failure to convert may have
#>   unintended consequences or result in error.
#> <https://haven.tidyverse.org/articles/semantics.html>
#> <https://larmarange.github.io/labelled/articles/intro_labelled.html#unlabelled>
Table 1: Percentage of Suicide Attempts by Grade

Suicide Attempts    1, N = 1,995¹    2, N = 17,341¹
Grade
    1               31%              28%
    2               27%              27%
    3               23%              24%
    4               18%              20%
    5               0.9%             0.2%
    Unknown         14               139

¹ %

Almost there! It would be great to clarify what “1” and “2” represent, as well as the grade levels corresponding to the options 1, 2, 3, 4, and 5.

Thanks to the dictionary, we know that 1 means “yes” and 2 means “no”. This package provides a handy function, binary_to_text(), to easily convert these values into readable labels. Additionally, the haven package offers the as_factor() function, which can be used to convert categorical variables into factors, as demonstrated in the code below:

yrbs_df %>%
  select(Q3, QN29) %>%
  mutate(QN29 = binary_to_text(QN29),
         Q3 = haven::as_factor(Q3)) %>%
  tbl_summary(
    by = QN29,
    statistic = list(all_categorical() ~ "{p}%"),
    label = list(Q3 ~ "Grade")
  ) %>%
  modify_header(label ~ "Suicide Attempts") %>%
  modify_caption("**Table 1: Percentage of Suicide Attempts by Grade**")
#> 18108 missing rows in the "QN29" column have been
#> removed.
Table 1: Percentage of Suicide Attempts by Grade

Suicide Attempts               Yes, N = 1,995¹
Grade
    9th grade                  31%
    10th grade                 27%
    11th grade                 23%
    12th grade                 18%
    Ungraded or other grade    0.9%
    Unknown                    14

¹ %

And just like that, we have a clear and well-formatted table showing that 9th graders make up the largest share (31%) of the 1,995 students who reported a suicide attempt.
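One point worth keeping in mind when reading the table: with by = QN29, tbl_summary() reports column percentages by default, that is, the grade breakdown within each response group. To compare attempt rates across grades you would want row percentages instead (tbl_summary() has a percent = "row" argument for this). The base R sketch below shows the difference on made-up toy counts, not YRBS data.

```r
# Made-up toy data for illustration only (not YRBS values).
toy <- data.frame(
  grade   = c("9th", "9th", "9th", "9th", "10th", "10th"),
  attempt = c("Yes", "No",  "No",  "No",  "Yes",  "No")
)

counts <- table(toy$grade, toy$attempt)

# Column percentages: within each answer, how are grades distributed?
col_pct <- prop.table(counts, margin = 2) * 100

# Row percentages: within each grade, what share gave each answer?
row_pct <- prop.table(counts, margin = 1) * 100

col_pct["9th", "Yes"]   # 50: half of the "Yes" answers come from 9th graders
row_pct["9th", "Yes"]   # 25: a quarter of 9th graders answered "Yes"
row_pct["10th", "Yes"]  # 50: half of 10th graders answered "Yes"
```

In this toy example 9th graders account for half of the "Yes" answers, yet 10th graders have the higher rate, so the two kinds of percentage can point in different directions.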

Note: These proportions are calculated from the raw data and do not account for survey weights. If weighted estimates are needed, appropriate adjustments should be made before interpreting the results. We will cover how to properly apply survey weights in another vignette.
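As a small preview, the toy sketch below illustrates why weights matter. Everything in it is made up: the w column is a hypothetical weight, not the name of a YRBS variable, and a real analysis would also need the full survey design (for example via the survey package) to obtain correct standard errors.

```r
# Made-up toy data: attempt is 1 ("yes") or 2 ("no"); w is a hypothetical weight.
toy <- data.frame(
  attempt = c(1, 2, 2, 2),
  w       = c(3, 1, 1, 1)
)

said_yes <- as.numeric(toy$attempt == 1)

# Unweighted share of "yes": every respondent counts equally.
unweighted <- mean(said_yes)

# Weighted share of "yes": respondents count in proportion to their weight.
weighted <- weighted.mean(said_yes, w = toy$w)

c(unweighted = unweighted, weighted = weighted)
```

The unweighted share is 0.25 and the weighted share is 0.5, because the single "yes" respondent carries half of the total weight.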

I hope this example is helpful in demonstrating how to use this package effectively within the R environment! 🚀