Welcome to dissertationData! This package provides easy access to data from the Youth Risk Behavior Survey (YRBS), a valuable resource for studying adolescent health trends.
Whether you’re an experienced analyst or just starting out, this guide will walk you through how to load and work with the data in R efficiently.
If you haven’t installed the package yet, you can install it from GitHub:
# Install from GitHub
install.packages("devtools")
devtools::install_github("yourusername/dissertationData")
Once installed, load the package:
library(dissertationData)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#> Loading required package: stringr
#> Loading required package: tidyr
#> Loading required package: ggplot2
This package provides different versions of the YRBS data. The simplest option is the raw datasets, available year by year from 2015 to 2023.
You can access the data using two different functions:
The base R function data()
A package-specific function load_yrbs()
Let’s explore both approaches and their differences.
To demonstrate how to work with the data, let’s analyze the percentage of suicide attempts in 2023 by grade.
The data() function
data("raw2023")
This function loads the dataset into your environment as a promise: raw2023 appears in the environment right away, but its contents are not actually read into memory until the first time another function uses the object.
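To make the idea of a promise concrete, here is a minimal base R sketch. It does not use the package at all: `toy_data` and the `evaluated` flag are made up for illustration, and `delayedAssign()` stands in for the lazy-loading mechanism that sits behind `data()`.

```r
# Track whether the (toy) dataset has actually been built yet.
evaluated <- FALSE

# delayedAssign() creates a promise: the expression is stored, not run.
delayedAssign("toy_data", {
  evaluated <- TRUE
  data.frame(grade = c(9, 10, 11, 12))
})

before <- evaluated        # nothing has used toy_data yet
n_rows <- nrow(toy_data)   # first use forces the promise
after  <- evaluated        # the expression has now been run

print(before)
print(after)
print(n_rows)
```

The flag only flips once `nrow()` touches the object, which is the same "loaded on first use" behaviour described above.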
load_yrbs()
For more control over the object name and a direct approach to loading data, use the load_yrbs() function from this package:
yrbs_df <- load_yrbs("raw2023")
This function returns the dataset directly, so it is assigned to the yrbs_df object immediately (or to any other name you choose), making it easier to work with.
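For readers curious how a loader of this kind can be built on top of `data()`, here is a rough sketch. It is an assumption for illustration, not the actual implementation of `load_yrbs()`: the helper name `load_dataset()` is hypothetical, and it is demonstrated on `mtcars` from the built-in datasets package so that it runs without dissertationData installed.

```r
# Hypothetical helper (not part of dissertationData): load a packaged dataset
# into a scratch environment and return it, so the caller picks the name.
load_dataset <- function(name, package = "datasets") {
  scratch <- new.env()
  utils::data(list = name, package = package, envir = scratch)
  get(name, envir = scratch)
}

# The dataset is returned as a regular object under a name of our choosing.
cars_df <- load_dataset("mtcars")
dim(cars_df)
```

Loading into a scratch environment keeps the original dataset name out of the global environment, which is the main practical difference from calling `data()` directly.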
Let’s continue interacting with the data to demonstrate how to answer questions using this dataset in the R environment.
I will proceed by selecting the variables I care most about, which are the grade and the suicide attempts items.
Because the original data dictionaries were lost due to recent federal actions, I have made the ones I was able to recover available here. The package also includes a dictionary for each of the datasets.
Let’s use one of the dictionaries.
dict_2023 <- load_dictionary("2023")
dict_2023
#> # A tibble: 250 × 7
#> position variable description `column type` missing levels origin
#> <int> <chr> <chr> <chr> <int> <name> <chr>
#> 1 1 Q1 How old are you fct 98 <chr> 2023
#> 2 2 Q2 What is your sex fct 158 <chr> 2023
#> 3 3 Q3 In what grade are you fct 193 <chr> 2023
#> 4 4 Q4 Are you Hispanic/Latino fct 254 <chr> 2023
#> 5 5 Q5 What is your race chr 1389 <NULL> 2023
#> 6 6 Q6 How tall are you dbl 2289 <NULL> 2023
#> 7 7 Q7 How much do you weigh dbl 2289 <NULL> 2023
#> 8 8 Q8 Seat belt use fct 5032 <chr> 2023
#> 9 9 Q9 Riding with a drinking… fct 505 <chr> 2023
#> 10 10 Q10 Drinking and driving fct 1127 <chr> 2023
#> # ℹ 240 more rows
Now that we have the dictionary loaded, let’s search for the variables of interest: suicide attempts and grade. We can do this by querying the dictionary using dplyr.
library(dplyr)
library(stringr)
# Filter for variables related to suicide attempts and grade
suicide_grade_vars <- dict_2023 %>%
  filter(str_detect(description, regex("suicide|grade", ignore_case = TRUE)))
# View the results
print(suicide_grade_vars)
#> # A tibble: 11 × 7
#> position variable description `column type` missing levels origin
#> <int> <chr> <chr> <chr> <int> <name> <chr>
#> 1 3 Q3 In what grade are you fct 193 <chr> 2023
#> 2 27 Q27 Considered suicide fct 436 <chr> 2023
#> 3 28 Q28 Made a suicide plan pa… fct 1737 <chr> 2023
#> 4 29 Q29 Actually attempted sui… fct 767 <chr> 2023
#> 5 30 Q30 Injurious suicide atte… fct 5468 <chr> 2023
#> 6 87 Q87 Grades in school fct 3568 <chr> 2023
#> 7 127 QN27 Seriously considered a… dbl 436 <NULL> 2023
#> 8 128 QN28 Made a plan about how … dbl 1737 <NULL> 2023
#> 9 129 QN29 Attempted suicide dbl 767 <NULL> 2023
#> 10 130 QN30 Had a suicide attempt … dbl 5468 <NULL> 2023
#> 11 184 QN87 Described their grades… dbl 3568 <NULL> 2023
This query will return all variables in the 2023 dataset that mention suicide or grade in their descriptions. This helps ensure we select the correct variables for our analysis.
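The same kind of search can be sketched in base R with `grepl()`. The `toy_dict` data frame below is a hand-typed stand-in containing a few descriptions copied from the output above. It is not the real dictionary object, and the stricter pattern is just one possible refinement.

```r
# Toy dictionary built from a few rows of the output above.
toy_dict <- data.frame(
  variable = c("Q3", "Q6", "Q27", "Q87", "QN29"),
  description = c(
    "In what grade are you",
    "How tall are you",
    "Considered suicide",
    "Grades in school",
    "Attempted suicide"
  ),
  stringsAsFactors = FALSE
)

# Case-insensitive match on either keyword, mirroring the dplyr query.
loose <- grepl("suicide|grade", toy_dict$description,
               ignore.case = TRUE, perl = TRUE)
loose_vars <- toy_dict$variable[loose]

# Stricter pattern: "grade" as a whole word, so "Grades in school" drops out.
strict <- grepl("suicide|\\bgrade\\b", toy_dict$description,
                ignore.case = TRUE, perl = TRUE)
strict_vars <- toy_dict$variable[strict]

print(loose_vars)
print(strict_vars)
```

The loose pattern also picks up Q87 ("Grades in school"), which is why that item appears in the results above even though it is about academic grades rather than grade level.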
Now that we have identified the variables of interest, Q3 and QN29 (all QN variables are dichotomized), let’s create a summary table using these variables.
library(gtsummary)
# Create a summary table for suicide attempts by grade
yrbs_df %>%
  select(Q3, QN29) %>%
  tbl_summary(
    by = QN29,
    statistic = list(all_categorical() ~ "{p}%"),
    label = list(Q3 ~ "Grade")
  ) %>%
  modify_header(label ~ "Suicide Attempts") %>%
  modify_caption("**Table 1: Percentage of Suicide Attempts by Grade**")
#> 767 missing rows in the "QN29" column have been removed.
#> ! Column(s) "Q3" are class "haven_labelled".
#> ℹ This is an intermediate data structure not meant for analysis.
#> ℹ Convert columns with `haven::as_factor()`, `labelled::to_factor()`,
#> `labelled::unlabelled()`, and `unclass()`. Failure to convert may have
#> unintended consequences or result in error.
#> <https://haven.tidyverse.org/articles/semantics.html>
#> <https://larmarange.github.io/labelled/articles/intro_labelled.html#unlabelled>
| Suicide Attempts | 1 N = 1,995¹ | 2 N = 17,341¹ |
|---|---|---|
| Grade | | |
| 1 | 31% | 28% |
| 2 | 27% | 27% |
| 3 | 23% | 24% |
| 4 | 18% | 20% |
| 5 | 0.9% | 0.2% |
| Unknown | 14 | 139 |

¹ %
Almost there! It would be great to clarify what “1” and “2” represent, as well as the grade levels corresponding to the options 1, 2, 3, 4, and 5.
Thanks to the dictionary, we know that 1 means “yes” and 2 means “no”. This package provides a handy function, binary_to_text(), to easily convert these values into readable labels. Additionally, the haven package offers the as_factor() function, which converts labelled variables such as Q3 into factors using their value labels, as demonstrated in the code below:
yrbs_df %>%
  select(Q3, QN29) %>%
  mutate(
    QN29 = binary_to_text(QN29),
    Q3 = haven::as_factor(Q3)
  ) %>%
  tbl_summary(
    by = QN29,
    statistic = list(all_categorical() ~ "{p}%"),
    label = list(Q3 ~ "Grade")
  ) %>%
  modify_header(label ~ "Suicide Attempts") %>%
  modify_caption("**Table 1: Percentage of Suicide Attempts by Grade**")
#> 18108 missing rows in the "QN29" column have been removed.
| Suicide Attempts | Yes N = 1,995¹ |
|---|---|
| Grade | |
| 9th grade | 31% |
| 10th grade | 27% |
| 11th grade | 23% |
| 12th grade | 18% |
| Ungraded or other grade | 0.9% |
| Unknown | 14 |

¹ %
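The recoding step can also be sketched in plain base R, which makes it easy to check the result. The vectors below are made-up responses, not YRBS records, and `factor()` is used here as a stand-in: it is not how `binary_to_text()` or `haven::as_factor()` are implemented. In particular, `as_factor()` reads the labels from the column's attributes rather than having them typed by hand.

```r
# Made-up responses using the codes described above (not real YRBS records).
qn29 <- c(1, 2, 2, NA, 1, 2)
q3   <- c(1, 2, 5, 3, 4, 1)

# 1 = "Yes", 2 = "No"; anything else (including NA) stays missing.
attempt <- factor(qn29, levels = c(1, 2), labels = c("Yes", "No"))

# Grade codes 1 to 5 mapped to the labels shown in the table above.
grade <- factor(
  q3,
  levels = 1:5,
  labels = c("9th grade", "10th grade", "11th grade",
             "12th grade", "Ungraded or other grade")
)

# Check that both response levels survive and only true NAs are missing.
attempt_counts <- table(attempt, useNA = "ifany")
print(attempt_counts)
print(levels(grade))
```

A quick `table()` like this is a useful sanity check after any recode. The message above reports 18108 rows removed, which equals the 767 originally missing plus the 17,341 responses coded 2, so it is worth confirming on your own data that the "No" responses are still present after conversion.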
And just like that, we have a clear and well-formatted table. Because the percentages are computed within each response column, it shows that 9th graders make up the largest share (31%) of students who reported a suicide attempt.
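When reading a table built with `by =`, it helps to keep in mind which direction the percentages run. By default `tbl_summary()` reports column percentages, that is, the grade distribution within each response group. The sketch below uses made-up counts (not YRBS results) to show how column and row percentages can tell different stories.

```r
# Made-up counts for illustration only (not YRBS results).
counts <- matrix(
  c(30, 270,
    20, 180),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    grade   = c("9th grade", "10th grade"),
    attempt = c("Yes", "No")
  )
)

# Column percentages: share of each grade within a response group.
col_pct <- prop.table(counts, margin = 2)

# Row percentages: share of each response within a grade.
row_pct <- prop.table(counts, margin = 1)

print(round(100 * col_pct, 1))
print(round(100 * row_pct, 1))
```

In this toy example 9th graders account for 60% of the "Yes" group, yet the attempt rate is 10% in both grades, simply because more 9th graders were surveyed. If the rate within each grade is the quantity of interest, `tbl_summary()` accepts `percent = "row"`.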
Note: These proportions are calculated from the raw data and do not account for survey weights. If weighted estimates are needed, appropriate adjustments should be made before interpreting the results. We will cover how to properly apply survey weights in another vignette.
I hope this example is helpful in demonstrating how to use this package effectively within the R environment! 🚀