HW 04: Describing Distributions

Let the data be beautiful

Purpose

There are a variety of conventional ways to visualize data - tables, histograms, bar graphs, etc. The purpose is always to examine the distribution of variables related to your research question. For each variable you will create a plot, follow it with a table of summary statistics (for quantitative variables) or a frequency and proportion table (for categorical variables), and then write a summary paragraph that brings it all together.

Instructions

Right click and download this template, then put it into your Homework folder.

PART 1: Completely describe 2 categorical and 2 quantitative variables using all of the following:

  • An explanation of what the variable is, and how it is measured.
  • A table of summary statistics
    • using table() for categorical and summary() for quantitative
  • An appropriate plot with titles and axis labels
    • use plot_frq for categorical and both a histogram and boxplot for quantitative
  • A short paragraph description in complete English sentences with supporting numbers.
    • Categorical must include N and % for at least the largest category
    • For quantitative you must describe the center, shape and spread. Note any outliers.
  • Take note of categories that should not be there, outliers or potential data mistakes (e.g. having 99 children when 99 was a missing data code you forgot to deal with)
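The missing-code problem in the last bullet can be spotted and fixed in a few lines. This is a minimal sketch in base R only; the vector `children` and its values are made up for illustration and do not come from any course data set.

```r
# Hypothetical data: number of children, where 99 was used as a missing-data code
children <- c(0, 2, 1, 99, 3, 0, 99, 2)

# summary() makes the problem visible: the maximum is an implausible 99
summary(children)

# Recode the missing-data code to NA so it no longer distorts the statistics
children[children == 99] <- NA

# Re-check: summary() now reports the NAs separately from the real values
summary(children)
max(children, na.rm = TRUE)
```

Ideally this kind of recoding happens in your data management file, so that the cleaned data set you load here is already free of such codes.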

PART 2: Create an all-inclusive summary table using the tbl_summary() function inside the gtsummary package. Use the code at the bottom of this page and replace the variables used. Include all 4 variables that were analyzed as part of this assignment.

Submission instructions

  • Upload your PDF to Canvas by the due date.

Example - Part I

# Load libraries
library(tidyverse)
library(sjPlot)
library(ggpubr)
library(gtsummary)

load(here::here("data/depression_clean.Rdata")) # replace depression_clean.Rdata with YOUR data set name that you exported at the end of hw3

Depressed

The depressed variable is an indicator variable created from the CESD scale to identify potential clinical depression. This variable has two levels: depressed or not depressed.

table(clean$depressed)

    depressed not depressed 
           45           249 

plot_frq(clean$depressed)

The majority of respondents in this data set are not considered depressed; 84.7% (n=249) of individuals did not meet the threshold on the CESD scale to indicate potential clinical depression.
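The percentage quoted in the paragraph comes directly from the counts in the frequency table. As a small sketch of that arithmetic, the code below rebuilds the two counts by hand (the object names `counts`, `n_total`, and `pct` are just illustrative) and converts them to percentages.

```r
# Counts copied from the table() output above
counts <- c("depressed" = 45, "not depressed" = 249)

# Total number of respondents with a recorded value
n_total <- sum(counts)

# Percent in each category, rounded to one decimal place
pct <- round(100 * counts / n_total, 1)
pct
```

With the data set loaded, the same percentages should be obtainable without retyping the counts, for example with `round(100 * prop.table(table(clean$depressed)), 1)`.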

Age

summary(clean$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   28.00   42.50   44.41   59.00   89.00 
ggviolin(clean$age, add = c("jitter", "boxplot")) + 
  coord_flip() + labs(x="", y="Age")
gghistogram(clean$age, add_density = TRUE) + xlab("Age")

The ages of respondents range from 18 to 89 years old. The distribution is bimodal, with peaks around ages 25 and 55. The mean age is 44.4, with the median close by at 42.5, which suggests little skew. The standard deviation is 18, and the middle 50% of the reported ages lie between 28 and 59.
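Note that summary() does not report the standard deviation, so the spread has to be computed separately. The sketch below shows one way to do that in base R, along with the common boxplot rule of thumb for flagging outliers (values more than 1.5 times the IQR beyond the quartiles). The `ages` vector is hypothetical and only there so the code runs on its own; in your homework you would use your own variable, such as `clean$age`.

```r
# Hypothetical ages, for illustration only (not the course data)
ages <- c(18, 22, 25, 27, 31, 40, 45, 52, 55, 58, 63, 89)

# Spread: standard deviation and the width of the middle 50% (IQR)
sd(ages)
IQR(ages)

# The 1st and 3rd quartiles themselves
q <- quantile(ages, c(0.25, 0.75))
q

# Boxplot rule of thumb: flag values beyond 1.5 * IQR from the quartiles
lower_fence <- q[[1]] - 1.5 * IQR(ages)
upper_fence <- q[[2]] + 1.5 * IQR(ages)
outliers <- ages[ages < lower_fence | ages > upper_fence]
outliers
```

If your variable contains missing values, add `na.rm = TRUE` to `sd()`, `IQR()`, and `quantile()`.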

Example - Part II

tbl_summary(clean, 
            include = c(age, depressed), 
            statistic = list(
              all_continuous() ~ "{mean} ({sd})",
              all_categorical() ~ "{n} / {N} ({p}%)"
            ))

Characteristic        N = 294¹
age                   44 (18)
depressed
    depressed         45 / 294 (15%)
    not depressed     249 / 294 (85%)
¹ Mean (SD); n / N (%)