# Load libraries
library(tidyverse)
library(janitor) # for the clean_names. Not required for every data set.
# Read in the data and convert the variable names to lower case
<- read_delim(here::here("data/Depress.txt")) %>% clean_names()
raw
# Select only the variables that I am interested in
<- raw %>% select(age, marital, cesd, health) mydata
HW 03: Data Management
Wrangling your data into analyzable form
Purpose
By now you should know what variables you want to use, and you’ve looked over the codebook enough now that you have an idea of some potential problems that you will encounter. This assignment uses your chosen research data, and the variables that you chose in the last assignment when you created a personal research codebook. You will thoughtfully review the variables you are interested in, document which ones will need changed and how.
All raw data needs to stay raw, and all changes need to be documented. You will create a script/code file that will make all changes to the data in a programatically and reproducible way. You will create a single code file that imports your raw data, performs some data cleaning steps, and saves out an analysis ready data set that you will use throughout the semester.
You are not expected to have completed data management for every one of your variables under consideration by the submission date. I want to see a VERY good effort has been made (raw data read in, at least 2 quant and 2 cat variables dealt with, analysis data saved out.)
Check the rubric in Canvas for more grade specific details.
Instructions
- Log in to Posit Cloud
- Upload your data to your
data
folder - Upload this starter data management file to your
script
folder in Posit Cloud. - Rename this script to use your data set name. E.g.
dm_addhealth
if you are using the Add Health data, ordm_ool
for the Outlook on Life data set.
- Upload your data to your
- After you load the data, restrict the variables to only the ones you are investigating using the
select()
function. - Explore and clean at least 2 categorical and 2 numeric variables. One by one, check each variable for necessary adjustments. Complete each of the following steps for each variable.
- First explain in English what the variable name is and what it measures.
- Identify the data type of the variable using the
class()
function. Ask (and answer) the question “Does this match with the intended data type?” - Then examine the variable using the
table()
function for categorical variables, and thesummary()
andhist()
functions for numeric variables. - Recode the data as necessary
- Always confirm your recodes worked as intended by creating another table or summary.
- Save the resulting data set to your
data
folder asdatasetname_clean.Rdata
, e.g.addhealth_clean.Rdata
using thesave()
function
Submission instructions
- You will submit your code file to Canvas. This must run on my computer.
- What comes out is what I grade.
- Render your script file before you submit to ensure that it works and looks the way you expect it to.
Example
Here is an example using R Markdown/Quarto. Everything you see below this line comes is written in the .qmd
script itself. This is an example of literate programming where you write what you are going to do, do it in code, and then write what you have learned after each step.
Setup
General Health
The variable health
records a persons perceived general health as being either Excellent, Good, Fair or Poor. This is considered an ordinal categorical variable.
table(mydata$health)
1 2 3 4
130 115 35 14
class(mydata$health)
[1] "numeric"
The variable health
currently is an integer with numeric values 1-4, but the codebook states that this is a categorical variable where 1=Excellent, 2=Good, 3=Fair, 4=Poor. So I need to convert this numeric variable to a factor variable. There are no values outside the 1-4 range, such as a -9 that codes for missing data so I do not need to make any further adjustments (You want to code out missing before you convert variables to factors)
$health_cat <- factor(mydata$health, labels=c("Excellent", "Good", "Fair", "Poor")) mydata
I will confirm that the recode worked by making a two-way table
table(mydata$health, mydata$health_cat, useNA="always")
Excellent Good Fair Poor <NA>
1 130 0 0 0 0
2 0 115 0 0 0
3 0 0 35 0 0
4 0 0 0 14 0
<NA> 0 0 0 0 0
This shows that all 1’s are now ‘excellent’, 4’s are now ‘poor’ and so forth.
CESD
The numeric variable CESD
represents the depression index scale, which is a sum of 20 component variables. A high score indicates a person who is depressed, with 16 being the typical cutoff for creating a binary indicator of depression.
summary(mydata$cesd)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 3.000 7.000 8.884 12.000 47.000
hist(mydata$cesd)
There are no values outside of the expected range, and no missing values.
Indicator of Depression
Create the binary indicator of depression using CESD
$depressed <- ifelse(mydata$cesd > 16, "depressed", "not depressed")
mydatatable(mydata$depressed, useNA="always")
depressed not depressed <NA>
45 249 0
No missing values were accidentally generated,
Export cleaned data set
Keep only selected variables and save to an external file
<- mydata %>% select(age, marital, cesd, depressed, health_cat)
clean save(clean, file=here::here("data/depression_clean.Rdata"))