Instructions and Overview

For this assignment, you will import the dataset you will be working with for the entirety of the project into R, and then we will clean the dataset and examine it further. This will be the first lab in which you be coding. As you work through this lab, you should run each of the code chunks in order. Be sure to reference lab 1 if you cannot remember how to run code chunks. In the “Data Cleaning” section, running the code in order and only once is particularly important as you can overwrite some of your cleaning steps if you run the code more than once or out of order. If this happens, you will need to re-import the relevant datasets.

I have a standard format for how I instruct you to fill in code below. Coding prompts will be described outside of each code chunk. Inside each code chunk you will find specific instructions for what to do to fill in the code listed at the top. In most code chunks, I also include a line that can serve as a template for how the code should be filled in. In this line, df stands for data frame and marks where the name of your data frame should go in the code. VARIABLE_NAME refers to a variable in your data frame and marks where a variable should be referenced in the code.

These instructions and templates will have a ‘#’ in front of them. This is known as a comment. We include a comment in order to mark text in the code chunk that should not be executed. When you remove the ‘#’ or uncomment it, the specific uncommented text will become executable. If you try to remove a comment in front of text that R doesn’t recognize and then run the code, you will get an error. Be sure to not uncomment the instructions or the template as this is text that R will not recognize.

For all of the prompts in this lab, I have started the code for you, and left blanks for you to fill in. The lines that you will be filling in are currently commented out. As you work through the lab, you will be filling in the blanks based on variables in your own dataset, uncommenting just the one line in the code chunks that you filled in (according to the directions that I provide at the top of the code chunk), and then running the code.

Run the following code to load the libraries for this week.

#Run this code chunk to load the tidyverse package and the lubridate package.

library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.1     ✓ purrr   0.3.4
✓ tibble  3.0.1     ✓ dplyr   1.0.0
✓ tidyr   1.1.0     ✓ stringr 1.4.0
✓ readr   1.3.1     ✓ forcats 0.5.0
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
library(lubridate)

Attaching package: ‘lubridate’

The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union

Data Import

In some cases, you will be able to import your datasets directly from the sites where they are hosted. If you navigate to the portal where the data is hosted, and see a link to “Download” or “Export” the data as a CSV or a Spreadsheet, you can right click on that link and then select “Copy Link Address.” You’ll want to copy this link into the code below.

In portals run on Socrata, you will see an option to access the data via API. If you click on the API button, you will see a link to the API Endpoint. Be sure to select CSV from the dropdown next to this link before clicking Copy. You’ll want to copy this link into the code below. Note that each Socrata portal places limitations on the extent of data that can be accessed via API. If you notice that not all rows of the dataset are importing after running the code below, you will want to read through the API docs to determine the steps you would need to take to access the full database. I provide an example of this in the code below.

A final option is to download the data to your local machine and store it in the “datasets” folder in this repository. If you choose this option, you will be able to reference the location of the data with the path “datasets/[FILE.csv]”

For our example datasets:

  • The DHS has made the hospitals dataset available for download on the HIFLD site.
  • The New York Times team has made the Covid-19 cases dataset available in their own GitHub repo. I include the link to that data below.
  • The Center for Medicaid and Medicare Services has made the In-Patient Medicare Provider Utilization adn Paymnent data available for download on their open data portal, run by Socrata.

I can import all of these datasets into RStudio by reading these files in as a CSV, using the function read.csv(“[URL_OR_PATH_TO_DATE_FILE]”). When we read this data file into R, it will import as a data frame (reference lab 1 if you do not know what a data frame is). We will store that data frame in a variable using <- so that we may reference the dataset throughout the lab.

If you want to know why we are setting stringsAsFactors to FALSE …it’s a long story, which this blog tells much better than I can.

#Run this code chunk to import our example datasets.

hospitals <- read.csv("https://opendata.arcgis.com/datasets/6ac5e325468c4cb9b905f1728d6fbf0f_0.csv", stringsAsFactors = FALSE)

ipps <- read.csv("https://data.cms.gov/api/views/tcsp-6e99/rows.csv?$limit=200000", stringsAsFactors = FALSE)
#Notice how in the link above, I've included $limit=200000? This is because by default this Socrata portal only provides access to 100000 rows of data via API. After checking the number of rows in this dataset on Socrata (196,000), I checked the API docs to see how to increase the number of rows that could be accessed via the API and learned that I could include $limit=[NUMBER_OF_ROWS] at the end of the API Endpoint URL to increase the limit.

cases <- read.csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv", stringsAsFactors = FALSE)
#Uncomment the line below and replace df with a variable name to describe your dataset and [URL_OR_PATH_TO_DATE_FILE] with the URL to your dataset or the local path to your dataset.  Be sure to remove the brackets but to keep the quotation marks. 

#df <- read.csv("[URL_OR_PATH_TO_DATE_FILE]", stringsAsFactors = FALSE)

Data Cleaning

What is the structure of your dataset?

str() gives us an overview of the structure of the dataset, including the number of observations, the variable names, and each variable’s type. For instance, check out how we would run str() on the hospitals dataset by running the code chunk below.

#Run this code chunk to see the structure of hospitals.

str(hospitals)

Run this function for your own dataset.

#Uncomment the line below by removing the '#' in front of the line and replace df with your own variable. Then run the code chunk to check out the structure. 

#str(df)

Are any of the variables in your dataset incorrectly data typed?

In R, the basic data types include:

  • numeric (num): numbers that may contain decimals
  • integer (int): whole numbers
  • character (chr): characters
  • logical (logi): TRUE/FALSE
  • complex: complex numbers

We can check the type of a variable as follows:

#Run this code chunk to check out the type of ID in the hospitals data frame.

#typeof(df$VARIABLE_NAME)

typeof(hospitals$ID)

There was only one variable in the hospitals dataset that we may want to convert to a different type. At first glance you may think its COUNTY_FIPS, which loaded as a character even though it looks like a number. We in fact want to keep this a character. County FIPS IDs have two parts - the first two digits represent a census-standardized state code and the second three digits represent a census-standardized county code. For example, California’s state code is 06, and Yolo County’s county code is 113, creating the FIPS code 06113. Just like we expect 5-digits in a postal code, systems that reference this number expect a certain number of digits. However, if the COUNTY_FIPS gets treated like a number, R will strip that leading zero in front of ‘6’ in California’s state code, and the COUNTY_FIPS will import as 6113. If instead, we treat the COUNTY_FIPS as a character, this leading zero won’t get stripped.

With this in mind, we can see above that ZIP imported as a number, and we are going to want to transform ZIP into a character and add leading zeros until the string is five digits long. We will do this in a later step. The code below shows you all of the ZIP codes in the dataset that are missing leading zeros because they were imported as a number.

#Run this code chunk.

hospitals %>%
  filter(ZIP < 10000) %>%
  select(NAME, ZIP) %>%
  head(10)

See how the ZIP codes above are not five digits? This is because they imported as numbers and their leading zeros were stripped.

When we call str(), we can see the data type of variable in the dataset. Scanning over the output of calling str() on your dataset, does it appear as though any of the variables are of the wrong type? For instance, values typed as chr that should be num? List any variables that are not the correct type.

Fill your response here. 

Do you have any variables in your dataset that should be numeric but are currently of type character?

This often happens when there are characters like commas, dollar signs, percent signs in the numeric column. We need to strip these characters before converting the variable to numeric.

Let’s remove the characters that are appearing in these column with the gsub function, which replaces a character with another character - in this case with nothing.

#Uncomment the last line, and fill the unwanted character, your data frame name, and the variable name. Copy and paste this line for each variable for which you need to substitute a character and fill accordingly. 

#df$VARIABLE_NAME <- gsub("UNWANTED CHARACTER", "", df$VARIABLE_NAME)
#_____$_____ <- gsub("_____", "", _____$_____)

Do you need to change the data type of any variables (including the char to numeric conversion you prepared for above)?

Let’s go ahead and convert that numeric ZIP variable in the hospitals dataset into a char by casting the variable as.character().

#Run this code chunk to convert ZIP from numeric to character.

hospitals$ZIP <- as.character(hospitals$ZIP) 

Following the same pattern, change the type of any incorrectly typed variables in your own dataset.

#Uncomment the last line, and fill the appropriate conversion type, your data frame name, and the variable name. Copy and paste this line for each variable that you need to convert to a different type and fill accordingly.

#df$VARIABLE_NAME <- as.numeric(df$VARIABLE_NAME) 

#Fill with one of the following: as.numeric, as.character, as.logical)

#_____$_____ <- _____(_____$______)

This will overwrite the values in that variable with the same values but of the correct type.

Do you need to add leading zeros to any values in your data?

As described above, sometimes we need to add characters to values in our dataset for that value to be an exact number of digits. For instance, ZIP codes should be 5-digits long regardless of whether they start with the number 0. After we’ve converted such numeric values to a character, we can pad the front of the string with a certain character until the string is the required length. For the hospitals dataset, we will need to add leading zeros to the ZIP codes we just converted to characters until they are all 5 characters in length. We can use str_pad() to do this. str_pad() is a function in the stringr package, which is included in the tidyverse. As arguments, it takes the vector of values you would like to pad, the number of characters that should constitute each value in that vector, and the character you would like to pad the value with.

# The code below takes the ZIP variable in the hospitals dataset and places the "0" character in front of each value in that variable until that value is 5 characters long. Run the chunk.

hospitals$ZIP <- str_pad(hospitals$ZIP, 5, pad = "0") 

hospitals %>%
  filter(ZIP < 10000) %>%
  select(NAME, ZIP) %>%
  head(10)

If needed, do the same to a variable in your own dataset. Otherwise, skip the code block below.

#Uncomment the last line, and fill the appropriate data frame, variable, and desired number of digits. 

#df$VARIABLE_NAME <- str_pad(df$VARIABLE_NAME, [number of digits], pad = "0") 

How are Null values represented in your dataset?

Null values should appear as a greyed-out and italicized NA (Not Available). This communicates to R that this is an empty value or, in other words, that there is not data here. However, if not properly formatted when we import the dataset, you may see Null values appear as:

  • “NULL”
  • empty strings ("")
  • “NONE”
  • “NOT AVAILABLE”
  • “N/A”

In the ipps dataset, there are no NA values. We can double check this by examining the first several rows of the data and calling colSums(is.na()).

#Run this code to see the number of null values in each column.

ipps %>% head(10)

colSums(is.na(ipps))
                            DRG.Definition                                Provider.Id                              Provider.Name 
                                         0                                          0                                          0 
                   Provider.Street.Address                              Provider.City                             Provider.State 
                                         0                                          0                                          0 
                         Provider.Zip.Code Hospital.Referral.Region..HRR..Description                           Total.Discharges 
                                         0                                          0                                          0 
                   Average.Covered.Charges                     Average.Total.Payments                  Average.Medicare.Payments 
                                         0                                          0                                          0 

Sometimes, when working with data, NA values will not appear in the first several rows. This is the case with the cases data. Run the code below, and in the tab that opens, scroll to row 417.

#Run this code to see the number of null values in each column.

View(cases)

Fortunately, the NA values in this dataset are coded correctly - as greyed out and italicized NA. Data dictionaries can sometimes be helpful in documenting how NA values are recorded but not always. Often you will need to do deeper investigation on your data.

In the hospitals dataset, we can see by calling head() that empty data is filled with the string “NOT AVAILABLE”. (Check out the variables ZIP4 and TELEPHONE below). We want to convert such values to NA values. We can also see that, in some cases, -999 POPULATION and BEDS are reported. This signals to us that -999 is being used to indicate that data is not available. We can confirm this in the data dictionary. In fact, quite often -999 is used to indicate null values.

#Run to check out the first ten rows of hospitals

hospitals %>% head(10)

We will select all of the values equal to (“==”) “NOT AVAILABLE” in the hospitals dataset and equal to (“==”) -999 in the hospitals dataset, and convert them to NA. Check out how I do this below.

#Run this code chunk to convert to NAs.

is.na(hospitals) <- hospitals == "NOT AVAILABLE"
is.na(hospitals) <- hospitals == -999

Check out how null values are appearing your own dataset by following the commented instructions below. Note that I’m assuming that a null value appears in the first ten rows of your dataset. This may not be the case, and if we discover so later in the lab, we may need to come back to this cleaning step.

#Uncomment the last line, and fill in your data frame name to view the first ten rows of the data frame.

#df %>% head(10)

#_____ %>% head(10) 

If necessary and where appropriate, convert values to NA in your own dataset by following the instructions below.

#Uncomment the last line, and fill in your data frame name to view the first ten rows of the data frame.

#is.na(df) <- df == "unwanted string"  

#For example, if "NULL" appears in your dataset:
#is.na(df) <- df == "NULL"

#is.na(_____) <- _____ == "_____" 

Do you have any variables in your dataset that refer to specific dates?

Dates can be converted to a date format using the lubridate package. This is a package in the Tidyverse that makes it possible to extract specific information (such as month or year) from dates, and to compute with dates.

The hospitals dataset has two date variables: SOURCEDATE and VAL_DATE. On import, they both have the following format: yyyy-mm-ddThh:mm:ss.000Z Because the format is in the year-month-day hour:minute:second format, we will call ymd_hms() on the variable.

#Run this code chunk to convert to date formats.

hospitals$SOURCEDATE <- ymd_hms(hospitals$SOURCEDATE)
hospitals$VAL_DATE <- ymd_hms(hospitals$VAL_DATE)

If they were instead listed in the month-day-year hour:minute:second format, we would instead call mdy_hms() on the variable.

If they were just in the year-month-day format, we would instead call ymd() on the variable. This is the case for the date variable in the cases dataset, so we will convert it below:

#Run this code chunk to convert to date formats.

cases$date <- ymd(cases$date)

Check out the format of your date. Is it just a year? Just a month? A year, month, and day? Are there times listed? What order are each of these values listed in? This link offers more information about how to structure date conversions. If the date is just a year, we can leave it as an int. Otherwise, we will convert the date to a date format. If you need to convert a date in your dataset, follow the instructions below:

#Uncomment the last line, and fill the appropriate data frame name, variable name, and date format. 

#df$VARIABLE_NAME <- date_format(df$VARIABLE_NAME)

#For example, if the date is in month day, year (March 1, 1999) format:
#df$VARIABLE_NAME <- mdy(df$VARIABLE_NAME) 

#_____$_____ <- _____(_____$_____)   

Data Exploration

At this point, we will begin exploring and getting to know your data. We will be learning a number of functions that are made available through dplyr - a package in the Tidyverse that enables us to manipulate and transform data. The four primary functions we will be working with this week and next through dplyr include:

  • select() : select variables
  • filter() : return only observations that meet a particular criteria
  • group_by() : group observations according to a common value
  • summarize() : perform an operation and return a single value

In this lab, we will focus on the first two - select() and filter(). You can think of select() as a tool to reference specific columns (or variables) in a rectangular dataset, and filter() as a tool to reference specific rows (or observations) in a rectangular dataset.

In the functions below, you are going to see the following set of characters often: %>% This is known in the Tidyverse as a ‘pipe’. A pipe connects different functions into one line of code (sort of like a conjunction in a sentence). You can think of the pipe as saying: “and then…” communicating to R that you are going tell it to do something else after the function we just called.

What kinds of variables are in the dataset?

To begin with, we are going to look at the variables in our dataset. You can check out the variables in your dataset in a number of ways, but perhaps the easiest way at this point will be to reference the ‘Environment’ tab in the upper right hand corner of RStudio. Click on the arrow next to your data frame name to see an expanded list of variables in your data frame. (Alternatively, you can call str(df) as we did above.)

Nominal categorical variables are variables that identify something else. They name or categorize something that exists in the world. Sometimes, nominal categorical variables are obvious. For instance, in the hospitals dataset, the hospital NAME is a nominal categorical variable - referring to the actual hospital. CITY is also a nominal categorical variable - referring to the hospital’s city. The hospital TYPE and OWNER are all nominal categorical variables - referring to specific categories the hospital is classed within. However, nominal categorical variables are not always strings. Sometimes, numbers are considered nominal categorical variables. For instance, a ZIP code is not a value that we operate on but instead refers to a certain place; it is a nominal categorical variable. In the hospitals dataset, the NAICS_CODE is a numeric reference to a particular industry classification; it is also a nominal categorical variable. Both OBJECTID and ID are nominal categorical variables referring to the hospital.

In some of your datasets, 0s and 1s will refer to ‘yes’ and ‘no’ in the dataset. This is another case where numbers refer to categorical variables. Always be sure to check if numbers are a tally or measurement of something or if they are referring to something else. If they are referring to something else, often they are a nominal categorical variable.

List three nominal categorical variables in your dataset. Use select() to select these variables in your dataset, and use head(10) to limit the display to the first 10 rows.

Note that you may not be able to list three of each below. This is fine.

#Uncomment the last line, and fill the appropriate data frame name and variable names for your own dataset. Run.

#df %>% select(VARIABLE_NAME1, VARIABLE_NAME2, VARIABLE_NAME3) %>% head(10)

#Here are just a few of the nominal categorical variables in the hospitals dataset
hospitals %>% select(OBJECTID, ID, NAME, COUNTY, NAICS_CODE, TYPE) %>% head(10)

#_____ %>% select(_____, _____, _____) %>% head(10)

Ordinal categorical variables are categorical variables that can be ranked or placed in a particular order. For instance, ‘High’, ‘Medium’, and ‘Low’ have a particular order. In the hospitals dataset, there is one ordinal categorical variable - TRAUMA, which characterizes the hospital’s trauma level designation. Trauma level designations indicate the extent of resources available at a hospital to deal with certain categories of trauma. It is most often broken into Level I through Level V. We can see how a data analyst may want to place trauma categories in a particular order (for instance, ordering hospitals from highest to lowest trauma levels). However, this is a particularly complicated categorical variable to work with. This is because Trauma levels are not defined according to a national standard. Instead, they are defined on a state-by-state basis, and our dataset spans all US states. Level II in one state might mean something different than Level II in another state despite both being labeled Level II in the dataset. Further, a single hospital can have multiple trauma levels (e.g. Level I Pediatric and Level II Adult). We would need to take all of this into consideration when comparing trauma levels across hospitals on a national scale.

List three ordinal categorical variables in your dataset. Use select() to select these variables in your dataset, and use head(10) to limit the display to the first 10 rows.

Note that you may not be able to list three of each below. Most of you do not have any ordinal categorical variables in your dataset. This is fine.

#Uncomment the last line, and fill the appropriate data frame name and variable names for your own dataset. Run.

#df %>% select(VARIABLE_NAME1, VARIABLE_NAME2, VARIABLE_NAME3) %>% head(10)

#Here is the only ordinal categorical variable in the hospitals dataset
hospitals %>% select(TRAUMA) %>% head(10)

#_____ %>% select(_____, _____, _____) %>% head(10)

Discrete numeric variables are numeric variables that represent something that is countable - the number of students in a classroom, the number pages in a book, the number of beds in a hospital. In the hospitals dataset, POPULATION, BEDS, and presumably TTL_STAFF (though it’s all empty in our dataset), are all discrete numeric variables because they represent things that have been counted.

List three discrete numerical variables in your dataset. Use select() to select these variables in your dataset, and use head(10) to limit the display to the first 10 rows.

Note that you may not be able to list three of each below. Some of you do not have any discrete numeric variables in your dataset. This is fine.

#Uncomment the last line, and fill the appropriate data frame name and variable names for your own dataset. Run.

#df %>% select(VARIABLE_NAME1, VARIABLE_NAME2, VARIABLE_NAME3) %>% head(10)

#Here are the discrete numeric variables in the hospitals dataset
hospitals %>% select(POPULATION, BEDS, TTL_STAFF) %>% head(10)

#_____ %>% select(_____, _____, _____) %>% head(10)

Continuous numeric variables are variables that would take an infinite amount of time to precisely count. You can think of these as variables in which it is always possible to measure the value more precisely. For instance, time would be considered a continuous numeric variable because time can be measured with infinite amount of specificity - hours > minutes > seconds > milliseconds > microseconds > nanoseconds … and so on. Ruler measurements are also continuous because they can also be measured with infinite more precision. In the hospitals dataset, both latitude and longitude are continuous numeric variables as we can always measure them with more precision.

While it’s a bit controversial, for the purposes of this assignment, we will also treat ratios as continuous data, so you may list those below.

List three continuous numeric variables in your dataset. Use select() to select these variables in your dataset, and use head(10) to limit the display to the first 10 rows.

Note that you may not be able to list three of each below. Some of you do not have any continuous numeric variables in your dataset. This is fine.

#Uncomment the last line, and fill the appropriate data frame name and variable names for your own dataset. Run.

#df %>% select(VARIABLE_NAME1, VARIABLE_NAME2, VARIABLE_NAME3) %>% head(10)

#Here are the continuous numeric variables in the hospitals dataset
hospitals %>% select(LATITUDE, LONGITUDE) %>% head(10)

#_____ %>% select(_____, _____, _____) %>% head(10)

What makes each observation in your dataset unique?

The last time that I ran this course, one student ran into some issues when trying to make sense of the values reported in her dataset. She was working with a dataset documenting counts of arrests in each California county each year according to the age, gender, and race/ethnic group of arrestee. Check out this data below. (Note that this may take a few moments to load.)

#Run.

ca_arrests <- read.csv("https://data-openjustice.doj.ca.gov/sites/default/files/dataset/2019-06/OnlineArrestDispoData1980-2018.csv", stringsAsFactors = FALSE)

ca_arrests %>% head()

She noted that there were multiple rows reporting different arrest counts in cases where all of the other variables seemed the same - in the same year, same county for the same gender, race, arrest disposition, and age group.

#Run.

ca_arrests %>%
  filter(YEAR == 2001 & COUNTY == "Sacramento County" & GENDER == "Male" & RACE == "White" & ARREST_DISP_CODE == "Released" & AGE_GROUP == "20 to 29")

See how above, in 2001 in Sacramento Couty white male individuals age 20 to 29 that were released had both 0 and 1 felony arrests? How was this possible? The only other values in the dataset were counts of arrests. There was nothing else in the dataset that could make each row unique. So was there 1 felony arrest for this group or 0? We went ahead and emailed OpenJustice - the program that had made the dataset available. I asked the following in an email:

“I’m writing to ask about some issues [a student] came across while analyzing the dataset. It appears that there are several rows in the dataset that report different numeric values but refer to the same set of categories. For example in the attached image, there are three rows that refer to 1980, Alameda County, Male, Other, Complaint Sought, and 18 to 19, but they all report different values. We were wondering if you could explain what makes these rows distinct so we have a better sense of whether it is appropriate to sum them.”

A week later, they responded: “The program that aggregates the raw Arrest Disposition data uses certain variables not present in the output file. This currently creates multiple rows when all other present variables are distinctly filtered for. The Summary Offense Category counts (F_TOTAL, M_TOTAL, etc.) must be summed up for the multiple rows present.”

In this case, we needed to transform the data - adding up the numeric values across rows in which all other categorical values were the same in order to account for other categorical variables that were not present in the public data. (We will go over how to do this data transformation next week.) We would need to do this so that every set of values reported in the data was reported according to a distinct observational unit, or in other words, so that we had a way of uniquely identifying what each row in the dataset referred to.

In starting our data analysis, we need to have a good sense of what each observation in our dataset refers to - or its observational unit. Think of it this way. If you were to count the number rows in your dataset, what would that number refer to? Consider our example datasets by running the code below (no need to fill in the blank at this point.)

# Remember that paste() allows you to create strings that concatenate other strings that you provide, along with other values. We separate all of the components of the string we wish to paste together with commas. We went over this lab 1. Run this code chunk.

paste("I have", nrow(hospitals), "unique _____ represented in my dataset.")
[1] "I have 7596 unique _____ represented in my dataset."
paste("I have", nrow(cases), "unique _____ represented in my dataset.")
[1] "I have 421390 unique _____ represented in my dataset."
paste("I have", nrow(ipps), "unique _____ represented in my dataset.")
[1] "I have 196325 unique _____ represented in my dataset."

Get this statement started for your dataset:

#Uncomment the last line and fill your data frame name in nrow. At this point, you need only fill in the FIRST blank line with your data frame name. Run the code chunk.

#paste("I have", nrow(_____), "unique _____ represented in my dataset.")

To figure out how to fill that second blank in the statement, it is often useful to identify a variable or set of variables that can serve as a unique key for the data. A unique key is a variable (or set of variables) that uniquely identifies an observation in the dataset. For example, in the ca_arrests dataset above, the unique key would be a long combination of variables (the year, county, gender, race, arrest disposition, and age group would uniquely identify each row only after we had transformed it). Think of a unique key as a unique way to identify a row and all of the values in it. There should never be more than one row in the dataset with the same unique key. A unique key tells us what each row in the dataset refers to.

In the hospitals dataset, the unique key is a bit more obvious. There is a variable called OBJECTID that uniquely refers to the geographic coordinates in the dataset, and there is a variable called ID that uniquely refers to the hospital in the dataset. We can confirm that these are indeed unique keys by counting the number of distinct() (or non-repeating) values in this variable and making sure it is equal to the number of rows in the entire dataset. If the distinct values in the variable is equal to the number of rows in the dataset, then we know that the key never repeats and that it can uniquely identify each row.

#Run.

# Count the distinct values in your unique key
n_unique_keys <- 
  hospitals %>% 
  select(ID) %>% 
  n_distinct()

# Count the rows in your dataset
n_rows <- nrow(hospitals)

# Make sure these numbers are equal
n_unique_keys == n_rows

Since the ID field refers to a specific hospital, in this dataset a hospital is what makes each observation unique. In other words, the dataset’s observation unit is a hospital. Now you can confidently say:

#Run.

paste("I have", nrow(hospitals), "unique hospitals represented in my dataset.")

Note that NAME is typically not an appropriate variable to use as a unique key. Let me provide an example to demonstrate this. When I worked for BetaNYC, I was trying to build a map of vacant storefronts in NYC by mapping all commercially zoned properties in the city, and then filtering out those properties where a business was licensed or permitted. This way the map would only include properties where there wasn’t a business operating. One set of businesses I was filtering out was restaurants. The only dataset that the city had made publicly available for restaurant permits was broken. It was operating on an automated process to update whenever there was a change in the permit; however, whenever a permit was updated, rather than updating the appropriate fields in the existing dataset, it was creating a new row in the dataset that only included the permit holder (the restaurant name), the permit type, and the updated fields. Notably the unique permit ID was not being included in this new row. We pointed this issue out to city officials, but fixing something like this can be slow and time-consuming, so in the meantime, we looked into whether we could clean the data ourselves by aggregating the rows that referred to the same restaurant. However, without the permit ID it was impossible to uniquely identify the restaurants in the dataset. Sure, we had the restaurant name, but do you know how many Wendy’s there are in NYC?

In the ipps dataset, the unique ID is less obvious. Because we have values reported for multiple diagnosis categories at multiple hospitals, we need to rely on two variables to signify what makes each observation unique - DRG.Definition and Provider.ID.

#Run. 

# Count the distinct values in your unique key
n_unique_keys <- 
  ipps %>% 
  select(DRG.Definition, Provider.Id) %>% 
  n_distinct()

# Count the rows in your dataset
n_rows <- nrow(ipps)

# Make sure these numbers are equal
n_unique_keys == n_rows
[1] TRUE

In this dataset a diagnosis category and unique provider is what makes each observation unique. In other words, the dataset’s observational unit is a diagnosis category for a given provider. Now you can confidently say:

#Run. 

paste("I have", nrow(ipps), "unique diagnosis categories for a given provider represented in my dataset.")
[1] "I have 196325 unique diagnosis categories for a given provider represented in my dataset."

In the cases dataset, the unique ID is also more complicated. cases counts are listed for each county in each state, each day since January 21, 2020, and some states have counties with the same name. Because of this we will need to reference the state, county code, and the date reporting to uniquely identify each row.

You may be wondering why we are not using the fips code to uniquely identify each state/county. The reason for this is that, for some reports, the county is unknown, and the fips code value is NA. In this case, the fips code will be reported for multiple counties as NA on the same day.

See below:

#Run. 

# Count the distinct values in your unique key
n_unique_keys <- 
  cases %>% 
  select(date, state, county) %>% 
  n_distinct()

# Count the rows in your dataset
n_rows <- nrow(cases)

# Make sure these numbers are equal
n_unique_keys == n_rows
[1] TRUE

In this case, every row in the dataset is a county, state, and date. In other words, the dataset’s observational unit is a county/state/date. Now you can confidently say:

#Run. 

paste("I have", nrow(cases), "unique counties/states/dates represented in my dataset.")
[1] "I have 267174 unique counties/states/dates represented in my dataset."

What variable or combination of variables makes each observation in your dataset unique? Confirm that you are correct below.

# Uncomment below and count the distinct values in your unique key. Note that you may need to select multiple variables. If so, separate them by a comma in the select() function.
#n_unique_keys <- _____ %>% select(_____) %>% n_distinct()

# Uncomment below and count the rows in your dataset by filling in your data frame name.
#n_rows <- nrow(_____)

# Uncomment below and then run the code chunk to make sure these values are equal.
# n_unique_keys == n_rows

What does your unique key refer to? In other words, what is the observational unit of your dataset?

Fill your response here. 

Fill in the statement below, and make sure that it makes sense with your data.

#Uncomment the line below and fill in both of the blanks. Run.

#paste("I have", nrow(_____), "unique _____ represented in my dataset.")

Defining Discrete Observational Units

Anytime we count something in the world, we are not only engaging in a process of tabulation; we are also engaged in a process of defining. If I count the number of students in a class, I first have to define what counts as a student. If someone is auditing the class, do they count? If I, as the instructor, am learning from my students, do I count myself as a student? As I make decisions about how I’m going to define “student,” those decisions impact the numbers that I produce. When I change my definition of “student,” how I go about tabulating students also changes. Thus, as we prepare to count observations in a dataset, it is important to know how those observations are defined. When I say that there are 7581 hospitals in the hospitals dataset, this number does not mean much until I understand how hospitals were defined in the dataset. Which hospitals? In what part of the world? From what time period? Are hospitals that were once open and are now closed included? Are psychiatric hospitals included? Are nursing homes included? Who gets to decide what counts as a hospital?

Analyzing the hospitals data documentation, we find the following statement:

“This feature class/shapefile contains locations of Hospitals for 50 US states, Washington D.C., US territories of Puerto Rico, Guam, American Samoa, Northern Mariana Islands, Palau, and Virgin Islands.The dataset only includes hospital facilities based on data acquired from various state departments or federal sources which has been referenced in the SOURCE field. Hospital facilities which do not occur in these sources will be not present in the database….The database does not contain nursing homes or health centers.”

Knowing how hospitals are defined helps us put the count of hospitals in our dataset into context. It is particularly significant to consider in the era of Covid-19 as policy-makers have debated whether to have nursing homes take on an overflow of Covid-19 patients from hospitals.

How are the observational units in your dataset defined? Note that if you have multiple variables constituting your observational unit, you may select just one to compose your response. Be sure to refer to the data documentation.

Fill your response here. 

Who or what organization manages these definitions? In other words, who gets to decide what counts in this data?

Fill your response here. 

What do our variables tell us about each observation?

Defining Key Variables

In addition to considering how observations in our dataset are defined, we also need to consider how variables in our dataset are defined. This is because the values reported in each variable of the dataset will depend, in part, on how we define that variable. The definitions of these variables should be documented in the data dictionary.

Consider the cases dataset. The cases variable is defined in the documentation as:

“cases: The total number of cases of Covid-19, including both confirmed and probable”

What does it mean that both confirmed and probable cases are included?

Later in the documentation, the New York Times team goes on: > “Confirmed cases and deaths are counts of individuals whose coronavirus infections were confirmed by a laboratory test and reported by a federal, state, territorial or local government agency.” > “Probable cases and deaths count individuals who did not have a confirmed test but were evaluated using criteria developed by states and the federal government.”

Up through April, the US CDC had only been reporting “confirmed cases.” However, on April 5, 2020, the Council of State and Territorial Epidemiologists Centers began recommending that states report both in order to account for highly likely cases that had not received a laboratory test. Some states have began reporting both confirmed and probable cases; other states didn’t. The New York Times can only report the numbers based on what is being reported, so they list in the documentation how each state is reporting their numbers. This is something we have to keep in mind when monitoring the differences in cases across states. Not every state defines a case in the same way!

Now consider the ipps dataset. In the data documentation for this dataset, a section called Population helps us discern the scope of the observations recorded in this data:

The Inpatient PUF includes data on FFS beneficiaries from inpatient providers that submitted Medicare Part A IPPS short term institutional claims during the fiscal year. To protect the privacy of Medicare beneficiaries, any aggregated records which are derived from 10 or fewer discharges are excluded from the Inpatient PUF.

Let’s unpack that a bit. In this case, FFS stands for fee-for-service, indicating an insurance plan in which a provider is paid separately for each service rendered. Medicare has four parts - Part A, Part B, Part C, and Part D. Part A is standard in Medicare coverage and covers short-term hospitalizations specifically (whereas Part C, which is not standard covers long-term in-patient stays). So this definition is telling us that this data covers each claim made by a Medicare Part A patient for a specific service. What is important to note from this definition is that the data is not representative of the payments made by the entire population the hospital serves. Specifically, it does not include any information about payments made by patients not covered by Medicare. It also does not include data about discharges with fewer than ten claims at a given hospital.

Choose a key variable in your dataset. How is this key variable in your dataset defined?

Fill your response here. 

Who or what organization manages these definitions? In other words, who gets to decide what counts in this data?

Fill your response here. 

Values in a Key Categorical Variable

When called on a specific variable, distinct() lists each of the unique values that appear within that variable. This can be useful for determining how different issues are classified in the data. n_distinct() counts the number of distinct values in a variable. This let’s us know how many categories we are dealing with. For instance, I can find out the distinct types of hospitals as well as how many distinct types there are by calling:

#Run the following code. 

#df %>% select(VARIABLE_NAME) %>% distinct()
hospitals %>% select(TYPE) %>% distinct()

#df %>% select(VARIABLE_NAME) %>% n_distinct()
hospitals %>% select(TYPE) %>% n_distinct()

Often in ethnography, it is our job to take something that seems obvious or familiar to us and to question it as if it were strange. When running the function above, we may ask why values are categorized the way that they are - even if those categories seem obvious at first glance.

For instance, in the hospitals dataset, we might ask why it is that we have separate categories for different types of hospitals. With just a bit of research, we find that there is rich history behind these hospital types. For instance, “critical access hospitals” was a designation created in 1997 to improve access to hospitals in rural parts of the US, following an almost two-decade long wave of hospital closures in rural communities. To receive federal funding, critical access hospitals should have no more than 25 beds. See this source. Psychiatric hospitals, while following a similar timeline to the development of general hospitals in the US, developed in response to changing attitudes and understandings of what it meant to be mentally ill. In the 18th century, mental illness was often considered a moral or spiritual shortcoming; however, the increased emphasis on moral treatment of mentally ill patients ushered in a new wave of institutions and wards devoted to the treatment of such patients. See this source.

Let’s check a second variable in the hospitals dataset.

#Run the following code. 

#df %>% select(VARIABLE_NAME) %>% distinct()
hospitals %>% select(OWNER) %>% distinct()
#df %>% select(VARIABLE_NAME) %>% n_distinct()
hospitals %>% select(OWNER) %>% n_distinct()

Upon running this, we might ask why these different hospital business models exist. With just a bit of research, we can find a history of hospital ownership. From this history, we can see that a number of cultural, political, and economic forces has shaped hospital ownership models. In other words, these categories have a rich cultural history and tell us not only about our data, but also about the cultural context in which data gets enumerated.

In some cases, like in the hospitals examples we just looked at, data will only be classified into a few categories. However, in other cases, data can classified into many more categories. Take the DRG.Definition variable in the ipps dataset. This variable includes a code and description for a Medicare Severity Diagnosis Related Group (MS-DRG) classification - which is a standard classification system for grouping diagnoses and procedures at hospitals to facilitate payments for hospital services. Let’s check out the categories in this variable.

#Run the following code. 

#df %>% select(VARIABLE_NAME) %>% distinct()
ipps %>% select(DRG.Definition) %>% distinct()
#df %>% select(VARIABLE_NAME) %>% n_distinct()
ipps %>% select(DRG.Definition) %>% n_distinct()
[1] 563

How did the creators of this classification system (Robert B. Fetter, PhD, and John D. Thompson, MPH, of Yale University) determine how to divide these categories? What counts as a distinct diagnosis or procedure? Why does it matter whether or not there were complications/co-morbidity (e.g. “w cc” above)? The original goal of creating these categories was to curb rising healthcare costs in the US in the 1980s. The categories grouped diagnoses and procedures so that the Center for Medicare and Medicaid Services could vary the rates at which they would reimburse patients for hospital stays according to the variations in the cost for each diagnosis. Throughout the 2000s, the classification was revised to account for varying severities of a particular diagnosis - including separate categories for a diagnosis when the patient experienced complications/co-morbitity (cc) or major complications/co-morbidity (mcc) - since these cases often incurred much higher costs.

Choose a categorical variable in your dataset to explore further. Be sure to select a variable in which the values represented in each row are likely to appear more than once. Select that variable and then call distinct() and n_distinct().

#Uncomment the appropriate lines below, and fill in your data frame and categorical variable name.

#Check the distinct values in the variable
#_____ %>% select(_____) %>% distinct()

#Check the number of distinct values in the variable
#_____ %>% select(_____) %>% n_distinct()

Reflect on the categorization. How are the categories divided? Do any of the categories surprise you? Why? In what ways do the categories reflect a particular cultural moment? Conduct a bit of Web research in order to better understand why they are divided the way that they are. Be sure to cite your sources.

Fill your response here. 

What values are missing?

We can check the number of NAs in each column in your dataset by summing the number number of NAs in each column with the function colSums().

#Run the following code to see how many NAs are in each column of cases.

#colSums(sapply(df, is.na))
colSums(sapply(cases, is.na))
  date county  state   fips  cases deaths 
     0      0      0   2785      0      0 
#Run the following code to see how many NAs are in each column of ipps.

#colSums(sapply(df, is.na))
colSums(sapply(ipps, is.na))
                            DRG.Definition                                Provider.Id                              Provider.Name 
                                         0                                          0                                          0 
                   Provider.Street.Address                              Provider.City                             Provider.State 
                                         0                                          0                                          0 
                         Provider.Zip.Code Hospital.Referral.Region..HRR..Description                           Total.Discharges 
                                         0                                          0                                          0 
                   Average.Covered.Charges                     Average.Total.Payments                  Average.Medicare.Payments 
                                         0                                          0                                          0 

Check the number of NAs in each variable in your dataset by filling in the blanks in the commented code below.

#Uncomment the appropriate lines below, and fill in your data frame. Run.
#colSums(sapply(_____, is.na)) 

Let’s explore a variable with many NAs. This is going to be the first time we see the function filter(). filter() subsets our data to the observations (or rows) that meet a certain criteria. Below, we will filter our data to those observations in which a certain variable is an NA. However, we can filter by a number of criteria; for instance, we can filter to those rows with a variable that:

  • equals a particular value: == “VALUE”
  • is less than a particular value: < VALUE
  • is greater than a particular value : > VALUE
  • is less than or equal to a particular value: <= VALUE
  • is greater than or equal to a particular value: >= VALUE
  • is one of a vector of values: %in% c(VALUE1, VALUE2)

Here is how we filter data to the rows in which a certain variable is an NA. I also call head(30) to display the first 30 rows in the dataset.

Let’s start with the cases dataset.

#Run the following code to filter to rows with NA values. 

#df %>% filter(is.na(VARIABLE_NAME)) %>% head(10) #We add head(10) to limit our output to the first ten rows

cases %>% filter(is.na(fips)) %>% head(30)