Instructions and Overview

In this lab, you will begin to outline a research plan for a data ethnography and analysis project. In the first part of this lab, you will document what you currently know about a topic, along with some research questions that, when addressed, could further your understanding of the topic. In the second part of the assignment, you will begin searching for data that could help address some of the empirical research questions that you devise. Throughout this document, you will be expected to fill in responses to prompts in code chunks that have been set not to evaluate when run. You will also be expected to fill in a series of tables. Be sure to reference Lab 1 if you are unsure how to go about filling in these sections.

Topic Selection

I have taught this data ethnography and analysis project in a UC Davis course called Data Sense and Exploration twice now, and each time, to help students scope research topics, I have proposed a theme. In Spring 2019, the theme was “social and environmental challenges facing California.” In Spring 2020, the theme was “US social vulnerability in the wake of a pandemic.” In these courses, students have selected topics such as:

  • the extent of student debt post-graduation
  • social support for homelessness in major cities
  • changes in avian migratory patterns
  • the impact of industrial GHG emissions on marginalized communities
  • FEMA response to wildfire damage
  • differences in eligibility for supplemental nutrition programs across the US
  • changes in domestic violence prevalence over time
  • access to broadband and Internet infrastructure in NYC

The notebooks in this Lab Book are designed to support data research on just about any topic. If you are not already coming to this Lab Book with a particular topic, consider, as you select one, what issues most matter to you, your family, and your community in the contemporary moment.

What topic am I investigating? (just one phrase here is fine.)

Fill response here. 

Why is this topic important to study right now?

Fill response here. 

Project Contexts

At this point, we are going to begin mapping out some of the project contexts. By this I mean that we are going to put your topic into temporal, geographic, and social-cultural perspective. Imagine that your topic is somehow depicted in the center of a plain white sheet of paper. What details would we need to fill into the background in order to bring this topic to life? We would need to add people. We would need to add environments. We would need to communicate the time and place in which the topic was being depicted. The eight questions below encourage you to draw out these contexts. Note that there is not necessarily a wrong way to answer these questions, and you absolutely do not need to answer them comprehensively. If we tried to respond to them “fully,” we’d probably go on writing forever! Instead, I’d like you to curate just a few things that come to mind when you consider your topic. Having some of these contexts written out will help you as you are searching for data. For each question below, please respond in 1-2+ complete sentences.

What are some key events, dates, or years relevant to this topic? This might be a long span of time or a specific event.

Fill response here. 

What are some key sites, locations, or geographies relevant to this topic? This might be a large boundary like a country or a small community.

Fill response here. 

What social groups are impacted by this topic, and how?

Fill response here. 

How is the environment implicated in this topic?

Fill response here. 

How is the economy implicated in this topic?

Fill response here. 

How are politics implicated in this topic?

Fill response here. 

How are you implicated in this topic?

Fill response here. 

What are some of the most common ways different social groups talk about this topic?

Fill response here. 

Research Questions

Empirical research questions are questions that an analyst can assess evidence to address. Examples of empirical research questions include:

  • Do United States hospitals have enough beds to accommodate the expected influx of Covid-19 cases? At what point will there not be enough beds available?
  • In which US states are hospitals better equipped with beds and staff to take on an influx of patients over the course of the next month?

Empirical research questions are specific to a particular time and place. Notice how I delimit my questions to the US above and the second question to a specific month. Now think about your topic. What questions could you ask about your topic that would contribute to an understanding of the topic’s prevalence, how the topic impacts diverse communities, how the topic has changed over time, or how communities are equipped to respond to the topic? Be sure that your question is one that you can assess evidence to address, and that it is specific to a certain time and place. If your topic is mental health, you might ask:

  • In 2020, how many people in the United States have insurance that covers therapy visits?

Note how, with the right data, I could answer this question definitively.

What empirical research questions might I address in my research? List at least two.

Fill response here. 

Beware of setting up your question as a dichotomy! Did you use the word ‘or’ in your question above? I’ve seen many students do this in past assignments - asking questions like “Was this legislation beneficial to local communities, or was it harmful?” or “Did this technology fix inequities in the community, or did it sustain them?” Each of these questions structures the research to test only two conditions. Yet, when it comes to studying complex contemporary issues, things are never black and white; the situation we are examining is almost certainly more complicated than two conditions can capture. We should avoid structuring our research questions around mutually exclusive categories. By building false dichotomies into our research questions, we risk oversimplifying complex causation, and we limit what the research can say.

Social-Theoretical questions are questions about how certain social, political, or environmental phenomena operate at a broad scale. They are much broader than empirical research questions. We often cannot answer these questions with one research project. However, we often aim to increase understanding of these questions through a research project. Examples of social-theoretical questions include:

  • How has the United States prioritized critical health infrastructure?
  • How do national hospital business models implicate social vulnerability in the wake of a public health crisis?
  • How are existing structural inequalities exacerbated in a pandemic?

Note how for each of these questions I would need to examine lots of different data and carry out a number of different projects to answer them.

Now think about your topic. What broader questions about how social, political, or environmental phenomena operate might you wish to address through research?

What social-theoretical questions might I address in my research? List at least two.

Fill response here. 

What “ideal datasets” would help me address my empirical research questions?

Fill in the table below with at least 5 datasets that would help you address your empirical research questions. In the first column, characterize the topic of the dataset. In the second and third columns, describe its geographic and temporal scope - be specific. In the last column, mark whether you believe such a dataset is accessible.

| Dataset | Geography | Timespan | Do you think this data is accessible? |
| --- | --- | --- | --- |
| Number of confirmed cases of Covid-19 | All countries across the globe | January 2020 to present | Yes |
| Fill | Fill | Fill | Fill |
| Fill | Fill | Fill | Fill |
| Fill | Fill | Fill | Fill |
| Fill | Fill | Fill | Fill |
| Fill | Fill | Fill | Fill |

Background on Open Government Data

In May 2009, Data.gov - a web portal for accessing US government datasets - was launched by then federal Chief Information Officer Vivek Kundra. Following this, in December 2009, then US President Barack Obama signed the Open Government Directive, requiring that all federal agencies post at least 3 high-value datasets on Data.gov within 45 days. A few years later, in May 2013, Pres. Obama signed an Executive Order, “Making Open and Machine Readable the New Default for Government Information.”

The Order required that the US Office of Management and Budget, in collaboration with the CIO and CTO, issue and oversee an Open Data Policy. This policy required the following:

  • Data needs to be published in machine-readable formats
  • Data needs to be licensed openly
  • Data needs to be described with metadata

See here for more details.
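To make the metadata requirement concrete, here is a sketch of what a catalog record for a dataset on Data.gov can look like. The field names follow the federal Project Open Data metadata schema; every value below is invented for illustration:

```python
# A sketch of a Data.gov-style catalog record. The field names come from the
# federal Project Open Data metadata schema; all values here are invented.
record = {
    "title": "Building Permits Issued, 2012-Present",
    "description": "All construction permits issued by the city, updated monthly.",
    "keyword": ["permits", "construction", "housing"],
    "modified": "2020-04-01",
    "publisher": {"name": "Example City Open Data Program"},
    "accessLevel": "public",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "distribution": [
        {
            "mediaType": "text/csv",  # a machine-readable, open format
            "downloadURL": "https://data.example.gov/permits.csv",
        }
    ],
}
```

Notice how one record speaks to all three requirements: the distribution’s media type signals a machine-readable format, the license field signals open licensing, and the record itself is the metadata describing the data.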

Note that these are only requirements for data produced through the federal government. Cities, states, and counties have their own open data programs, policies, and laws, which are sometimes more and sometimes less stringent than the federal policy. However, most open data policies, in some way, deal with the three issues listed above - machine-readability, licensing, and metadata.

Let’s talk about what each of these issues entail:

Machine-readable

Last quarter, in my Hack for California research cluster, one project was examining gentrification over the past ten years around and near the UC Davis Medical Center in Sacramento. One of the indicators we were examining in relation to gentrification was the number of construction permits the city had awarded in that area for demolitions, new buildings, and remodeling. The City of Sacramento has construction permit data from 2012 to present stored in Excel files on their website - one file of permits per month. One of the students in the research cluster (and in fact one of your stellar classmates) was able to write a script to download each of these files and bind them into one large file. However, we were examining gentrification over a much longer period of time and needed construction permits dating back to 1990. We knew that the City had this data because they had produced a public map, where you could search for an address in Sacramento and retrieve every construction permit it had been awarded since the early 1980s. We needed a data file that listed this for every address in the city. With this in mind, I submitted a public records request asking for the following: “I’m looking for a data file of all construction permits issued across Sacramento from 1990 to present in a downloadable, machine-readable format.”
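A download-and-bind script like the one the student wrote could be sketched roughly as follows. This is a hypothetical reconstruction, not the actual script: the URL pattern is invented, and I assume CSV files for simplicity, where the city actually publishes Excel files (reading those would additionally require a library such as openpyxl):

```python
import csv
import io
import urllib.request

# Hypothetical URL pattern -- the city's real file names and locations differ.
BASE_URL = "https://data.example.gov/permits/{year}-{month:02d}.csv"

def download_month(year, month):
    """Fetch one month's permit file and parse it into a list of rows."""
    with urllib.request.urlopen(BASE_URL.format(year=year, month=month)) as resp:
        return list(csv.reader(io.StringIO(resp.read().decode("utf-8"))))

def bind_tables(tables):
    """Stack the monthly tables into one, keeping only the first header row."""
    combined = []
    for table in tables:
        if not table:
            continue
        # Append the whole first table; for later tables, skip the header row.
        combined.extend(table if not combined else table[1:])
    return combined
```

Looping `download_month` over every year and month since 2012, then passing the results to `bind_tables`, yields one large table of permits ready for analysis.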

Two weeks later they sent me back a 9,543-page PDF document listing every construction permit awarded in the city since 1990. I took a deep breath. Had they sent this in a CSV file, we could have gotten to work immediately. It would take hours to get this hefty document into a format we could work with. I really hate PDFs.

Machine-readable data are data that can be readily processed by a computer. Typically machine-readable data are structured in ways that many different computer applications can recognize. As my story above indicates, there are different degrees of machine-readability of digital data:

  • Data stored in tables in PDFs (files with extension .pdf) are perhaps some of the least machine-readable data that we will encounter in our data research. This is because data in tables in PDFs are not structured in such a way that computer software can easily reference and operate on specific values.
  • Data stored in Excel files (files with extension .xls or .xlsx) are a bit more machine-readable. We can import the data into Excel, and the software will recognize how the data are structured, separate values into individual cells, and thus present the data in a way that we can operate on. However, to open Excel files, we need access to Microsoft products, which we have to pay for. This means that the Excel file is not stored in an open format.
  • Data stored in Comma Separated Value (files with extension .csv) files are more machine-readable than data stored in Excel files. This is because data in CSV files are structured in an open format - or a format that is not dependent on proprietary software. A CSV file is a text file structured so that each line in the file designates a record and various fields for describing that record are separated by commas. If we were to open a CSV file in a basic text editor, it would look something like this:

    Name, Age, Birth Month, Time on Phone
    Sally, 23, 3, 42
    Julie, 40, 2, 98
    Mark, 14, 8, 120

However, if we were to open the same file in Excel, each value would be separated into its own cell. A CSV file is software independent. As a standardized way of displaying data, just about any computer application that displays data is prepared to read a CSV file and format it for display.
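To see that software independence in action, here is a minimal Python sketch using only the standard library’s csv module to parse the little table above. Each line becomes one record, split into fields at the commas:

```python
import csv
import io

# The same text we would see in a basic text editor.
raw = """Name, Age, Birth Month, Time on Phone
Sally, 23, 3, 42
Julie, 40, 2, 98
Mark, 14, 8, 120"""

# csv.reader yields one list of fields per line; strip the spaces after commas.
rows = [[field.strip() for field in row] for row in csv.reader(io.StringIO(raw))]

header, records = rows[0], rows[1:]
print(header)      # ['Name', 'Age', 'Birth Month', 'Time on Phone']
print(records[0])  # ['Sally', '23', '3', '42']
```

Any language or application with a CSV parser can recover exactly this structure, which is what makes the format so portable.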

While we won’t work with such formats in this course, data can be made even more machine-readable than a CSV file. Formats like XML and RDF allow us to structure data with much more specificity. They are often considered the gold standards of machine-readability. Sir Tim Berners-Lee, the inventor of the World Wide Web, often uses this diagram to outline the degrees of machine-readability of open data (if you want to really get me going, ask me the difference between the Internet and the Web!):

[Image: 5-star open data scheme]

In the above chart, the acronyms are short for the following:

  • OL: Available online
  • RE: Machine-readable
  • OF: Open format
  • URI: Each observation has a unique resource identifier so that we may refer to it persistently
  • LD: Data is linked to other datasets

Licensed Openly

Just because something is available on the Web does not necessarily mean that we are free to download and use it as we please. Historically, different government agencies would allow access to certain datasets for a fee that would help to cover the costs of running public data programs. (Oftentimes, we hear this referred to as data being behind a “paywall.”) With the US Open Government Directive, all data produced by any federal US agency would default to the public domain. When data are in the public domain, they are owned by the public; they are not subject to any copyright or intellectual property law and can be accessed, modified, reproduced, and distributed without restriction.

Data acquired by any federal US agency needed to be given an open data license that met the following criteria:

  1. The license allowed for data reuse and modifications.
  2. The license could not restrict any form of redistribution of the data.
  3. The license could not exclude any person or group from these rights.

There are a few global licenses that government agencies can apply to data that meet these criteria. One such license is the Creative Commons CC0 1.0 Universal Public Domain Dedication (CC0 1.0).

Creative Commons is a non-profit organization that aims to increase the availability of creative works that the public can remix and share. Creative Commons has created a number of free licenses that the public can apply to their own creative works in order to designate the extent to which others can modify and redistribute them. These licenses indicate whether individuals other than the content creator may share the work, remix the work, and/or make money off of the work, along with whether such individuals have to attribute the content creator when sharing it. The following image outlines a number of Creative Commons licenses from most open to least open. You’ll notice that CC0 1.0 - the license compatible with each of the criteria listed above - is at the top of the image.