Datasets

Click on the title links to download the data. If you have any problems downloading, please contact me before the class in which the data will be used.

Class exercises

Download Week 1

Download Week 3

Download Week 4

Download Week 5

Download Week 6

Download Week 7

Data used in reporting this story, which revealed that some of the doctors paid as “experts” by the drug company Pfizer had troubling disciplinary records:

Download Week 8

Download Week 9

Download Week 10

Download Week 11

Download Week 13

Download Week 14

Final projects

These datasets are suggestions; each contains stories to be found and visualized. You are also encouraged to work on other datasets.

Baseball statistics

Lahman’s Baseball Database contains a wealth of data on players, managers and teams from 1871 to 2014. Download the data in a series of CSV files from here.

This file documents the tables and fields, and how the tables should be joined together. For the player tables, playerID is the unique code for each player, which can be used to make joins. When loading this data into SQLite, this field can be used as a primary key for the tables in which it appears. For tables relating to teams or managers, you should create a new primary key, as we did in week 5 for the FDA data.
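Because a primary key has to be unique in every row, it can be worth checking a field before relying on it. The sketch below does this in R rather than SQLite, using two tiny made-up tables that stand in for a player table and a batting table. The values and the row_id name are assumptions for illustration only.

```r
# toy stand-ins for a player table and a batting table (values invented for illustration)
players <- data.frame(
  playerID = c("p01", "p02", "p03"),
  nameLast = c("Alpha", "Beta", "Gamma"),
  stringsAsFactors = FALSE
)
batting <- data.frame(
  playerID = c("p01", "p01", "p02"),
  yearID = c(2000, 2001, 2000),
  stringsAsFactors = FALSE
)

# anyDuplicated() returns 0 when every value in the field is unique
players_key_ok <- anyDuplicated(players$playerID) == 0
batting_key_ok <- anyDuplicated(batting$playerID) == 0

# where a field repeats, a simple running number can act as the key instead
# (row_id is a hypothetical name)
batting$row_id <- seq_len(nrow(batting))
```

Here playerID passes the check in the toy player table but not in the toy batting table, where the same player appears in more than one season.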

The Lahman database is also available as an R package.

# install Lahman package
install.packages("Lahman")

# load the Lahman package
library(Lahman)

# view, for example, the Master table
View(Master)

Although you will not see them as objects in your Environment tab in RStudio, each of the tables in the database is now available as a data frame. If you wish, you can convert them into objects in your local environment with something like the following code:

master <- Master

You can use the dplyr package to join, filter and aggregate the data as required.
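As a rough sketch of what that can look like, the code below filters, aggregates and joins two tiny made-up tables that borrow the playerID, yearID and HR field names from the Lahman data. The values, and the players and batting object names, are invented for illustration; swap in the real tables for your own analysis.

```r
library(dplyr)

# toy stand-ins for a player table and a batting table (values invented for illustration)
players <- data.frame(
  playerID = c("p01", "p02"),
  nameLast = c("Alpha", "Beta"),
  stringsAsFactors = FALSE
)
batting <- data.frame(
  playerID = c("p01", "p01", "p02", "p02"),
  yearID = c(1999, 2001, 2000, 2002),
  HR = c(10, 20, 5, 7),
  stringsAsFactors = FALSE
)

# total home runs per player from 2000 onward, with player names attached
hr_totals <- batting %>%
  filter(yearID >= 2000) %>%
  group_by(playerID) %>%
  summarize(total_HR = sum(HR)) %>%
  inner_join(players, by = "playerID") %>%
  arrange(desc(total_HR))
```

Filtering before aggregating keeps the seasons you do not want out of the totals; the join is left until last so that it only has to match one row per player.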

North Atlantic storms

The file storms.csv contains data on tropical storms and hurricanes compiled by the Hurricane Research Division of the U.S. National Oceanic and Atmospheric Administration. I have processed the raw data to give the following fields:

This file contains data on storms from 1851 to 2015. However, you may wish to restrict your visualizations to storms from 1990 and later, as data on storms before the modern satellite era is less reliable.

This collection of data is good for mapping. If you need shapefiles for context and basemaps, try Natural Earth. These shapefiles each come with a README.html file that can be opened in a web browser for more information.

Wealth and well-being of nations

In its World Development Indicators, the World Bank has a trove of data on many aspects of countries’ wealth and well-being. There are many stories to be told from this data.

You can download data for individual indicators, or read data directly into R using the WDI package. (Remember that you will need to convert data you download from the World Bank site from wide to long format; the R package will give you data in the correct long format.)
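If you do download a file and need to reshape it yourself, one option is base R's reshape() function. The sketch below uses a tiny made-up table with one row per country and one column per year. The column names and values are assumptions, not the layout of any particular World Bank file.

```r
# toy wide table: one row per country, one column per year (values invented)
wide <- data.frame(
  country = c("A", "B"),
  "2010" = c(1, 2),
  "2011" = c(3, 4),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# every column other than country holds one year of data
year_cols <- setdiff(names(wide), "country")

# stack the year columns into a single value column, with the year alongside
long <- reshape(
  wide,
  direction = "long",
  varying = list(year_cols),
  v.names = "value",
  timevar = "year",
  times = as.integer(year_cols),
  idvar = "country"
)

# drop the row names that reshape() generates
rownames(long) <- NULL
```

The result has one row for each country and year, which is the shape needed for grouping and charting by year.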

In week 5, we saw how to download Gapminder data in bulk. Its data library includes some measures not available from the World Bank, so if you cannot find the data you want among the World Bank Indicators, try searching for it at Gapminder’s data download page.

Global Terrorism Database

Maintained by the National Consortium for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland in College Park, the Global Terrorism Database contains information on more than 150,000 terrorist attacks from 1970 to 2015. It is a rich source of information on terrorist groups across the globe, and the attacks they are responsible for.

You can download the data from here, selecting the Download full GTD dataset option. An extensive codebook details all of the fields in the data.

The data is provided as a series of spreadsheets in .xlsx format. I suggest that you import this data into Open Refine before processing any further, and create a new field giving the date of each event in standard YYYY-MM-DD format. The date can be derived from the eventid field. I can help with this.

You can then export as a CSV for analysis, visualization, and mapping.
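If you would rather create the date field in R than in Open Refine, here is a minimal sketch. It assumes that the first eight digits of eventid give the date as YYYYMMDD, with 00 standing in for an unknown month or day; the example ids are made up, so do check the codebook before relying on this.

```r
# made-up event ids in a 12-digit format; spreadsheets often import these as numbers
eventid <- c(197001020001, 201512250003, 197000000002)

# convert to text without scientific notation
id_text <- sprintf("%.0f", eventid)

# the first eight characters are assumed to be the date as YYYYMMDD
date_text <- substr(id_text, 1, 8)

# parse into dates; ids with an unknown month or day (00) become NA
event_date <- as.Date(date_text, format = "%Y%m%d")

# dates print in standard YYYY-MM-DD format
format(event_date)
```

Events that come back as NA have an incomplete date in the id, so you will need to decide whether to drop them or fall back on the year alone.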

Do take care to read the Terms of Use and instructions for citing the source of the GTD data.

California traffic accidents

The Transportation Injury Mapping System details injury and fatal traffic accidents for the whole of California. The data comes from the California Highway Patrol’s Statewide Integrated Traffic Records System and was then geocoded for mapping by UC Berkeley’s Safe Transportation Research & Education Center.

You will need to create an account. I can help if you have problems querying and downloading data.

The codebook explains the fields in these tables, and how they should be joined.