Datasets

Click on the title links to download the data. If you have any problems downloading, please contact me before the class in which the data will be used.

Class exercises

Download Week 1

Download Week 3

Download Week 4

Download Week 5

Download Week 6

Download Week 7

Download Week 8

Download Week 9

Download Week 10

Download Week 11

Download Week 12 (unzip the folder and place it on your desktop)

Download Week 13

Download Week 14

Final projects

These datasets are suggestions; each contains stories to be found and visualized. You are also encouraged to work on other datasets.

Baseball statistics

Lahman’s Baseball Database contains a wealth of data on players, managers and teams from 1871 to 2015. Download the data in a series of CSV files from here.

This file documents the tables and fields, and how the tables should be joined together. For the player tables, playerID is the unique code for each player and can be used to make joins.

The Lahman database is also available as an R package.

# install Lahman package
install.packages("Lahman")

# load the Lahman package
library(Lahman)

# view, for example, the Master table
# (more recent releases of the package name this table People)
View(Master)

Although you will not see them as objects in your Environment tab in RStudio, each of the tables in the database is now available as a data frame. If you wish, you can convert them into objects in your local environment with the following code:

master <- Master

You can use the dplyr package to join, filter and aggregate the data as required.
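As a minimal sketch of that join, filter and aggregate workflow, the code below uses two small made-up data frames standing in for a player table and a batting table. The column names playerID, yearID, HR and nameLast follow the Lahman naming, but the values and the toy table names (people_toy, batting_toy) are assumptions for illustration only.

```r
library(dplyr)

# Toy stand-ins for two Lahman tables; the values are invented for illustration
people_toy <- data.frame(
  playerID = c("p01", "p02", "p03"),
  nameLast = c("Able", "Baker", "Cole"),
  stringsAsFactors = FALSE
)
batting_toy <- data.frame(
  playerID = c("p01", "p01", "p02", "p03"),
  yearID = c(1989, 1995, 1995, 2001),
  HR = c(10, 25, 5, 40),
  stringsAsFactors = FALSE
)

# Filter to seasons from 1990 onward, total home runs per player,
# then join on playerID to attach each player's name
career_hr <- batting_toy %>%
  filter(yearID >= 1990) %>%
  group_by(playerID) %>%
  summarize(total_HR = sum(HR)) %>%
  inner_join(people_toy, by = "playerID") %>%
  arrange(desc(total_HR))

print(career_hr)
```

With the real package loaded, the same pipeline should work on the package's own tables in place of the toy data frames.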

North Atlantic storms

The file storms.csv contains data on tropical storms and hurricanes compiled by the Hurricane Research Division of the U.S. National Oceanic and Atmospheric Administration. I have processed the raw data to give the following fields:

This file contains data on storms from 1851 to 2016. However, you may wish to restrict your visualizations to storms from 1990 and later, as data on storms before the modern satellite era is less reliable.
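A minimal sketch of that restriction with dplyr follows. It assumes the data has a numeric column called year, which is a hypothetical name; check the actual field names in storms.csv. A small made-up data frame (storms_toy) stands in for the real file.

```r
library(dplyr)

# Toy stand-in for storms.csv; the column names and values are assumptions
storms_toy <- data.frame(
  name = c("A", "B", "C"),
  year = c(1975, 1990, 2005),
  stringsAsFactors = FALSE
)

# Keep only storms from 1990 and later
storms_modern <- storms_toy %>%
  filter(year >= 1990)

print(storms_modern)
```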

This collection of data is good for mapping. If you need shapefiles for context and basemaps, try Natural Earth. These shapefiles each come with a README.html file that can be opened in a web browser for more information.

Wealth and well-being of nations

In its World Development Indicators, the World Bank has a trove of data on many aspects of countries’ wealth and well-being. There are many stories to be told from this data.

You can download data for individual indicators, or read data directly into R using the WDI package. (Remember that you will need to convert data you download from the World Bank site from wide to long format; the R package will give you data in the correct long format.)
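The wide-to-long conversion can be sketched with the tidyr package, as below. The toy data frame mimics the shape of a World Bank download, with one row per country and one column per year; the country names, values and the names wide_toy and long_toy are invented for illustration.

```r
library(tidyr)

# Toy wide-format table: one column per year (values are made up)
wide_toy <- data.frame(
  country = c("Aland", "Borduria"),
  `2014` = c(1.0, 2.0),
  `2015` = c(1.5, 2.5),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Gather the year columns into year/value pairs: one row per country and year
long_toy <- pivot_longer(
  wide_toy,
  cols = -country,
  names_to = "year",
  values_to = "value"
)

# The year arrives as text because it came from column names
long_toy$year <- as.integer(long_toy$year)

print(long_toy)
```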

In week 5, we saw how to download Gapminder data in bulk. Its data library includes some measures not available from the World Bank, so if you cannot find the data you want among the World Bank Indicators, try searching for it at Gapminder’s data download page.

Global Terrorism Database

Maintained by the National Consortium for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland in College Park, the Global Terrorism Database contains information on more than 170,000 terrorist attacks from 1970 to 2016. It is a rich source of information on terrorist groups across the globe, and the attacks they are responsible for.

You can download the data from here, selecting the Download full GTD dataset option. An extensive codebook details all of the fields in the data.

The data is provided as a series of spreadsheets in .xlsx format. I suggest that you import this data into OpenRefine before processing any further, and create a new field giving the date of each event in standard YYYY-MM-DD format. This can be derived from the eventid field. I can help with this.

You can then export as a CSV for analysis, visualization, and mapping.
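If you would rather build the date field in R, here is a minimal sketch. It assumes the first eight digits of eventid encode the date as YYYYMMDD, and that eventid has been read in as text rather than as a number; the toy values and the name events_toy are made up. Events whose month or day is recorded as 00 do not form a valid date and come out as NA.

```r
# Toy stand-in for the GTD eventid field, read as text (values are invented)
events_toy <- data.frame(
  eventid = c("197001010001", "201612310042", "199300000003"),
  stringsAsFactors = FALSE
)

# Take the first eight characters (assumed YYYYMMDD) and parse them as a date;
# as.Date stores and prints dates in YYYY-MM-DD format
events_toy$date <- as.Date(substr(events_toy$eventid, 1, 8), format = "%Y%m%d")

print(events_toy)
```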

Do read the Terms of Use and instructions for citing the source of the GTD data.

California traffic accidents

The Transportation Injury Mapping System details injury and fatal traffic accidents for the whole of California. The data comes from the California Highway Patrol’s Statewide Integrated Traffic Records System and was then geocoded for mapping by UC Berkeley’s Safe Transportation Research & Education Center.

You will need to create an account. I can help if you have problems querying and downloading data.

The codebook explains the fields in these tables, and how they should be joined.