Click on the title links to download the data. Please contact me before the class in which data will be used if you have any problems downloading!
mlb_salaries_2014.csv
Salaries of players in Major League Baseball at the start of the 2014 season, from the Lahman Baseball Database.
disease_democ.csv
Data illustrating a controversial theory suggesting that the emergence of democratic political systems has depended largely on nations having low rates of infectious disease, from the Global Infectious Diseases and Epidemiology Network and Democratization: A Comparative Analysis of 170 Countries.
gdp_pc.csv
World Bank data on 2014 Gross Domestic Product (GDP) per capita for the world’s nations, in current international dollars, corrected for purchasing power in different territories.
nations.csv
Data from the World Bank Indicators portal, which is an incredibly rich resource. Contains the following fields:
iso2c
iso3c
Two- and three-letter codes for each country, assigned by the International Organization for Standardization.
country
Country name.
year
population
Estimated total population at mid-year, including all residents apart from refugees.
gdp_percap
Gross Domestic Product per capita in current international dollars, corrected for purchasing power in different territories.
life_expect
Life expectancy at birth, in years.
birth_rate
Live births during the year per 1,000 people, based on mid-year population estimate.
neonat_mortal_rate
Neonatal mortality rate: babies dying before reaching 28 days of age, per 1,000 live births in a given year.
region
income
World Bank regions and income groups, explained here.
index.html
index2.html
Two simple webpages, which we will edit and publish on GitHub.
css
fonts
js
Folders with files to run the Bootstrap web framework.
oil_production.csv
Data on oil production by world region from 2000 to 2014, in thousands of barrels per day, from the U.S. Energy Information Administration.
ucb_stanford_2014.csv
Data on federal government grants to UC Berkeley and Stanford University in 2014, downloaded from USASpending.gov.
urls.xls
A spreadsheet that we’ll use in webscraping.
nations.csv
As for week 3, but lacking data on life expectancy, to allow you to practice data processing and a data join.
Data used in reporting this story, which revealed that some of the doctors paid as “experts” by the drug company Pfizer had troubling disciplinary records:
pfizer.csv
Payments made by Pfizer to doctors across the United States in the second half of 2009. Contains the following variables:
org_indiv
Full name of the doctor, or their organization.
first_plus
Doctor’s first and middle names.
first_name
last_name
First and last names.
city
state
City and state.
category of payment
Type of payment. Options include Expert-led Forums, in which doctors lecture their peers on using Pfizer’s drugs, and Professional Advising.
cash
Value of payments made in cash.
other
Value of payments made in kind, for example the purchase of meals.
total
Total value of payment, whether in cash or in kind.
fda.csv
Data on warning letters sent to doctors by the U.S. Food and Drug Administration, because of problems in the way in which they ran clinical trials testing experimental treatments. Contains the following variables:
name_last
name_first
name_middle
Doctor’s last, first, and middle names.
issued
Date letter was sent.
office
Office within the FDA that sent the letter.
disease_democ.csv
Data illustrating a controversial theory suggesting that the emergence of democratic political systems has depended largely on nations having low rates of infectious disease, from the Global Infectious Diseases and Epidemiology Network and Democratization: A Comparative Analysis of 170 Countries, as used in week 1.
food_stamps.csv
U.S. Department of Agriculture data on the number of participants, in millions, and costs, in $ billions, of the federal Supplemental Nutrition Assistance Program from 1969 to 2015.
kindergarten.csv
Data from the California Department of Public Health, documenting enrollment and the number of children with complete immunizations at entry into kindergartens in California from 2001 to 2015. Contains the following variables:
district
School district.
sch_code
Unique identifying code for each school.
pub_priv
Whether school is public or private.
school
School name.
enrollment
Number of children enrolled.
complete
Number of children with complete immunizations.
start_year
Year of entry (for the 2015-2016 school year, for example, this would be 2015).
nations.csv
Data from the World Bank World Development Indicators portal on population, GDP per capita, life expectancy, birth rate, neonatal mortality rate, region and income group for the world’s nations, from 1990 onwards, as used in week 3.
sf_test_addresses.tsv
Text file with a list of 100 addresses in San Francisco, for the geocoding exercise.
refine_geocoder.json
JSON file to geocode using Open Refine.
sf_test_addresses_short.tsv
The first 10 addresses from the previous file.
ca_healthcare
ca_counties_medicare
Shapefile with data on Medicare reimbursement per enrollee by California county in 2012, from the Dartmouth Atlas of Healthcare.
healthcare_facilities.csv
Locations and other data for hospitals and other healthcare facilities in California, from the California Department of Public Health. I have geocoded those facilities that lacked latitude and longitude coordinates in the raw data.
gdp_pc
gdp_pc.csv
gdp_pc.csvt
CSV file with World Bank data on GDP per capita for the world’s nations in 2014, plus an ancillary file that tells QGIS the data type of each field.
ne_50m_admin_0_countries
Natural Earth shapefile with boundary data for the world’s nations.
seismic_risk
U.S. Geological Survey shapefile detailing the risk of experiencing a major earthquake across the continental United States.
sf_test_addresses
Shapefile derived from the addresses we geocoded in week 9.
ca_healthcare
ca_counties_medicare.zip
Zipped shapefile with data on Medicare reimbursement per enrollee by California county in 2012, as used in week 10.
healthcare_facilities.csv
Locations and other data for hospitals and other healthcare facilities in California, as used in week 10.
seismic_risk.zip
Zipped shapefile detailing the risk of experiencing a major earthquake across the continental United States, as used in week 10.
sf
sf_test_addresses.csv
Sample of 100 addresses in San Francisco, geocoded in week 9.
sfpd_stations.zip
Zipped shapefile with locations of San Francisco police stations.
nations.csv
Data from the World Bank Indicators portal, as used in week 3 and subsequently.
food_stamps.csv
U.S. Department of Agriculture data on the number of participants, in millions, and costs, in $ billions, of the federal Supplemental Nutrition Assistance Program from 1969 to 2015, as used in week 8.
seismic_risk_clip
Folder containing U.S. Geological Survey shapefile, detailing the risk of experiencing a major earthquake, clipped to the boundaries of the continental United States.
nations.csv
Data from the World Bank Indicators portal, as used in week 3 and subsequently.
warming.csv
NASA data on the annual average global temperature, from 1880 to 2015, compared with the average from 1951-1980.
maps
Folder containing individual frames, one per year from 1880 to 2015, each showing annual average temperatures across the globe, again compared with the average from 1951-1980.
charts
combined
Empty folders into which we will save individual frames from which to make a video animation.
These datasets are suggestions, in which there are definitely stories to be found and visualized. But you are encouraged to work on other datasets.
Lahman’s Baseball Database contains a wealth of data on players, managers and teams from 1871 to 2014. Download the data in a series of CSV files from here.
This file documents the tables and fields, and how the tables should be joined together. For the player tables, playerID is the unique code for each player that can be used to make joins. When loading this data into SQLite, this field can be used as a primary key for the tables in which it appears. For tables relating to teams or managers, you should create a new primary key, as we did in week 5 for the FDA data.
The Lahman database is also available as an R package.
# install Lahman package
install.packages("Lahman")
# load the Lahman package
library(Lahman)
# view, for example, the Master table
View(Master)
Although you will not see them as objects in your Environment tab in RStudio, each of the tables in the database is now available as a data frame. If you wish, you can convert them into objects in your local environment with something like the following code:
master <- Master
You can use the dplyr package to join, filter and aggregate the data as required.
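As a rough illustration of that join, filter and aggregate pattern, here is a minimal sketch. It does not touch the Lahman tables themselves: the two small data frames are made-up stand-ins, and the column names `nameLast`, `yearID` and `salary` are assumptions about how a player table and a salaries table might be laid out. Only `playerID` is taken from the description above.

```r
library(dplyr)

# Toy stand-ins for two player tables; every value here is invented
players <- data.frame(
  playerID = c("p01", "p02", "p03"),
  nameLast = c("Alpha", "Bravo", "Charlie"),
  stringsAsFactors = FALSE
)
salaries <- data.frame(
  playerID = c("p01", "p01", "p02", "p03"),
  yearID   = c(2013, 2014, 2014, 2014),
  salary   = c(500000, 750000, 1200000, 900000),
  stringsAsFactors = FALSE
)

# Join on playerID, keep one season, then total the salary for each player
salary_2014 <- salaries %>%
  inner_join(players, by = "playerID") %>%
  filter(yearID == 2014) %>%
  group_by(playerID, nameLast) %>%
  summarize(total_salary = sum(salary), .groups = "drop") %>%
  arrange(desc(total_salary))

print(salary_2014)
```

With the real data, you would swap the toy data frames for the package tables and check their actual column names before joining.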
The file storms.csv contains data on tropical storms and hurricanes compiled by the Hurricane Research Division of the U.S. National Oceanic and Atmospheric Administration. I have processed the raw data to give the following fields:
name
Official name for each storm; unnamed storms are listed as Unnamed and also numbered.
year
month
day
hour
minute
Date and time fields for each observation. For recent storms, observations are recorded every six hours.
timestamp
Date and time fields combined into a full timestamp for each observation in standard YYYY-MM-DD HH:MM format.
record_ident
The entry L indicates the time at which a storm made landfall, defined as the center of the system crossing a coastline, recorded from 1991 onwards. Other entries are explained in the file newhurdat-format.pdf.
status
Options include HU for hurricane, TS for tropical storm and TD for tropical depression. Other entries are explained in newhurdat-format.pdf.
latitude
longitude
Geographic coordinates for the center of the system at each observation.
max_wind_kts
max_wind_kph
max_wind_mph
Maximum sustained wind for each observation.
min_press
Minimum air pressure at the center of the system for each observation.
This file contains data on storms from 1851 to 2015. However, you may wish to restrict your visualizations to storms from 1990 and later, as data on storms before the modern satellite era is less reliable.
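Restricting the data to the more reliable recent period could look something like the sketch below. The data frame is a toy with invented rows that borrows a few of the field names listed above. The second filter, which keeps only landfall records, is just one example of a further cut you might make.

```r
library(dplyr)

# Toy rows using some of the storms.csv field names; all values are invented
storms <- data.frame(
  name = c("Unnamed", "Example A", "Example A", "Example B"),
  year = c(1851, 1992, 1992, 2005),
  record_ident = c("", "", "L", "L"),
  max_wind_mph = c(80, 120, 165, 125),
  stringsAsFactors = FALSE
)

# Keep only observations from 1990 and later
recent <- storms %>%
  filter(year >= 1990)

# Of those, keep only the observations flagged as landfall
landfalls <- recent %>%
  filter(record_ident == "L")

print(landfalls)
```

To run this on the real file, replace the toy data frame with something like `storms <- read.csv("storms.csv", stringsAsFactors = FALSE)`.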
This collection of data is good for mapping. If you need shapefiles for context and basemaps, try Natural Earth. These shapefiles each come with a README.html file that can be opened in a web browser for more information.
In its World Development Indicators, the World Bank has a trove of data on many aspects of countries’ wealth and well-being. There are many stories to be told from this data.
You can download data for individual indicators, or read data directly into R using the WDI package. (Remember that you will need to convert data you download from the World Bank site from wide to long format; the R package will give you data in the correct long format.)
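The wide-to-long conversion can be sketched as follows, assuming the downloaded file has one row per country and one column per year. The toy data frame, its column names and its values are all invented for illustration, and `pivot_longer` from the tidyr package is just one way to do the reshaping.

```r
library(tidyr)

# Toy wide-format table: one row per country, one column per year (values invented)
wide <- data.frame(
  country = c("Country A", "Country B"),
  iso3c   = c("AAA", "BBB"),
  `2013`  = c(1.5, 2.5),
  `2014`  = c(1.6, 2.4),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Stack the year columns into a year/value pair, giving one row per country per year
long <- pivot_longer(
  wide,
  cols = -c(country, iso3c),
  names_to = "year",
  values_to = "value"
)

# The year comes out as text because it started life as a column name
long$year <- as.integer(long$year)

print(long)
```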
In week 5, we saw how to download Gapminder data in bulk. Its data library includes some measures not available from the World Bank, so if you cannot find the data you want among the World Bank Indicators, try searching for it at Gapminder’s data download page.
Maintained by the National Consortium for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland in College Park, the Global Terrorism Database contains information on more than 150,000 terrorist attacks from 1970 to 2015. It is a rich source of information on terrorist groups across the globe, and the attacks they are responsible for.
You can download the data from here, selecting the Download full GTD dataset option. An extensive codebook details all of the fields in the data.
The data is provided as a series of spreadsheets in .xlsx format. I suggest that you import this data into Open Refine before processing any further, and create a new field giving the date of each event in standard YYYY-MM-DD format. This can be done from the eventid field. I can help with this.
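If you would rather do this step in R than in Open Refine, the following is one possible sketch. It rests on an assumption you should check against the codebook: that the first eight characters of eventid give the date as YYYYMMDD. The eventid values below are invented for illustration.

```r
# Invented eventid values, kept as text so that no digits are lost or reformatted
gtd <- data.frame(
  eventid = c("197001010002", "201512310001", "197000000001"),
  stringsAsFactors = FALSE
)

# Assumption: characters 1 to 8 of eventid encode the date as YYYYMMDD
gtd$date <- as.Date(substr(gtd$eventid, 1, 8), format = "%Y%m%d")

# Dates print in YYYY-MM-DD form; values that cannot be parsed
# (for example a month or day of 00) come out as NA
print(gtd)
```

Any rows that end up with a missing date would need to be handled separately, for example by keeping the year only.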
You can then export as a CSV for analysis, visualization, and mapping.
Do take care to read the Terms of Use and instructions for citing the source of the GTD data.
The Transportation Injury Mapping System details injury and fatal traffic accidents for the whole of California. The data comes from the California Highway Patrol’s Statewide Integrated Traffic Records System and was then geocoded for mapping by UC Berkeley’s Safe Transportation Research & Education Center.
You will need to create an account. I can help if you have problems querying and downloading data.
The codebook explains the fields in these tables, and how they should be joined.