Principles of mapping

Maps are staples of journalism: The basic need to show the “where” of stories means that they are used frequently. Unlike some chart types, such as bubble charts or network diagrams, which may require some explanation to the uninitiated, maps need no introduction — we are all familiar with using them to navigate. Indeed, thanks to the smartphone revolution, many of us now carry sophisticated interactive mapping apps everywhere we go.

Maps can also be used to visualize data, which will be our main focus in the coming classes as we process geodata and learn how to display it on both static and online maps. Before we get into the practical details of making maps, we will cover some basic principles of mapping, and good practice in mapmaking.

Latitude and longitude

Consider the following concepts in relation to this image:

(Source: Google Earth)

When plotting points on a map, you will usually need to know their latitude and longitude. Together, latitude and longitude form a geographic coordinate system that enables every location on the Earth’s surface to be defined by two numbers. Latitudes are angular distances, given in degrees from 0 to 90, which define how far North or South a point is from the Equator. Longitudes are angular distances, given in degrees from 0 to 180, which define how far East or West a point is from a line running from the North to the South Pole through the Royal Observatory in Greenwich, London. Lines of equal latitude are known as parallels, while lines of equal longitude are called meridians.

Degrees of latitude or longitude can be subdivided into minutes and seconds (sometimes called the DMS system), or can be given as decimals. There are 60 minutes in a degree, and 60 seconds in a minute; the symbols for degrees, minutes and seconds are: °, ' and ". In decimal format, points North of the Equator are given as positive values, while those South of the Equator are negative. Similarly, for longitude, points to the East of the Prime Meridian that runs through Greenwich are positive, while those to the West are negative.

To understand how this works, consider the location of the UC Berkeley Graduate School of Journalism. Its latitude and longitude coordinates are 37.8749998 and -122.2596684, which can also be written as 37° 52' 30.0" N, 122° 15' 34.8" W. If you were to draw a line from the center of the Earth to the J-School, and then draw another to the Equator at the same longitude, the angle between them would be 37.8749998 degrees. If you were to take a slice of the Earth at this latitude, parallel to the Equator, and draw two lines from the center of this slice, one to the Prime Meridian, the other to the J-School, the angle between them would be -122.2596684 degrees.
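The conversion between the two formats is simple arithmetic. As a minimal sketch in Python (the function name dms_to_decimal is my own, not part of any library), this converts the J-School's DMS coordinates to decimal degrees:

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees, minutes and seconds to decimal degrees.

    hemisphere is one of "N", "S", "E" or "W"; South and West are negative.
    """
    decimal = degrees + minutes / 60 + seconds / 3600
    return -decimal if hemisphere in ("S", "W") else decimal


latitude = dms_to_decimal(37, 52, 30.0, "N")
longitude = dms_to_decimal(122, 15, 34.8, "W")
print(round(latitude, 4), round(longitude, 4))
```

The results agree with the decimal values given above to about four decimal places, which is the precision the rounded seconds allow.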

(Various online services support conversion from DMS to decimal latitudes and longitudes, and vice versa; the two links given are free to use, and will process many thousands of records at a time.)

There are 360 degrees in a full circle, which explains why longitude goes from 0 to 180 degrees both East and West. Similarly, moving from the North to the South Pole means travelling half way round the Earth’s circumference, which is why latitude goes from 0 to 90 degrees both North and South.

Two points separated by one degree of latitude, lying at the same longitude, will always be separated by about 69 miles, because meridians are always the same size, representing half the circumference of the Earth. However, parallels decrease in size as we move nearer to the poles. At the Equator, one degree of longitude again corresponds to a linear distance across the Earth’s surface of about 69 miles. But at 45 degrees latitude North or South, you would need to travel just 49 miles to cover one degree of longitude.
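The shrinking of parallels follows the cosine of the latitude. Here is a rough sketch, assuming a perfectly spherical Earth and the approximate figure of 69 miles per degree at the Equator; the function name is hypothetical:

```python
import math

MILES_PER_DEGREE_AT_EQUATOR = 69  # approximate figure used in the text


def miles_per_degree_longitude(latitude):
    """Approximate length of one degree of longitude at a given latitude."""
    return MILES_PER_DEGREE_AT_EQUATOR * math.cos(math.radians(latitude))


for lat in (0, 45, 60):
    print(lat, round(miles_per_degree_longitude(lat), 1))
```

At 45 degrees this gives a little under 49 miles, matching the figure above; at 60 degrees a degree of longitude covers only half the distance it does at the Equator.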

Geocoding

Often when starting a mapping project, you may need to convert a series of addresses into latitudes and longitudes so they can be placed on your map. This is called geocoding.

There are several geocoding APIs, which can be accessed in various ways. The number of requests allowed per day and the terms of use vary from service to service: Google’s free service, for instance, allows each user to geocode 2,500 addresses per day, and specifies that the resulting coordinates may only be used to make a Google Map.

Because of this restriction, we will instead use the services offered by Microsoft’s Bing Maps, and MapQuest (which is based on OpenStreetMap’s Nominatim service), to geocode this sample of addresses in San Francisco.

These geocoding APIs can both be accessed from Open Refine. Here is how to geocode addresses from Open Refine using the Bing API:

You will need a Bing Maps API key. To obtain that, follow the steps here. If you don’t already have a Microsoft Account, you will first need to create one.

Create a new Open Refine project by importing a text file containing complete addresses in one column, with the heading address. Our test data is already in this format; use sf_test_addresses_short.txt for this exercise.

From the address column, select Edit column>Add column by fetching URLs..., call the column bing_json and use the following expression:

"http://dev.virtualearth.net/REST/v1/Locations?q=" + escape(value, "url") + "&key=BingMapsKey"

Note that you will have to enter your own Bing API key in place of BingMapsKey. Also, set the Throttle delay to 500 milliseconds for faster processing. This expression constructs a URL that will query the Bing geocoding API and return data for the address in question in JSON format.

From the bing_json column, select Edit column>Add column based on this column..., call the column bing_lat_lon and use this expression to extract the latitude and longitude from the JSON returned by the API:

with(value.parseJson().resourceSets[0].resources[0].point.coordinates, pair, pair[0] +", " + pair[1])

Split the bing_lat_lon column into two columns by selecting Edit column>Split into several columns, then rename these columns bing_latitude and bing_longitude.

From the bing_json column, select Edit column>Add column based on this column..., call the column bing_confidence and use this expression to extract the Bing API’s confidence in the accuracy of its geocoding:

with(value.parseJson().resourceSets[0].resources[0].confidence, v, v)

From the bing_json column, select Edit column>Add column based on this column..., call the column bing_type and use this expression to extract the type of place that the Bing API has geocoded:

with(value.parseJson().resourceSets[0].resources[0].entityType, v, v)

For a full address, this should return Address when the geocoding has been successful.

Finally, delete the bing_json column by selecting Edit column>Remove column.
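If you want to see the logic of these steps outside Open Refine, here is a minimal Python sketch. The function names are my own, and the sample response is made up and trimmed to the fields used above, so treat it as an illustration of the JSON paths rather than real API output. It makes no network requests.

```python
import json
from urllib.parse import quote_plus

BING_URL = "http://dev.virtualearth.net/REST/v1/Locations?q="


def build_bing_url(address, key):
    """Build a request URL equivalent to the Open Refine expression."""
    return BING_URL + quote_plus(address) + "&key=" + key


def parse_bing_json(text):
    """Pull out the fields extracted in the steps above."""
    resource = json.loads(text)["resourceSets"][0]["resources"][0]
    latitude, longitude = resource["point"]["coordinates"]
    return {
        "bing_latitude": latitude,
        "bing_longitude": longitude,
        "bing_confidence": resource["confidence"],
        "bing_type": resource["entityType"],
    }


# Made-up response, trimmed to the fields used above; not real API output.
sample = json.dumps({"resourceSets": [{"resources": [{
    "point": {"coordinates": [37.753067, -122.397484]},
    "confidence": "High",
    "entityType": "Address",
}]}]})

print(build_bing_url("1800 25th St, San Francisco, CA, 94107", "BingMapsKey"))
print(parse_bing_json(sample))
```

As in Open Refine, BingMapsKey is a placeholder for your own key.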

As we saw in week 5, it is now possible to extract JSON code that will allow you to repeat these steps on any data in the same format. Below I have done that to provide code that will geocode addresses using both the Bing and MapQuest services:

[
  {
    "op": "core/column-addition-by-fetching-urls",
    "description": "Create column mapquest_json at index 1 by fetching URLs based on column address using expression grel:\"http://open.mapquestapi.com/nominatim/v1/search?format=json&limit=1&q=\" + escape(value, \"url\")",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "mapquest_json",
    "columnInsertIndex": 1,
    "baseColumnName": "address",
    "urlExpression": "grel:\"http://open.mapquestapi.com/nominatim/v1/search?format=json&limit=1&q=\" + escape(value, \"url\")",
    "onError": "set-to-blank",
    "delay": 500
  },
  {
    "op": "core/column-addition",
    "description": "Create column mapquest_longitude at index 2 based on column mapquest_json using expression grel:with(value.parseJson()[0].lon,v,v)",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "mapquest_longitude",
    "columnInsertIndex": 2,
    "baseColumnName": "mapquest_json",
    "expression": "grel:with(value.parseJson()[0].lon,v,v)",
    "onError": "set-to-blank"
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column mapquest_longitude using expression value.toNumber()",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "mapquest_longitude",
    "expression": "value.toNumber()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/column-addition",
    "description": "Create column mapquest_latitude at index 2 based on column mapquest_json using expression grel:with(value.parseJson()[0].lat,v,v)",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "mapquest_latitude",
    "columnInsertIndex": 2,
    "baseColumnName": "mapquest_json",
    "expression": "grel:with(value.parseJson()[0].lat,v,v)",
    "onError": "set-to-blank"
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column mapquest_latitude using expression value.toNumber()",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "mapquest_latitude",
    "expression": "value.toNumber()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/column-addition",
    "description": "Create column mapquest_class at index 2 based on column mapquest_json using expression grel:with(value.parseJson()[0].class,v,v)",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "mapquest_class",
    "columnInsertIndex": 2,
    "baseColumnName": "mapquest_json",
    "expression": "grel:with(value.parseJson()[0].class,v,v)",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-addition",
    "description": "Create column mapquest_type at index 2 based on column mapquest_json using expression grel:with(value.parseJson()[0].type,v,v)",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "mapquest_type",
    "columnInsertIndex": 2,
    "baseColumnName": "mapquest_json",
    "expression": "grel:with(value.parseJson()[0].type,v,v)",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-removal",
    "description": "Remove column mapquest_json",
    "columnName": "mapquest_json"
  },
  {
    "op": "core/column-addition-by-fetching-urls",
    "description": "Create column bing_json at index 1 by fetching URLs based on column address using expression grel:\"http://dev.virtualearth.net/REST/v1/Locations?q=\" + escape(value, \"url\") + \"&key=BingMapsKey\"",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "bing_json",
    "columnInsertIndex": 1,
    "baseColumnName": "address",
    "urlExpression": "grel:\"http://dev.virtualearth.net/REST/v1/Locations?q=\" + escape(value, \"url\") + \"&key=BingMapsKey\"",
    "onError": "set-to-blank",
    "delay": 500
  },
  {
    "op": "core/column-addition",
    "description": "Create column bing_lat_lon at index 2 based on column bing_json using expression grel:with(value.parseJson().resourceSets[0].resources[0].point.coordinates, pair, pair[0] +\", \" + pair[1])",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "bing_lat_lon",
    "columnInsertIndex": 2,
    "baseColumnName": "bing_json",
    "expression": "grel:with(value.parseJson().resourceSets[0].resources[0].point.coordinates, pair, pair[0] +\", \" + pair[1])",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-split",
    "description": "Split column bing_lat_lon by separator",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "bing_lat_lon",
    "guessCellType": true,
    "removeOriginalColumn": true,
    "mode": "separator",
    "separator": ",",
    "regex": false,
    "maxColumns": 0
  },
  {
    "op": "core/column-rename",
    "description": "Rename column bing_lat_lon 1 to bing_latitude",
    "oldColumnName": "bing_lat_lon 1",
    "newColumnName": "bing_latitude"
  },
  {
    "op": "core/column-rename",
    "description": "Rename column bing_lat_lon 2 to bing_longitude",
    "oldColumnName": "bing_lat_lon 2",
    "newColumnName": "bing_longitude"
  },
  {
    "op": "core/column-addition",
    "description": "Create column bing_confidence at index 2 based on column bing_json using expression grel:with(value.parseJson().resourceSets[0].resources[0].confidence, v, v)",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "bing_confidence",
    "columnInsertIndex": 2,
    "baseColumnName": "bing_json",
    "expression": "grel:with(value.parseJson().resourceSets[0].resources[0].confidence, v, v)",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-addition",
    "description": "Create column bing_type at index 2 based on column bing_json using expression grel:with(value.parseJson().resourceSets[0].resources[0].entityType, v, v)",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "bing_type",
    "columnInsertIndex": 2,
    "baseColumnName": "bing_json",
    "expression": "grel:with(value.parseJson().resourceSets[0].resources[0].entityType, v, v)",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-removal",
    "description": "Remove column bing_json",
    "columnName": "bing_json"
  }
]

Again, you will need to replace BingMapsKey in the above code with your own Bing API key. In class we will use this code to geocode the larger sf_test_addresses.txt dataset.

For the MapQuest results, the mapquest_class column provides information on the accuracy of geocoding: place, amenity or shop indicate geocoding to a precise address; highway indicates geocoding to a street only. The mapquest_type column provides further information about the address or street concerned.
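This rule for reading mapquest_class can be written as a small function. The following is a sketch only: the function name, the labels it returns and the sample responses are my own inventions, with the samples trimmed to the fields that the code above extracts.

```python
import json

# Classes that the text above treats as geocoded to a precise address.
PRECISE_CLASSES = {"place", "amenity", "shop"}


def mapquest_precision(response_text):
    """Classify a Nominatim-style response by the class of its first match."""
    matches = json.loads(response_text)
    if not matches:
        return "failed"
    match_class = matches[0].get("class")
    if match_class in PRECISE_CLASSES:
        return "address"
    if match_class == "highway":
        return "street"
    return "other"


# Made-up responses for illustration; not real API output.
precise = '[{"lat": "37.75", "lon": "-122.39", "class": "place", "type": "house"}]'
street_only = '[{"lat": "37.75", "lon": "-122.39", "class": "highway", "type": "residential"}]'

print(mapquest_precision(precise), mapquest_precision(street_only))
```

Records that come back as "street", "other" or "failed" are the ones to check by hand.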

The Bing and MapQuest services can also be accessed through the GPS Visualizer geocoder. To geocode addresses in bulk from this site using MapQuest, you will need to obtain a MapQuest AppKey, following these instructions.

The following steps detail how to geocode using the MapQuest service at this site. In the form shown below, select MapQuest Open as the Source, and enter your AppKey where shown:

GPS Visualizer’s geocoder will work from a simple list of addresses, or from tabular data, with different aspects of the address (street address, city, state, zipcode and so on) in separate fields. Set the Type of data control as appropriate. If you are working with tabular data, with the address divided into several fields, adjust the Field separator in output control to reflect the separator in your data. I strongly recommend using the raw list, 1 address per line option, which in my experience gives much more reliable results.

Checking Include source+precision info in output will ensure that the output includes notification of the accuracy of the geocoding for each record: address indicates precise geocoding to a particular address.

Paste the address data from the sf_test_addresses.txt file into the Input: box, click Start geocoding and the results will appear in the Results as text: box. When all the addresses have been processed, copy and paste the results into a text file and save. If you have a large number of addresses to geocode, I recommend breaking them down into batches of 1,000 or fewer.
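Splitting a long list into batches is easy to script. Here is one way to do it in Python; the function name is hypothetical and the addresses are placeholders standing in for lines read from your own file:

```python
def batches(addresses, size=1000):
    """Yield successive lists of at most `size` addresses."""
    for start in range(0, len(addresses), size):
        yield addresses[start:start + size]


# Placeholder data standing in for the lines of an address file.
addresses = [f"address {number}" for number in range(2500)]

for batch in batches(addresses):
    print(len(batch))
```

Each batch can then be joined with newlines and pasted into the Input: box in turn.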

Whichever service you use to geocode addresses, provide appropriate acknowledgement. MapQuest’s terms and conditions require that you include this acknowledgement on any website or app using data geocoded through its service:

<p>Geocoding Courtesy of <a href="http://www.mapquest.com/" target="_blank">MapQuest</a> <img src="http://developer.mapquest.com/content/osm/mq_logo.png"></p>

Data should also be sourced to OpenStreetMap; see here for instructions on how to credit appropriately.

Here is an HTML acknowledgment to Bing in the same style as above:

<p>Geocoding Courtesy of <a href="http://www.microsoft.com/maps/product/terms.html" target="_blank">Bing</a> <img src="http://www.microsoft.com/maps/images/branding/Bing%20logo%20gray_50px-19px.png"></p>

Be aware that different geocoders will give slightly different results. In my experience, MapQuest tends to locate addresses to sidewalks or building fronts, while Bing tends to locate to the middle of the building concerned. Bing’s failure rate also appears to be lower. You may need to manually record the coordinates of addresses that fail, or which do not geocode to a precise address. In these cases, try searching for the address on Bing Maps or Google Maps. For the latter, note the latitude and longitude for the placemarker that appears, shown here after the @ symbol:

https://www.google.com/maps/place/1875+Cesar+Chavez+St,+San+Francisco,+CA+94107/@37.7497825,-122.395751,17z/data=!3m1!4b1!4m2!3m1!1s0x808f7fae0a545527:0x564005c073e75262
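If you have more than a handful of these URLs, the coordinates can be pulled out with a regular expression. This sketch assumes URLs of the form shown above, with the latitude and longitude following the @ symbol; the function name is my own.

```python
import re

# Matches "@latitude,longitude" in a Google Maps URL of the form shown above.
AT_COORDINATES = re.compile(r"@(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)")


def coordinates_from_url(url):
    """Return (latitude, longitude) as floats, or None if no match."""
    match = AT_COORDINATES.search(url)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


url = ("https://www.google.com/maps/place/1875+Cesar+Chavez+St,"
       "+San+Francisco,+CA+94107/@37.7497825,-122.395751,17z")
print(coordinates_from_url(url))
```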

Other options for geocoding include Texas A&M University’s GeoServices, which will geocode from an uploaded text file, emailing you when the results are ready for download. First sign up for a free account, then upload your data.

Map projections

Because the Earth is roughly spherical, any map other than a globe is a distortion of reality. Just as you can’t peel an orange and arrange the skin as a perfect rectangle, circle, or ellipse, it is impossible to plot the Earth’s surface in two dimensions and accurately represent distances, areas, shapes and directions.

Maps can be made simply by plotting longitude on the X axis and latitude on the Y axis on the same scale, sometimes called an Equirectangular projection:

(Source: Wikimedia Commons)
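In this projection, a point's position on the page is just a linear rescaling of its coordinates, with longitude setting the horizontal position and latitude the vertical. A minimal sketch, assuming a 720 by 360 pixel image with the origin at top left (both the function name and the image size are arbitrary choices of mine):

```python
def equirectangular(longitude, latitude, width=720, height=360):
    """Map longitude and latitude in degrees to pixel x, y on an image."""
    x = (longitude + 180) / 360 * width
    y = (90 - latitude) / 180 * height  # pixel y grows downward
    return x, y


# The J-School, using the coordinates given earlier.
print(equirectangular(-122.2596684, 37.8749998))
```

Because the image is twice as wide as it is tall, one degree covers the same number of pixels in both directions.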

Most maps are drawn according to a more sophisticated projection system, however. There are many different systems, each of which has advantages and drawbacks. Some projections are optimized to minimize the distortion of area; others aim to preserve shape or distance; yet others keep directions constant.

Google and most other online maps use a Mercator projection, which was originally designed for navigation at sea. The main strength of the Mercator projection is that it preserves direction, so that any straight line drawn on the map is a line of constant compass bearing. Parallels are all horizontal and meridians vertical. Preserving direction also makes it a good choice for zoomable maps used primarily for local orientation. The big drawback of this projection is that it distorts area and shape, especially at high latitudes, which makes it a poor choice for representing the entire world. Notice how the distances between parallels increase with latitude:

(Source: Wikimedia Commons)
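The widening gaps between parallels can be checked with a few lines of Python. This sketch uses the standard formula for a Mercator projection of a sphere, which is a simplification of what mapping software actually does; the function name is my own.

```python
import math


def mercator_y(latitude):
    """Vertical position of a parallel on a Mercator map, in Earth radii."""
    phi = math.radians(latitude)
    return math.log(math.tan(math.pi / 4 + phi / 2))


# Print the vertical gap covered by each 15-degree band of latitude.
previous = 0.0
for lat in range(15, 90, 15):
    y = mercator_y(lat)
    print(f"{lat - 15:2d} to {lat:2d} degrees: {y - previous:.3f}")
    previous = y
```

Each band of 15 degrees takes up more vertical space than the one before it, and the value grows without limit as the latitude approaches 90 degrees, which is why Mercator maps are cut off short of the poles.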

When mapping the continental United States, particularly when coloring or shading different areas according to the values of data, it is common to use the Albers Equal Area Conic projection, as seen in this map of drought conditions across the nation:

(Source: The New York Times)

As the name suggests, this projection minimizes distortions of area. It does not preserve direction: Notice that the border with Canada, which runs along a parallel at a latitude of 49 degrees N, is a curve, rather than a straight line.

The Albers Equal Area Conic projection is rarely used to show the entire Earth, for obvious reasons when you see the projection in global view:

(Source: Wikimedia Commons)

To minimize the distortion of area on a global map, a better choice is the Mollweide projection:

(Source: Wikimedia Commons)

The Mollweide projection is also often used for maps of the entire sky (which can be thought of as the inside of a sphere). I used it here to compare the resolution of maps of the cosmic microwave background radiation, which reveal ripples in space-time that are the remnants of conditions in the early Universe, with views of the Earth:

(Source: New Scientist)

The Mollweide projection’s main disadvantage is the distortion of shape at high latitudes and longitudes — look, for example, at Alaska on the above Mollweide maps.

Under certain circumstances, preserving distance may be the most important goal. Here, an Azimuthal Equidistant Projection is the best approach:

(Source: Wikimedia Commons)

Below, an Azimuthal Equidistant projection, centered on North Korea, is used to illustrate the locations that might lie within the range of that country’s ballistic missiles:

(Source: Jason Davies)

Here, for comparison, is a map that highlights the zone within 10,000 km of North Korea using Google Maps’ Mercator projection:

(Source: Darren Wiens)
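Whichever projection is used to draw such a zone, the underlying test is the great-circle distance from the center point. Here is a sketch using the haversine formula, assuming a spherical Earth with a radius of 6,371 km. The Pyongyang coordinates are approximate and chosen by me as an illustrative center; they are not taken from either map.

```python
import math

EARTH_RADIUS_KM = 6371  # mean radius; treats the Earth as a sphere


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates for Pyongyang, chosen as an illustrative center.
PYONGYANG = (39.02, 125.75)
J_SCHOOL = (37.8749998, -122.2596684)

distance = great_circle_km(*PYONGYANG, *J_SCHOOL)
print(round(distance), distance <= 10000)
```

On an Azimuthal Equidistant map centered on the first point, this distance is proportional to the straight-line distance on the page, which is why the zone appears there as a simple circle.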

As the North Korean Azimuthal Equidistant and US drought Albers Equal Area Conic maps show, projections can be centered on any point on the Earth — they do not have to be centered on the intersection between the Equator and the Prime Meridian, which is the most common view for a global map.

Distortions of shape, area, distance and direction are most obvious when representing the entire globe. Under these circumstances, mapmakers often adopt a compromise projection in which distance, area, shape, and direction are all distorted, but to a minimal extent. An example is the Robinson projection:

(Source: Wikimedia Commons)

This was the projection I used for the global GDP per capita maps we saw in week 1:

In addition to a projection, a map also has a datum, which refers to a mathematical model accounting for the shape of the Earth — which is not a perfect sphere. Under most circumstances, however, you will not need to worry about this.

Putting data onto maps

Scaled circles vs. choropleth maps

Data can be put onto maps in various ways. When continuous variables are plotted to points, one common approach is to use circles centered on each point, sized according to the data values — as we did in week 3 when mapping Berkeley traffic accidents. Here is another example of this approach, used to show fatalities caused by tornadoes:

(Source: The New York Times)
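When sizing circles, it is the area of each circle, not its radius, that should be proportional to the data, which means scaling the radius by the square root of the value. A minimal sketch, in which the function name and the maximum radius of 30 pixels are arbitrary choices:

```python
import math


def circle_radius(value, max_value, max_radius=30):
    """Radius for a circle whose area is proportional to value."""
    return max_radius * math.sqrt(value / max_value)


for value in (100, 50, 25):
    print(value, round(circle_radius(value, max_value=100), 1))
```

A value one quarter of the maximum gets a circle with half the radius, and so one quarter of the area.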

When plotting data to geographical areas, the most common approach is to fill the areas with color according to the data values, like my maps of GDP per capita, or the US drought map above. These are known as choropleth maps.

Choropleth maps have an important drawback: Our eyes are drawn to expanses of color, which means that large geographic areas will attract greater attention, whether or not these are actually more important for the story you are trying to tell from the data. This becomes a particular problem with maps illustrating election results, where the significance of small geographical areas with large populations that have a major impact on the overall result gets downplayed, while sparsely populated large areas are overemphasized. Looking at this map of results from the 2012 Presidential election by county, for example, one would think at a glance that Mitt Romney was the winner:

(Source: The New York Times)

In such cases, scaled circles located to the center of geographic areas can be a better option. Here is another map from the same interactive, using that approach to visualize the size of each candidate’s lead in each county, measured by the absolute number of votes. This shows how Barack Obama won the election through his strong support in densely populated urban areas:

(Source: The New York Times)

Cartograms

Another solution to the main drawback of choropleth maps is to distort the areas plotted on the map to reflect aspects of the data, rather than geographical reality. These maps are called cartograms.

There are several algorithms for making cartograms that preserve the boundaries between geographical areas, resulting in “organically” distorted maps. Here, for example, is a rendering of the 2012 Presidential Election results by county, distorted using the algorithm described in this scientific paper:

(Source: Mark Newman)

A good tool for making maps like this is Scapetoad. However, bear in mind that the impact of these maps derives from their disconcerting perspective. That can be useful to make your audience think about an issue in a new way, which was the thinking behind these maps of mine, comparing nations measured by GDP, and by a measure called the Happy Planet Index:

(Source: New Scientist)

The cartograms we have seen so far retain common borders between areas, which constrains the accuracy with which areas can be resized according to values for a continuous variable. By relaxing this constraint, it is possible to resize areas more precisely:

(Source: Mike Bostock)

However, bear in mind that it is hard to compare the areas of non-regular shapes, so neither form of cartogram is very useful if you want your audience to be able to “read” the data in a precise way.

It is also possible to make geometric cartograms, which use the area of shapes (generally circles or squares) to make a more abstract “map.” This graphic from The New York Times, published during the 2012 Presidential election campaign, took this approach:

(Source: The New York Times)

Along similar lines, for its coverage of the 2010 U.K. General Election, the BBC represented each parliamentary constituency as a hexagon of equal area. The resulting map bore sufficient resemblance to an actual map of the United Kingdom to be meaningful, and users of the website could switch between the proportional and geographical maps to gain a more complete picture of the results by location:

(Source: BBC)

Dot density maps: Seeing the big picture by showing all (or most) of the data

Sometimes patterns emerge from geographic data when we see the spatial distribution of every single occurrence of a phenomenon. This is the thinking behind dot density maps, like this visualization of the 2010 U.S. Census, which includes a colored point for every single person:

(Source: Dustin Cable, University of Virginia)

The overall effect is rather like pointillist art. These maps work well when zoomed out, but are not so informative at high zoom levels.

A similar approach can work with aggregations of data, as in this project from The New York Times, which drew one dot for every 200 people, rather than one dot per person:

(Source: The New York Times)

Making sense of many overlapping points: Heatmaps vs. hexagonal binning

While dot density maps can be useful on occasion, sometimes you may need to tell a story based on the distribution of points where they overlap, or sit directly on top of one another. This can present a misleading picture, as much of the data will be obscured.

Under such circumstances, other approaches are necessary. Heat maps, for example, plot the density of points on a map as a gradient of colors, typically running from cool blues to warm reds. Here, to illustrate, I have used this approach to map violent events in Syria’s civil war from its start to the end of the first quarter of 2013, revealing “hotspots” of violence that were not so obvious from a map of thousands of overlapping points, seen below:

(Source: Peter Aldhous, from GDELT data)

While heatmaps are good for qualitatively identifying hotspots, they are less useful for communicating quantitative information. For this purpose, a better approach is to superimpose a hexagonal grid over the map, count the points in each cell, and use those counts to create a choropleth map, based on the grid. I used that approach on the same data to make this map of Syria’s conflict:

(Source: New Scientist)
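To show the counting step of hexagonal binning, here is a rough Python sketch. It is not the method I used for the map above: it assumes the points have already been projected to flat x, y coordinates, uses a grid of "pointy-top" hexagons indexed by axial coordinates, and runs on made-up points. All of the names are my own.

```python
import math
from collections import Counter


def hex_cell(x, y, size):
    """Return the axial (q, r) index of the pointy-top hexagon containing (x, y)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    # Round fractional cube coordinates (cx + cy + cz = 0) to the nearest hexagon.
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return rx, rz


def hexbin_counts(points, size):
    """Count the points falling in each hexagonal cell."""
    return Counter(hex_cell(x, y, size) for x, y in points)


# Made-up planar coordinates, for illustration only.
points = [(0.1, 0.2), (-0.2, 0.1), (0.0, -0.3), (1.7, 0.1), (1.8, -0.1)]
print(hexbin_counts(points, size=1))
```

The counts for each cell are the values that would then be used to color the hexagons, as in a choropleth map.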

Think before you map: Is this the best representation of the data?

Whenever you come across data that can be put on a map, it’s very tempting to do this. However, always ask yourself: Is this the best way of telling my story? From the examples above, you will see that most maps encode data either using color, or through the area of circles or other symbols. You will remember from week 2 that these two visual encodings fall fairly low down on the perceptual hierarchy of visual cues, making it relatively hard for your audience to make accurate, quantitative comparisons.

(Source: Creative Bloq)

Consider these two representations of similar data on rates of overall gun death (the map) and gun homicides (the bar chart) by U.S. state:

(Source: Rolling Stone)

(Source: Flowing Data)

The bar chart clearly allows a more detailed comparison between rates for different states. However, the map still has value because it does show that the states with the highest gun death rates occur in particular geographic locations. In cases like this, consider using a map as only one part of your graphic, perhaps as a secondary element.

Static vs. zoomable tiled maps

When designing a map-based graphic, one of the first things to decide is whether you want to display a static map view, or whether users should be able to pan and zoom the map in a dynamic way.

Web maps that can be panned and zoomed generally depend on a series of world maps of different zoom levels, which are each divided into square tiles. The tiles are loaded into the web browser as required as the user pans and zooms the map. This image demonstrates the principle:

(Source: Microsoft Developer Network)
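To make the tiling idea concrete, here is a sketch that works out which tile contains a given point at a given zoom level. It assumes the x, y numbering scheme commonly used by OpenStreetMap-style tile servers, in which zoom level 0 is a single tile and each further level doubles the number of tiles along each side; other providers name their tiles differently. The function name is my own.

```python
import math


def tile_for(latitude, longitude, zoom):
    """Return the (x, y) index of the tile containing a point at a zoom level."""
    n = 2 ** zoom  # tiles per side at this zoom level
    x = int((longitude + 180) / 360 * n)
    lat_rad = math.radians(latitude)
    y = int((1 - math.asinh(math.tan(lat_rad)) / math.pi) / 2 * n)
    return x, y


# The J-School at three zoom levels.
for zoom in (0, 5, 10):
    print(zoom, tile_for(37.8749998, -122.2596684, zoom))
```

A browser showing the area around the J-School at zoom level 10 needs only that tile and its neighbors, not the whole world map at that level.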

Zoomable data-driven web maps are often displayed over basemaps from Google, OpenStreetMap, or another provider. Because these basemaps use a Mercator projection, that projection needs to be used for the data layers also.

Geographic data formats

KML

KML, or Keyhole Markup Language, is the format used to display data on Google Earth and Google Maps. As the name suggests, it is based on XML, and has a similar structure of nested tags.

These tags can define a range of elements including point markers such as the familiar placemarks used on Google Maps, lines, and the boundaries of geographical areas, known as “polygons.” The coordinates of these elements, their color and other aspects of their styling, and the information bubbles that may appear when the elements are clicked, can all be encoded in the KML.

Here, for example, is a simple KML file coding for an exaggeratedly tall representation of The Pentagon. Notice how the coordinates for the polygon give longitudes and latitudes, in that order, that define the inner and outer boundaries of the building, and locate the “roof” at a height of 100 meters above ground level. <extrude>1</extrude> extends the shape to the ground:

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>The Pentagon</name>
    <Polygon>
      <extrude>1</extrude>
      <altitudeMode>relativeToGround</altitudeMode>
      <outerBoundaryIs>
        <LinearRing>
          <coordinates>
            -77.05788457660967,38.87253259892824,100 
            -77.05465973756702,38.87291016281703,100 
            -77.05315536854791,38.87053267794386,100 
            -77.05552622493516,38.868757801256,100 
            -77.05844056290393,38.86996206506943,100 
            -77.05788457660967,38.87253259892824,100
          </coordinates>
        </LinearRing>
      </outerBoundaryIs>
      <innerBoundaryIs>
        <LinearRing>
          <coordinates>
            -77.05668055019126,38.87154239798456,100 
            -77.05542625960818,38.87167890344077,100 
            -77.05485125901024,38.87076535397792,100 
            -77.05577677433152,38.87008686581446,100 
            -77.05691162017543,38.87054446963351,100 
            -77.05668055019126,38.87154239798456,100
          </coordinates>
        </LinearRing>
      </innerBoundaryIs>
    </Polygon>
  </Placemark>
</kml>

Here is how this file displays in Google Earth:

(Source: Google Earth)

See Google’s tutorial and reference for a guide to the tags that can be used to code KML.
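Because KML is XML, it can be read with standard XML tools. Here is a minimal sketch using Python's built-in ElementTree module to pull the coordinates out of each LinearRing. The function name is my own, and the sample is a cut-down, made-up triangle rather than the full Pentagon file above.

```python
import xml.etree.ElementTree as ET

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}


def read_rings(kml_text):
    """Return a list of rings, each a list of (longitude, latitude, altitude) tuples."""
    root = ET.fromstring(kml_text)
    rings = []
    for element in root.iterfind(".//kml:LinearRing/kml:coordinates", KML_NS):
        ring = [tuple(float(part) for part in triple.split(","))
                for triple in element.text.split()]
        rings.append(ring)
    return rings


# Cut-down sample for illustration; not the full file shown above.
sample = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark><Polygon><outerBoundaryIs><LinearRing><coordinates>
    -77.0579,38.8725,100 -77.0547,38.8729,100
    -77.0532,38.8705,100 -77.0579,38.8725,100
  </coordinates></LinearRing></outerBoundaryIs></Polygon></Placemark>
</kml>"""

for ring in read_rings(sample):
    print(len(ring), ring[0])
```

Note that each ring ends by repeating its first point, which is how KML closes a boundary.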

KML can also be compressed into KMZ files. To create a KMZ file from KML, open the file in Google Earth, right-click on the file in the Places panel, select Save Place As, and then select KMZ under format.

KML has been adopted as a standard for geographic data, and so can be used by a wide range of mapping applications, including Geographic Information Systems (GIS) software.

GeoJSON

GeoJSON is a variant of JSON developed for encoding geographic data, commonly used for data-driven online maps. Its overall structure is the same as conventional JSON. Each Feature has properties, which can be any data related to the feature, and a geometry, which includes its type (point, polygon and so on) and its coordinates, given with longitude before latitude. Features can be grouped into a FeatureCollection. Here, for example, are the first ten addresses we geocoded earlier using the Bing maps API, encoded as GeoJSON (you may need to scroll to the right to see all of the data):

{
"type": "FeatureCollection",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },

"features": [
{ "type": "Feature", "properties": { "address": "1800 25th St, San Francisco, CA, 94107", "bing_type": "Address", "bing_confidence": "High", "bing_latitude": 37.753067, "bing_longitude": -122.397484 }, "geometry": { "type": "Point", "coordinates": [ -122.397484, 37.753067 ] } },
{ "type": "Feature", "properties": { "address": "302 Silver Ave, San Francisco, CA, 94112", "bing_type": "Address", "bing_confidence": "High", "bing_latitude": 37.727722, "bing_longitude": -122.430573 }, "geometry": { "type": "Point", "coordinates": [ -122.430573, 37.727722 ] } },
{ "type": "Feature", "properties": { "address": "425 7th St, San Francisco, CA, 94103", "bing_type": "Address", "bing_confidence": "High", "bing_latitude": 37.775398, "bing_longitude": -122.404366 }, "geometry": { "type": "Point", "coordinates": [ -122.404366, 37.775398 ] } },
{ "type": "Feature", "properties": { "address": "2001 Chestnut St, San Francisco, CA, 94123", "bing_type": "Address", "bing_confidence": "High", "bing_latitude": 37.800575, "bing_longitude": -122.436508 }, "geometry": { "type": "Point", "coordinates": [ -122.436508, 37.800575 ] } },
{ "type": "Feature", "properties": { "address": "952 Sutter St, San Francisco, CA, 94109", "bing_type": "Address", "bing_confidence": "High", "bing_latitude": 37.788548, "bing_longitude": -122.416161 }, "geometry": { "type": "Point", "coordinates": [ -122.416161, 37.788548 ] } },
{ "type": "Feature", "properties": { "address": "105 Palm Ave, San Francisco, CA, 94118", "bing_type": "Address", "bing_confidence": "High", "bing_latitude": 37.783619, "bing_longitude": -122.458206 }, "geometry": { "type": "Point", "coordinates": [ -122.458206, 37.783619 ] } },
{ "type": "Feature", "properties": { "address": "1111 California St, San Francisco, CA, 94108", "bing_type": "Address", "bing_confidence": "High", "bing_latitude": 37.791183, "bing_longitude": -122.412956 }, "geometry": { "type": "Point", "coordinates": [ -122.412956, 37.791183 ] } },
{ "type": "Feature", "properties": { "address": "1501 Larkin St, San Francisco, CA, 94109", "bing_type": "Address", "bing_confidence": "High", "bing_latitude": 37.791859, "bing_longitude": -122.419556 }, "geometry": { "type": "Point", "coordinates": [ -122.419556, 37.791859 ] } },
{ "type": "Feature", "properties": { "address": "4508 Balboa St, San Francisco, CA, 94121", "bing_type": "Address", "bing_confidence": "High", "bing_latitude": 37.775490, "bing_longitude": -122.507271 }, "geometry": { "type": "Point", "coordinates": [ -122.507271, 37.77549 ] } },
{ "type": "Feature", "properties": { "address": "101 Turk St, San Francisco, CA, 94102", "bing_type": "Address", "bing_confidence": "High", "bing_latitude": 37.782890, "bing_longitude": -122.411064 }, "geometry": { "type": "Point", "coordinates": [ -122.411064, 37.78289 ] } }
]
}

See the full GeoJSON specification for more details.
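Because GeoJSON is ordinary JSON, it can be written and read with any JSON library. As a minimal sketch, the following Python code builds a FeatureCollection from two of the addresses above and reads it back. The helper make_point_feature and the rows list are our own illustrative names, not part of any library:

```python
import json


def make_point_feature(longitude, latitude, properties):
    """Build a GeoJSON Point feature; note that longitude comes first."""
    return {
        "type": "Feature",
        "properties": properties,
        "geometry": {"type": "Point", "coordinates": [longitude, latitude]},
    }


# Two of the geocoded addresses from the example above.
rows = [
    {
        "address": "1800 25th St, San Francisco, CA, 94107",
        "latitude": 37.753067,
        "longitude": -122.397484,
    },
    {
        "address": "302 Silver Ave, San Francisco, CA, 94112",
        "latitude": 37.727722,
        "longitude": -122.430573,
    },
]

collection = {
    "type": "FeatureCollection",
    "features": [
        make_point_feature(
            row["longitude"], row["latitude"], {"address": row["address"]}
        )
        for row in rows
    ],
}

# Serialize to text, then parse it again as any GeoJSON consumer would.
geojson_text = json.dumps(collection, indent=2)
parsed = json.loads(geojson_text)

for feature in parsed["features"]:
    lon, lat = feature["geometry"]["coordinates"]
    print(feature["properties"]["address"], lat, lon)
```

Swapping latitude and longitude is a common mistake when building GeoJSON by hand, which is why the helper takes longitude as its first argument.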

TopoJSON is an extension of GeoJSON that is more compact, because polygons are built from shared line segments, called arcs, rather than each storing its entire boundary. This means that the boundary between California and Nevada, for instance, is stored only once, rather than twice, once for each state. This keeps file sizes small, which can be advantageous when data must be loaded and rendered in a web browser.
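The idea of sharing boundaries can be illustrated with a toy example. The Python sketch below is a simplification, not the real TopoJSON format, which also quantizes and delta-encodes coordinates. It does borrow one genuine convention: a negative arc index means "use the arc at the ones' complement of this index, reversed". The two shapes and all names are invented for illustration:

```python
# Three arcs for two neighboring shapes, A (left) and B (right).
arcs = [
    # Arc 0: the wiggly border shared by A and B, stored only once.
    [(1.0, 0.0), (1.1, 0.3), (0.9, 0.7), (1.0, 1.0)],
    # Arc 1: the rest of A's boundary.
    [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 0.0)],
    # Arc 2: the rest of B's boundary.
    [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
]

# Each shape lists the arcs that make up its ring.
# ~0 (which equals -1) means arc 0 traversed in reverse.
shapes = {"A": [0, 1], "B": [2, ~0]}


def stitch(arc_indexes, arcs):
    """Join arcs end to end into one closed ring of points."""
    ring = []
    for index in arc_indexes:
        points = arcs[index] if index >= 0 else list(reversed(arcs[~index]))
        # After the first arc, drop the leading point: it repeats the
        # last point of the previous arc.
        ring.extend(points if not ring else points[1:])
    return ring


ring_a = stitch(shapes["A"], arcs)
ring_b = stitch(shapes["B"], arcs)

stored_points = sum(len(arc) for arc in arcs)
expanded_points = len(ring_a) + len(ring_b)
print(stored_points, "points stored as arcs")
print(expanded_points, "points if each shape stored its full boundary")
```

The saving here is small because the shapes are tiny, but it grows with the length and detail of the shared borders.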

Shapefile

This is a geodata format developed by ESRI, manufacturer of ArcGIS, the leading commercial GIS application. Shapefiles can represent elements including points, lines and polygons, and can also include information on map projection and datums.

Shapefiles are usually made available for download as zipped folders, and actually consist of a series of files. At a minimum, a shapefile must contain three component files, with the same root name and the following extensions:

.shp The main file containing the geometry of the points, lines or polygons mapped in the shapefile.
.dbf A database file in dBASE format containing a table of data relating to the components of the geometry. For example, in a world shapefile giving national boundaries, this table might contain data about the countries including their names, capital cities, population, annual GDP and so on.
.shx A positional index of the shapefile’s geometry.

There are several optional file types that may also be included, including a .prj file, which defines the map projection and datum to be used when loading the shapefile into GIS software. Refer to ESRI’s technical specification and the informative Wikipedia entry for more details.
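When a downloaded shapefile refuses to load, a missing component file is a common cause. As a rough sketch, the Python function below checks a list of file names for the three required extensions. The function name and the example file names are hypothetical:

```python
from pathlib import PurePosixPath

# The three component files every shapefile needs.
REQUIRED_EXTENSIONS = {".shp", ".shx", ".dbf"}


def missing_components(filenames):
    """Return {root name: sorted missing extensions} for each .shp found."""
    extensions_by_root = {}
    for name in filenames:
        path = PurePosixPath(name)
        root = str(path.with_suffix(""))  # file name without its extension
        extensions_by_root.setdefault(root, set()).add(path.suffix.lower())

    report = {}
    for root, extensions in extensions_by_root.items():
        if ".shp" in extensions:  # only report on actual shapefiles
            report[root] = sorted(REQUIRED_EXTENSIONS - extensions)
    return report


# Hypothetical folder contents: "rivers" is missing its .shx index.
names = [
    "countries.shp",
    "countries.shx",
    "countries.dbf",
    "countries.prj",
    "rivers.shp",
    "rivers.dbf",
]
print(missing_components(names))
# prints {'countries': [], 'rivers': ['.shx']}
```

For a zipped download, the list of names could come from the namelist() method of Python's zipfile.ZipFile.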

Many government agencies, such as the U.S. Census Bureau, provide data for mapping as shapefiles. You can also download shapefiles from repositories such as Natural Earth.

You can open and edit .dbf files in LibreOffice Calc. As long as you save the .dbf file with the same name, your changes to the data will be incorporated into the shapefile.

Converting between geodata formats

We will later learn how to use QGIS to convert between the main geodata formats. In addition, this site converts shapefiles to GeoJSON and TopoJSON.

Assignment

Decide which dataset(s) you wish to explore for your final project. You can use one of the suggested datasets, adding further data as appropriate. You are also free to pursue a story in other datasets, with my agreement.

Frame some questions you intend to address, or a potential story you wish to pursue in the data.
Produce some initial sketches, using the tools we’ve worked with so far.
Send those notes and sketches to me before next week’s class.

This is an open-ended assignment. What you get from it, and ultimately the quality of your final project, will depend on the energy, rigor and imagination with which you pursue this assignment.