Spatial Data in R

The motivation to want to draw on map tiles directly within R is twofold.
While the R environment boasts a long history of spatial analysis tools and libraries, the plots have traditionally been created without any spatial context that a map would provide.

The meuse data set is the “Hello world” equivalent of the sp package. I quote:

It gives locations and topsoil heavy metal concentrations, along with a number of soil and landscape variables at the observation locations, collected in a flood plain of the river Meuse, near the village of Stein (NL). Heavy metal concentrations are from composite samples of an area of approximately 15 m x 15m.

Note the “strange” coordinate system

#require(sp)    
data(meuse) 
head(meuse[,1:8])   
##        x      y cadmium copper lead zinc  elev       dist
## 1 181072 333611    11.7     85  299 1022 7.909 0.00135803
## 2 181025 333558     8.6     81  277 1141 6.983 0.01222430
## 3 181165 333537     6.5     68  199  640 7.800 0.10302900
## 4 181298 333484     2.6     81  116  257 7.655 0.19009400
## 5 181307 333330     2.8     48  117  269 7.480 0.27709000
## 6 181390 333260     3.0     61  137  281 7.791 0.36406700
coordinates(meuse) <- c("x", "y")   
#par( mar=c(2,2,2,2), cex.main=1.5,cex.lab=2)       
#bubble(meuse, "zinc", main = "zinc concentrations (ppm)",#maxsize=5,   
#       key.entries =  100+ 100 * 2^(0:4),key.space = "inside") 
LevCols = rev(heat.colors(5))[cut(meuse$lead,breaks=c(0,100,200,300,400,Inf),labels=FALSE)] 
    
plot(meuse, bg=LevCols, col="grey",main="Lead Distribution", pch=21, axes=TRUE) 

There are times when such a plot will be the preferred choice: a clean and simple graph of the locations of interest with no clutter and no distractions. The analyst can focus on the pattern itself and the marker attributes.

However, a lot of the modern massive data sets gathered such as location information from mobile devices, surveys, crime data, vehicle tracks, demographic data, etc. require a map based spatial context for even the most basic data explorations.
In those settings, the somewhat narrow or ``blind’’ exploration of spatial data on a blank background can be rather limiting and often leads to less insight than would be possible had the data been graphed on a map canvas.
Anecdotally, we dare to speculate that the British phyisican John Snow might have been slower to identify contaminated drinking water as the cause of the cholera epidemic in 1854, had he not used a dot map to illustrate the cluster of cholera cases around one of the pumps in London.

Geo referenced and temporally recorded data

This tutorial uses two crime data set as examples for spatiotemporal data:

  1. A data set that contains all of the police incidents (except homicide and manslaughter) that were reported in San Francisco in 2012. The full data set is freely available for download ( https://data.sfgov.org/ ) and contains the date, time of day, and day of week that each incident was reported, along with the incident category, a brief description, and location (as police district, lat-long coordinates, and address to the nearest 100 block).

  2. A lightly cleaned Houston crime data set from January 2010 to August 2010 geocoded with Google Maps

Let us first load and inspect the San Francisco data set:

load("data/SFincidents2012.rda")#incidents  
head(incidents) 
##   IncidntNum      Category                    Descript DayOfWeek
## 1  120000499       ROBBERY       ROBBERY, BODILY FORCE    Sunday
## 2  120000938  NON-CRIMINAL DEATH REPORT, CAUSE UNKNOWN    Sunday
## 3  120001936       SUICIDE          SUICIDE BY JUMPING    Sunday
## 4  120002235 VEHICLE THEFT           STOLEN AUTOMOBILE    Sunday
## 5  120003186  NON-CRIMINAL DEATH REPORT, CAUSE UNKNOWN    Monday
## 6  120000041       ASSAULT                     BATTERY    Sunday
##         Date  Time PdDistrict      Resolution                  Location
## 1 2012-01-01 01:50    CENTRAL JUVENILE BOOKED    VALLEJO ST / POWELL ST
## 2 2012-01-01 06:40   NORTHERN            NONE  800 Block of OFARRELL ST
## 3 2012-01-01 14:45   NORTHERN            NONE    1200 Block of GOUGH ST
## 4 2012-01-01 17:00    MISSION            NONE   1600 Block of BRYANT ST
## 5 2012-01-02 08:30    CENTRAL            NONE 1200 Block of STOCKTON ST
## 6 2012-01-01 00:25   SOUTHERN            NONE    200 Block of MARKET ST
##           X        Y violent censusBlock
## 1 -122.4105 37.79837    TRUE 06075010700
## 2 -122.4183 37.78516   FALSE 06075012202
## 3 -122.4244 37.78436   FALSE 06075016000
## 4 -122.4105 37.76562   FALSE 06075017700
## 5 -122.4084 37.79679   FALSE 06075010700
## 6 -122.3974 37.79244    TRUE 06075011700

We defined the following crimes as violent: as assault, robbery, rape, kidnapping, and purse snatching.

Exercises to get to know the data set:

1. Compute the overall proportion of violent crime.

2. Display the top ten crime categories as a barplot. An example is shown below.

Spatial View of the data

Let us take a look at the data using the lat/lon coordinates.
It is easy enough to directly create a 2D scatterplot of the data color coding violence:

  #for speed let us plot just a sample: 
  i = sample(nrow(incidents),10000) 
  ViolCol=c("green","red")[as.numeric(incidents[i,"violent"])]  
  plot(Y ~ X, data = incidents[i,], col = ViolCol,  
       cex = 0.5, pch=20,xlab="lon", ylab="lat")    

If you are familiar with San Francisco you can almost discern the city boundaries as well as the major street grid from this plot !

3. Small exercise on the side: can you add transparency to the plot from before ?

What is the big deal with plotting lat/lon anyhow?

If you go back to the meuse data and check out the example code you find snippets as follows:

crs = CRS("+init=epsg:28992") # set original projection 
data("meuse")   
coordinates(meuse) <- ~x+y  
proj4string(meuse) <- crs   
    
merc = CRS("+init=epsg:3857")   
WGS84 = CRS("+init=epsg:4326")  
#this funcion requires rgdal to be installed!   
meuse2 = spTransform(meuse, WGS84)  
    
cbind(head(coordinates(meuse)),head(coordinates(meuse2)))   
##        x      y        x        y
## 1 181072 333611 5.758536 50.99156
## 2 181025 333558 5.757863 50.99109
## 3 181165 333537 5.759855 50.99089
## 4 181298 333484 5.761746 50.99041
## 5 181307 333330 5.761863 50.98903
## 6 181390 333260 5.763040 50.98839
plot(meuse2,    
     bg=rev(heat.colors(5))[cut(meuse2$lead,breaks=c(0,100,200,300,400,Inf),labels=FALSE)], 
     col="grey",main="Lead Distribution", pch=21, axes=TRUE)    

Gauss’s Theorema Egregium proved that a sphere’s surface cannot be represented on a plane without distortion. The same applies to other reference surfaces used as models for the Earth. Since any map projection is a representation of one of those surfaces on a plane, all map projections distort. Every distinct map projection distorts in a distinct way. The many choices of map projections is the result of different people/problems/contexts wanting to preserve different metrics such as

Strange spherical geometry

The Mercator projection exaggerates areas far from the equator. For example:

The Mercator projection portrays Greenland as larger than Australia; in actuality, Australia is more than three and a half times larger than Greenland.

The Mercator projection portrays Greenland as larger than Australia; in actuality, Australia is more than three and a half times larger than Greenland.

The classic way of showing the distortion inherent in a projection is to use Tissot’s indicatrix.

Great Circles

LA <- c(-118.40, 33.95) 
NY <- c(-73.78,  40.63) 
data(wrld)  
plot(wrld, type='l')    
gc <- greatCircle(LA, NY)   
lines(gc, lwd=2, col='blue')    

How can we quickly find the lat/lon coordinates of cities?

Web interfaces, such as http://www.gpsvisualizer.com/geocode

Or the built-in function getGeoCode() from the RgoogleMaps package.

Geodesic Distances

How do we compute the distance between two points on the surface of the earth?

#LA-NY example from above (in km):  
cat("The geodesic distance between LA and NY is ", distHaversine(LA,NY)/1000,"km \n")   
## The geodesic distance between LA and NY is  3977.614 km

Note that the order is (lon,lat) for the points passed to distHaversine!

Geo Exercises

4. Plot and compute the great circle distance between Sydney and Aalborg

#first geocode the two cities:  

5. Inferring from the picture below, verify that increasing the latitude difference can decrease the geodesic distance!

The origin is at (lon=0,lat=80)

#destPoint(LA, b=65, d=100000)  
circle=destPoint(c(0,80), b=1:365, d=1000000)   
circle2=destPoint(c(0,80), b=1:365, d=800000)   
circle3=destPoint(c(0,80), b=1:365, d=100000)   
plot(circle, type='l')  
polygon(circle, col='blue', border='black', lwd=2)  
polygon(circle2, col='red', lwd=2, border='red')    
polygon(circle3, col='white', lwd=2, border='black')    
rect(circle2[61,"lon"],circle2[61,"lat"],circle2[61,"lon"]+4,circle2[61,"lat"]+4,   
     border="orange",lwd=2.5)   

Hint: The lower left corner of the rectangle is lon=44.04516, lat=80.975, and the latitude somewhere in the middle of it is lat=82.77

lat=80  
pnt=c(0,lat)    
circle2=destPoint(pnt, b=1:365, d=800000)   
#find the touching tangent: 
line <- rbind(circle2[61,], circle2[61,]+c(0,4))    
d = dist2Line(pnt, line)    
#plot( makeLine(line), type='l')    
circle1=destPoint(pnt, b=1:365, d=d[,1])    
plot(circle, type='n');grid()   
# points(pnt, col='purple', pch='x',cex=2)  
# plot(c(-5,5),c(79.25,80.75), type='n');grid() 
# points(matrix(c(0,80),nrow=1), col='purple', pch='x',cex=2)   
points(0,lat, col='purple', pch='x',cex=1.25)   
xy.coords(c(0,80))  
## $x
## [1] 1 2
## 
## $y
## [1]  0 80
## 
## $xlab
## [1] "Index"
## 
## $ylab
## NULL
#points(0,80, col='purple', pch='x',cex=1.25)   
polygon(circle2, col=rgb(1,0,0,0.1), lwd=2, border='red')   
polygon(circle1, col=rgb(0,0,1,0.1), border='black', lwd=2) 
rect(circle2[61,"lon"],circle2[61,"lat"],circle2[61,"lon"]+4,circle2[61,"lat"]+4,   
     border="orange",lwd=2.5)   
    
points(line)    
    
points(d[,2], d[,3], col='green', pch='x')  
gc1 <- greatCircle(pnt, d[,2:3])    
lines(gc1, lwd=1, col='green')  
gc2 <- greatCircle(pnt, circle2[61,])   
lines(gc2, lwd=1, col='blue')   

#match(T, circle1[,"lat"]<= circle2[61,"lat"])  

Google/Bing/OSM maps and tiles

Although the Mercator projection significantly distorts scale and area (particularly near the poles), it has two important properties that outweigh the scale distortion:

  1. It’s a conformal projection, which means that it preserves the shape of relatively small objects. This is especially important when showing aerial imagery, because we want to avoid distorting the shape of buildings. Square buildings should appear square, not rectangular.

  2. It’s a cylindrical projection, which means that north and south are always straight up and down, and west and east are always straight left and right.

At the lowest level of detail (Level 1), the map is 512 x 512 pixels. At each successive level of detail, the map width and height grow by a factor of 2: Level 2 is 1024 x 1024 pixels, Level 3 is 2048 x 2048 pixels, Level 4 is 4096 x 4096 pixels, and so on.

For example, at level 3, the pixel coordinates range from (0, 0) to (2047, 2047), like this:

Pixel coordinates at level 3

Pixel coordinates at level 3

To optimize the performance of map retrieval and display, the rendered map is cut into tiles of \(256 x 256\) pixels each. As the number of pixels differs at each level of detail, so does the number of tiles:

map width = map height = \(2^{level}\) tiles

Each tile is given XY coordinates ranging from (0, 0) in the upper left to \((2^{level}-1, 2^{level}-1)\) in the lower right.

Quadtree structure

Quadtree structure

If you want to read more about the efficient use of Quadkeys:
https://msdn.microsoft.com/en-us/library/bb259689.aspx

Last Exercise: Using the LatLon2XY function from RgoogleMaps find the XY coordinates of Aalborg University at zoom level 12.

Now that we have a basic understanding of issues surrounding maps, let us move to the 2nd part:

RgoogleMaps Tutorial