
I am generating a map with 6.8 million distinct points (latitude and longitude combinations) using ggmap and ggplot2. I succeeded, but it took quite a while (6 hours).

I noticed that R does not use all of my cores, so my next challenge is to use all the computing power I have at my disposal.

How would I do that?

I know the `parallel` package can do this, but since I am still learning R, I am not sure what to do with it (see the sketch after my code below). Here is a sample of the data (it comes from the NYC open data platform):

df <- structure(list(pickup_datetime = structure(c(19L, 7L, 13L, 10L, 
9L, 9L, 14L, 4L, 16L, 1L, 3L, 12L, 18L, 11L, 2L, 17L, 5L, 15L, 
8L, 6L), .Label = c("01/02/2015 03:40:12 PM", "01/04/2015 01:03:42 AM", 
"01/05/2015 12:22:10 PM", "01/05/2015 12:58:10 PM", "01/06/2015 02:16:47 PM", 
"01/08/2015 12:19:51 PM", "01/09/2015 03:45:22 PM", "01/10/2015 07:15:39 PM", 
"01/11/2015 08:37:20 PM", "01/13/2015 06:57:29 PM", "01/15/2015 03:03:59 AM", 
"01/15/2015 10:55:29 PM", "01/16/2015 10:07:38 PM", "01/21/2015 02:04:33 AM", 
"01/22/2015 04:48:35 PM", "01/23/2015 11:14:52 PM", "01/24/2015 06:35:44 PM", 
"01/25/2015 07:32:09 PM", "01/27/2015 07:30:40 PM"), class = "factor"), 
    Pickup_latitude = c(40.8353157043457, 40.6699333190918, 40.7466583251953, 
    40.7337608337402, 40.8157424926758, 40.8157424926758, 40.7239418029785, 
    40.8073272705078, 40.7512817382812, 40.8260154724121, 40.7934989929199, 
    40.7457313537598, 40.6872291564941, 40.6822357177734, 40.8117980957031, 
    40.7610969543457, 40.7501640319824, 40.7329254150391, 40.7140312194824, 
    40.8164672851562), Pickup_longitude = c(-73.9201583862305, 
    -73.9856719970703, -73.8925704956055, -73.8689346313477, 
    -73.9182586669922, -73.9182586669922, -73.950813293457, -73.9444198608398, 
    -73.9399795532227, -73.9514389038086, -73.9496078491211, 
    -73.9035873413086, -73.990119934082, -73.9935302734375, -73.9296035766602, 
    -73.9349060058594, -73.8618927001953, -73.9548034667969, 
    -73.9550933837891, -73.953971862793)), .Names = c("pickup_datetime", 
"Pickup_latitude", "Pickup_longitude"), row.names = c(NA, 20L
), class = "data.frame")

Here is my code:

library(plyr)
# Count how many pickups share each coordinate pair; plyr's count()
# returns the tally in a "freq" column.
pickup <- count(df, c("Pickup_latitude", "Pickup_longitude"))
# Detach plyr so it does not mask dplyr's functions.
detach("package:plyr", unload = TRUE)
library(dplyr)
# Drop the rows where both coordinates are zero (missing GPS fixes).
pickup <- filter(pickup, Pickup_latitude != 0 | Pickup_longitude != 0)
library(ggplot2)
library(ggmap)
library(maps)
basemap <- get_map(location = c(lon = -73.8896695, lat = 40.74086), zoom = 11)
# Reference the columns directly in aes() rather than copying them into
# free-standing vectors.
map1 <- ggmap(basemap, extent = "panel",
              base_layer = ggplot(pickup, aes(x = Pickup_longitude,
                                              y = Pickup_latitude)))
map2 <- map1 + geom_point(color = "blue", size = 0.05)
print(map2)
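
Since the rendering itself runs on a single thread (as several commenters note below), what can be spread across cores is the aggregation step before plotting. Here is a minimal sketch of that idea with the `parallel` package, assuming the counting step, not the drawing, is the part worth parallelising; `n_cores`, `chunks`, and `partial_counts` are illustrative names of my own, `df` stands in for the full table, and dplyr's count() used here returns an `n` column rather than plyr's `freq`:

library(parallel)
library(dplyr)

# Use all but one core.
n_cores <- max(1, detectCores() - 1)

# Split the rows into one chunk per core.
chunks <- split(df, cut(seq_len(nrow(df)), n_cores, labels = FALSE))

# Count coordinate pairs within each chunk in parallel.
# Note: mclapply() forks and silently falls back to serial execution
# on Windows; use makeCluster()/parLapply() there instead.
partial_counts <- mclapply(chunks, function(chunk) {
  count(chunk, Pickup_latitude, Pickup_longitude)
}, mc.cores = n_cores)

# Combine the per-chunk tallies and re-aggregate across chunks.
pickup <- bind_rows(partial_counts) %>%
  group_by(Pickup_latitude, Pickup_longitude) %>%
  summarise(n = sum(n)) %>%
  ungroup()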

Thank you,

nickfrenchy
  • I think since it's very unlikely that you can render 6.8 million points in such a way that they're all visible, the usual advice is to aggregate the data in some sensible way first (kernel density estimation, hexagonal binning, ...). Can you please include data that will provide us with a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? – Ben Bolker May 20 '16 at 20:06
  • Also, it is unlikely that you will be able to use all the cores, as R is single-threaded. Packages like `parallel` allow for multi-core use, but I don't think this is available for graphics. – lmo May 20 '16 at 20:12
  • The slow rendering probably has nothing to do with the CPU; it is more a matter of how ggplot uses GPU resources. See a related link here: http://stackoverflow.com/questions/8364288/what-hardware-limits-plotting-speed-in-r Unless there is a way to ask R or ggplot2 to use GPU acceleration, this won't be an easy task. Instead, you should focus on some aggregation techniques as suggested by @BenBolker – Xiongbing Jin May 20 '16 at 20:15
  • In general, I have not been able to plot "large" data sets in `ggplot2` in a "reasonable" amount of time; I sometimes use some of the base R functions for moderate data sizes. In other cases, doing some sort of aggregation (e.g. the max value in some bucket) is helpful. Is it possible for you to do that? If one were to subsequently zoom in (i.e. look at a smaller data set), you could then increase the resolution. – steveb May 20 '16 at 20:16
  • I figured using the whole dataset would be a bit too much, but I am experimenting with R, so I thought - why not? I agree about the aggregation though. I am trying to map every coordinate of the 2015 green cab pickups in NYC, as they are bound by certain location rules. – nickfrenchy May 20 '16 at 20:33
  • The full data are available online if anyone wants to experiment with them: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml An interesting example of publicly available `big data` – Xiongbing Jin May 20 '16 at 20:38
  • I'm not sure I understand the question. Are you trying to multi-thread the rendering of `ggplot2`? Or use HW acceleration on the GPU? – alexwhitworth May 20 '16 at 21:16
  • I guess it depends, then. I want to render the map and its 6.8 million points using the maximum resources I have at my disposal, so if it's CPU-intensive, I want to be able to use all of my cores instead of just one. However, if it's GPU-bound, it might be a bit trickier, I suppose, but if possible I'd like to know how. – nickfrenchy May 21 '16 at 00:55
  • So I tried this: `ddply(pickup, .fun = function(x) { map1 <- ggmap(basemap, extent='panel', base_layer=ggplot(pickup, aes(x=longitude, y=latitude))); plot <- map1 + geom_point(color = "blue", size = 0.05); print(plot) }, .parallel = TRUE)`. Of course, it's missing the `.variables` argument, but I am really confused as to the values I'm supposed to put there. Any ideas? – nickfrenchy May 23 '16 at 18:26
  • Apparently, there are some initiatives to use `GPU` in `ggplot2` http://dspace.bracu.ac.bd/xmlui/bitstream/handle/10361/6398/12201051%20%26%2012201036_CSE.pdf?sequence=1&isAllowed=y – rafa.pereira Mar 13 '17 at 17:20
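
For reference on the `ddply` attempt above: plyr's `.parallel = TRUE` only takes effect once a foreach backend has been registered, for example via the doParallel package, and it parallelises the per-group computation, not the rendering of a single plot:

library(doParallel)
# Register a parallel backend so plyr calls with .parallel = TRUE have
# workers to dispatch to; this speeds up per-group computation only.
registerDoParallel(cores = parallel::detectCores() - 1)

And here is a minimal sketch of the aggregation route suggested in the comments above, using hexagonal binning via ggplot2's geom_hex (which requires the hexbin package to be installed); `hex_map` is an illustrative name and the bins value is an arbitrary starting point:

library(ggplot2)
library(ggmap)

# Bin the raw pickups into hexagons so only a few thousand polygons
# are drawn instead of millions of individual points.
hex_map <- ggmap(basemap, extent = "panel",
                 base_layer = ggplot(df, aes(x = Pickup_longitude,
                                             y = Pickup_latitude))) +
  geom_hex(bins = 60, alpha = 0.7)
print(hex_map)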
