
I'm trying to rewrite a portion of my code with the latest features from dplyr, replacing data.frame() with data_frame() and cbind() with bind_cols():

library(rgeos)
library(rgdal)  # for readOGR()
library(dplyr)

mc <- montreal %>%
  gCentroid(byid=TRUE) %>%
  data.frame %>%
  cbind(., name = montreal[["NOM"]])

When I try to replace data.frame with data_frame I get:

Error: data_frames can only contain 1d atomic vectors and lists

And when I try to replace cbind with bind_cols I get:

Error: object at index 2 not a data.frame

Would there be a way to make this work?

Here, `montreal` is a `SpatialPolygonsDataFrame`:

GEOJSON file: http://elm.bi/limadmin.json

montreal <- readOGR("data/limadmin.json", "OGRGeoJSON")
Steven Beaupré
  • Your `dput` output is not able to be reconstructed because of the S4 objects. You may want to try the solution to [this question](http://stackoverflow.com/questions/3466599/dputting-an-s4-object) and return with reproducible data. – cdeterman Jan 14 '15 at 13:13
  • Could you provide the lines of code to create it from scratch instead of `dput`? – cdeterman Jan 14 '15 at 13:28
  • I edited the post accordingly – Steven Beaupré Jan 14 '15 at 14:14
  • I can't read your file, but I think the errors say it all. Have you tried `bind_cols(., data_frame(name=montreal[["NOM"]]))` on the last line? – Khashaa Jan 14 '15 at 14:59
  • How about `mc <- montreal %>% gCentroid(byid=TRUE) %>% data.frame %>% bind_cols(., data_frame(name=montreal[["NOM"]]))` – Khashaa Jan 14 '15 at 15:07
  • It works. Should I conclude I cannot get rid of `data.frame` because the initial input is not a supported format to `data_frame` ? – Steven Beaupré Jan 14 '15 at 15:11
  • To my limited grasp of it, yes. And, supported formats are `data.frame`s or `list of data.frame`s, hence the speed improvement. – Khashaa Jan 14 '15 at 15:17
  • `data_frame()` isn't the same as `data.frame()`. `data.frame()` does lots and lots of things. `data_frame()` does only one - it creates one column for each input. – hadley Jan 15 '15 at 12:24
  • @hadley After reading thoroughly the vignette on data_frame(), it is now crystal clear to me. Thanks. – Steven Beaupré Jan 15 '15 at 15:53
  • Awesome - glad the vignette helped :) – hadley Jan 15 '15 at 20:38
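hadley's distinction is worth seeing in action. Here is a minimal, self-contained sketch with a toy matrix (no spatial packages needed). `data.frame()` runs its input through `as.data.frame()`, so a matrix is silently split into one column per matrix column; `data_frame()` (since deprecated in favour of `tibble()`) turns each input into exactly one column. In older versions it refused non-1d inputs outright, which is the error above; newer versions keep the matrix as a single matrix-column. Either way, one input never becomes several columns:

```r
library(tibble)  # home of data_frame(), now superseded by tibble()

m <- matrix(1:4, nrow = 2)

# data.frame() coerces: the 2x2 matrix becomes two columns (X1, X2)
ncol(data.frame(m))  # 2

# tibble() does only one thing: each input is exactly one column,
# so the matrix stays a single (matrix) column
ncol(tibble(m))      # 1
```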

1 Answer


So I ended up running `microbenchmark` on the two approaches, because it felt a bit odd using:

mc <- montreal %>% 
    gCentroid(byid=TRUE) %>% 
    data.frame %>% 
    bind_cols(., data_frame(name=montreal[["NOM"]]))

I tried with two different datasets:

world <- readOGR("data/world.json", "OGRGeoJSON")

wmbm = microbenchmark(
  base = world %>% 
    gCentroid(byid=TRUE) %>% 
    data.frame %>% 
    cbind(., name=world[["name"]]),
  dplyr = world %>% 
    gCentroid(byid=TRUE) %>% 
    data.frame %>% 
    bind_cols(., data_frame(name=world[["name"]])),
  times=100
)

Microbenchmark results:

Unit: milliseconds
  expr      min       lq     mean   median       uq      max neval
  base 13.78396 14.08301 14.21357 14.12023 14.16435 20.04362   100
 dplyr 13.87098 14.10680 14.25245 14.14330 14.18020 17.63248   100


montreal <- readOGR("data/limadmin.json", "OGRGeoJSON")

lmbm = microbenchmark(
  base = montreal %>% 
    gCentroid(byid=TRUE) %>% 
    data.frame %>% 
    cbind(., name=montreal[["NOM"]]),
  dplyr = montreal %>% 
    gCentroid(byid=TRUE) %>% 
    data.frame %>% 
    bind_cols(., data_frame(name=montreal[["NOM"]])),
  times=100
)

Microbenchmark results:

Unit: milliseconds
  expr      min       lq     mean   median       uq      max neval
  base 1.597957 1.628723 1.736709 1.651747 1.686554 3.091738   100
 dplyr 1.621092 1.642678 1.756978 1.659041 1.739707 3.751866   100


No real conclusion here. Even though the dplyr version seems marginally slower, I'll stick with it for consistency.
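For reference, the wrapping pattern generalizes beyond spatial data. A toy sketch (the data frame below merely stands in for the output of `gCentroid() %>% data.frame`; no rgeos required) showing why the name vector must be wrapped in a one-column data frame before `bind_cols()`:

```r
library(dplyr)

# Stand-in for the centroid coordinates produced by gCentroid() %>% data.frame
centroids <- data.frame(x = c(1.5, 2.5), y = c(3.0, 4.0))

# bind_cols() expects data frames, so wrap the vector first
# (data_frame() at the time; tibble() in current releases)
mc <- centroids %>% bind_cols(data_frame(name = c("A", "B")))

names(mc)  # "x" "y" "name"
```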

Steven Beaupré