3

I have a large dataset that I try to manipulate using dplyr. My data wrangling task requires row level string manipulation.

I am using the default rowwise() function, and the code is working. However, the operation is taking a lot of time to complete.

VR_vehicle_GPSLocation = c("12.36556|0.74518153|xxxxxxxxxx", 
      "-51.75810|165.55526|xxxxxxxxxx", 
      "GPS nicht verfügbar",
      "48.77410|171.08364|xxxxxxxxxx", 
      "GPS Not Available",
      "0|0|N/R",
      "32.18661| 170.56615|xxxxxxxxxx")
df = data.frame(VR_vehicle_GPSLocation)

jobs_location <- df %>%
     rowwise() %>% 
     mutate(latitude  = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[1]),
            longitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[2])) %>%
     select(latitude, longitude)

In order to speed up the process, I explored the multidyplyr library without success, I am getting an error message saying that my dataset is not a data frame.

jobs_location <- jobs %>%
  partition() %>%
  rowwise() %>% 
  mutate(latitude  = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[1]),
         longitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[2])) %>%
  collect()
Michael
  • 2,436
  • 1
  • 36
  • 57
  • 1
    First of all we don't have a MWE, so we can't really help you. Second of all, by looking at your code, I doubt you need to run `strsplit` by row. You probably could easily do the whole thing at once using `data.table::tstrsplit`. Third of all, if you want fast splits, don't use regex and don't run `as.charcter` per row (twice each time!). i.e., `VR_vehicle_GPSLocation` should be already a character before you start doing stuff and instead of `'\\|'` use `|` combined with `fixed = TRUE`. But there again, we need a MWE. – David Arenburg Feb 20 '17 at 12:11
  • Thanks @DavidArenburg I am busy creating an example so it will make more sense. – Michael Feb 20 '17 at 12:27
  • 1
    `library(data.table) ; setDT(df)[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), c("latitude", "longitude") := tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2, type.convert = TRUE)]` and you good to go. – David Arenburg Feb 20 '17 at 12:42
  • Thanks, it's clearly better. I also read your list of rules of thumbs and it makes perfect sense. I will, next time, think about alternative instead of using `dplyr`. – Michael Feb 20 '17 at 12:47
  • 2
    See my updated comment above. You need to make sure to avoid the gibrish before splitting, otherwise the resulting columns won't be numeric. I've used `|` as the initial filter (the `grep` command) but I'm not sure if this always true in your real data. Either-way, they idea here is not about `dplyr` or not, rather not to do rowwise operations when stuff can be easily vectorized. Also, will be interesting if you'll update if there was a performance improvement. See also [this](http://stackoverflow.com/questions/35022602/rowwise-operation-with-dplyr/35036881#35036881) as this looks similar. – David Arenburg Feb 20 '17 at 12:51

1 Answers1

2

All credit to @DavidArenburg

I approached the problem from a non-efficient angle. Using a vectorized approach clearly improve the performance.

For the sake of completeness, I run the code on a random tiny subset of the whole dataset to evaluate the performance and clearly vectorization is the way to go for my problem.

Finally it is important to mention that a pre-cleaning task is needed to ensure the resulting transformation is numeric (refer to David's comment for more details)

library(dplyr)
library(data.table)
library(microbenchmark)
library(ggplot2)

mbm = microbenchmark(
  a = sample_n(jobs, 100) %>%
    rowwise() %>% 
    mutate(latitude  = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[1]),
           longitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[2])),

  b = setDT(sample_n(jobs, 100))[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), 
                   c("latitude", "longitude") := tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2, type.convert = TRUE)]
)

autoplot(mbm)

A picture is worth a thousand words

enter image description here


Another suggestion from David is to convert the data to numeric after tstrsplit. I added two functions one that is doing the conversion on the overall columns and one that apply the type conversion after the split.

mbm = microbenchmark(
  a = sample_n(jobs, 100) %>%
    rowwise() %>% 
    mutate(latitude  = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[1]),
           longitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[2])),

  b = setDT(sample_n(jobs, 100))[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), 
                                 c("latitude", "longitude") := tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2, type.convert = TRUE)],

  c = sapply(setDT(sample_n(jobs, 100))[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), 
                                 c("latitude", "longitude") := tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2)], as.numeric),

  d = setDT(sample_n(jobs, 100))[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), 
                                        c("latitude", "longitude") := lapply(tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2), as.numeric)]
)
autoplot(mbm)

The last variant (d) is clearly the winner.

enter image description here

Michael
  • 2,436
  • 1
  • 36
  • 57
  • 1
    I think that explicitly converting `latitude` and `longitude` to numeric after splitting instead of using `type.convert = TRUE` within `tstrsplit` will probably double the speed. – David Arenburg Feb 21 '17 at 07:33
  • Thanks, @DavidArenburg - I elaborated the example a bit more based on your suggestion. – Michael Feb 21 '17 at 11:10