
I'm very new to R, but I have a relatively simple mutate command that I'm running on a large dataset (4,000,000+ rows x 150 columns), and it currently takes 15+ hours to run.

The query uses an if_else statement that only looks at a couple of columns - a numeric variable and a date (NB: some of the dates are NA) - and returns a binary flag. It works fine on a small example but seems very inefficient on anything past 1m rows.

I'm using R v3.5, RStudio 1.2.1335 & dplyr 0.8.0.1

## create dummy dataframe
df <- data.frame(diff = c(35,432,-278,6556,90,10,76,-24,6),
      date_a = as.Date(c("2017-03-21","2017-01-08",NA,"2015-12-01",
                    "2019-03-09",NA,"2018-09-06","2017-06-01",
                    "2018-05-30")))
library(dplyr)
df <- df %>%
    mutate(flag = if_else(diff > 21 | is.na(date_a),1,0))

When I chunk the data down to a smaller size, the process takes:

  • 20k rows = 1.9 secs
  • 200k rows = 1.7 mins
  • 500k rows = 10 mins
  • 1m rows = 34 mins

And anything bigger than that takes well beyond what I'm prepared to wait for this...

TimL
  • One quick suggestion would be to try using `dtplyr`, which uses a `data.table` backend – erocoar Sep 05 '19 at 13:03
  • or just use `data.table` directly... – Cath Sep 05 '19 at 13:06
  • Things to try: (1) `+(diff > 21 | is.na(date_a))`. (2) data.table directly, not using any other packages -- examples I have seen indicate that using data.table as a lower layer does not run as fast as if you use it directly, and (3) note that data.table has an `IDate` class. – G. Grothendieck Sep 05 '19 at 13:06
  • @G.Grothendieck worth submitting as an answer? – zx8754 Sep 05 '19 at 13:12
  • Try `as.numeric(diff > 21 | is.na(date_a))` – Clemsang Sep 05 '19 at 13:13
  • @Clemsang it's basically what G. Grothendieck put in his comment, though `as.numeric` is more "acceptable" than `+` to convert logical to 0/1 – Cath Sep 05 '19 at 13:20
  • @Cath `as.numeric` is more efficient as you don't have to sum and convert – Clemsang Sep 05 '19 at 13:21
  • @Clemsang not sure sum is used when there is only one item – Cath Sep 05 '19 at 13:25
  • No way this is a dplyr vs data.table issue. On my machine (nothing special) with a 4M data frame like the one you described, the code took <0.5s to complete. Are you sure the R computation takes 15 hours? What sort of computer are you using? – Hugh Sep 05 '19 at 13:31
  • @Hugh If you have the example data and code already, it would be nice to provide a benchmarking answer as a wiki post. – zx8754 Sep 05 '19 at 13:33
  • TimL, these examples should demonstrate how using `+` instead of `ifelse` converts it to numeric: `as.numeric(TRUE)`, `+(TRUE)`, `as.numeric(10 == 10)`, `+(10 == 10)`. – zx8754 Sep 05 '19 at 13:38
  • @zx8754 Thanks a lot for that, as I say I'm fairly new to this and scrabbling around a bit. Am running the suggested answers and will feedback – TimL Sep 05 '19 at 13:48
  • @Hugh I don't think any of the processes I'm running in situ would be that quick! Bear in mind it's 4m x 150 columns if that impacts at all, and those columns are in all sorts of formats. – TimL Sep 05 '19 at 13:48
  • When you say 'processes' (plural), what do you mean? Are you running multiple scripts in parallel? – Hugh Sep 05 '19 at 13:50
  • @Hugh No - I just mean the consecutive commands to get to the point where I run the query in my initial post. So, for example, the previous command I run is a subtraction of one date from another `df$diff <- as.Date(df$date1) - as.Date(df$date2)` to get the 'diff' value, which takes under a second. So not sure why this mutate takes so long... – TimL Sep 05 '19 at 14:31
  • There is absolutely no chance that an `if_else` on 4MM rows will run 15+ hours (or even seconds), not even on 4B rows. You have some software/hardware issue. Maybe you need to restart, or buy a new computer. The only other thing I can think of is if you forgot to `ungroup()` or such. I don't use dplyr myself, but I remember one of my co-workers had issues with that. – David Arenburg Sep 05 '19 at 14:52
  • David's idea is a good one: perhaps the actual data has been `group_by`'d at some point, possibly at every row. Does putting `ungroup %>%` before the `mutate` help? – Hugh Sep 05 '19 at 14:58
  • @DavidArenburg that's it! Might have been blindingly obvious, but if I add an `ungroup()` to the pipe then it works fine. I don't think I'd read anywhere that 'grouped' variables stay grouped within the dataframe until they are 'ungrouped', but thanks! – TimL Sep 05 '19 at 14:58
  • @Hugh and thanks to you for pursuing this line of enquiry! – TimL Sep 05 '19 at 14:59
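
As the comments above found, the slowdown came from a data frame that was still grouped. Below is a minimal sketch of that effect; the `id` grouping column and the data are invented for illustration and are not taken from the asker's dataset.

library(dplyr)

## Invented example: one group per row is the worst case for a grouped mutate
df_demo <- data.frame(id     = seq_len(1e6),
                      diff   = rep_len(c(35, 432, -278, 6556, 90, 10, 76, -24, 6), 1e6),
                      date_a = as.Date(rep_len(c("2017-03-21", NA, "2018-09-06"), 1e6)))

grouped <- df_demo %>% group_by(id)   # the grouping persists until ungroup() is called

## Slow: with 1m groups, the flag expression is evaluated once per group
## (left commented out so the sketch runs quickly)
# grouped %>% mutate(flag = if_else(diff > 21 | is.na(date_a), 1, 0))

## Fast: drop the grouping first so mutate() makes a single vectorised call
flagged <- grouped %>%
  ungroup() %>%
  mutate(flag = if_else(diff > 21 | is.na(date_a), 1, 0))

## group_vars() shows whether grouping is still attached
group_vars(grouped)   # "id"
group_vars(flagged)   # character(0)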

3 Answers


For reference, I cannot reproduce these timings even remotely. I suspect that something outside dplyr, and probably outside R, such as contention for the CPU or some other process in the operating system, is behind the slow speeds you encountered.

hh_ss <- function (form = "%H:%M:%S") {
  cat(format(Sys.time(), format = form), "\n")
}
hh_ss()
#> 23:36:22
df <- data.frame(diff = rep_len(c(35,432,-278,6556,90,10,76,-24,6), 4e6),
                 date_a = as.Date(c("2017-03-21","2017-01-08",NA,"2015-12-01",
                                    "2019-03-09",NA,"2018-09-06","2017-06-01",
                                    "2018-05-30", 
                                    "2018-05-30")))
hh_ss()
#> 23:36:22
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- df %>%
  mutate(flag = if_else(diff > 21 | is.na(date_a),1,0))
hh_ss()
#> 23:36:23

Created on 2019-09-05 by the reprex package (v0.3.0)

Adding 150 columns changes little:

hh_ss <- function (form = "%H:%M:%S") {
  cat(format(Sys.time(), format = form), "\n")
}
hh_ss()
#> 23:40:16
df <- data.frame(diff = rep_len(c(35,432,-278,6556,90,10,76,-24,6), 4e6),
                 date_a = as.Date(c("2017-03-21","2017-01-08",NA,"2015-12-01",
                                    "2019-03-09",NA,"2018-09-06","2017-06-01",
                                    "2018-05-30", 
                                    "2018-05-30")))

# Add 150 columns
for (j in paste0(c(letters, LETTERS)[1:50], 1:150)) {
  df[[j]] <- 0L
}


hh_ss()
#> 23:40:17
suppressPackageStartupMessages(library(dplyr))
df <- df %>%
  mutate(flag = if_else(diff > 21 | is.na(date_a),1,0))
hh_ss()
#> 23:40:18

Created on 2019-09-05 by the reprex package (v0.3.0)
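
As an aside (not part of the original answer), if you want a single elapsed-time figure rather than wall-clock stamps, base R's system.time() works here too:

## Times one evaluation of the pipeline; on the 4M-row frame built above,
## the elapsed time should come out well under a second on typical hardware.
system.time(
  df <- df %>%
    mutate(flag = if_else(diff > 21 | is.na(date_a), 1, 0))
)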

Hugh
  • Might want to use [microbenchmark package](https://cran.r-project.org/web/packages/microbenchmark/). – zx8754 Sep 05 '19 at 13:53
  • microbenchmark is used for code over very short times (i.e. < 1s). The fact that we would consider it basically proves that the code would not take an hour! – Hugh Sep 05 '19 at 13:58

You can use `as.numeric` to convert a logical (boolean) vector to 0s and 1s:

df <- df %>%
    mutate(flag = as.numeric(diff > 21 | is.na(date_a)))
Clemsang

For this particular problem you can use the fact that, when coerced to numeric, the boolean values map onto exactly the values your if_else statement is generating.

df <- df %>%
    mutate(flag = as.numeric(diff > 21 | is.na(date_a)))

if_else and ifelse, while vectorised, do need to create all of true/yes and all of false/no before subsetting between them, and this can be expensive if they are large, especially relative to your available memory.
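
To see the difference for yourself, here is a rough timing sketch (it assumes the microbenchmark package is installed; the test vectors are made up stand-ins for your columns):

library(microbenchmark)

x <- runif(4e6) * 100                                         # stand-in for diff
d <- sample(c(Sys.Date(), as.Date(NA)), 4e6, replace = TRUE)  # stand-in for date_a

microbenchmark(
  ifelse     = ifelse(x > 21 | is.na(d), 1, 0),
  as.numeric = as.numeric(x > 21 | is.na(d)),
  times = 10
)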

If you want something other than 0 and 1 you can use the following trick:

c("no","yes")[as.numeric(diff > 21 | is.na(date_a))+1]
James
  • I wouldn't call it a hack—it fits into the definition of a boolean – camille Sep 05 '19 at 13:27
  • @camille Yes, it was the application of it that was the hack, not the property. Changed to "fact", thanks for pointing out. – James Sep 05 '19 at 13:33