
I'm very new to R, but I have a relatively simple mutate command that I'm running on a large dataset (4,000,000+ rows x 150 columns), and it currently takes 15+ hours to run.

The query uses an if_else statement that only looks at a couple of columns - a numeric variable and a date (NB: some of the dates are NA) - and returns a binary flag. It works fine on a small example but seems very inefficient on anything past 1m rows.

I'm using R v3.5, RStudio 1.2.1335 & dplyr 0.8.0.1

## create dummy dataframe
df <- data.frame(diff = c(35,432,-278,6556,90,10,76,-24,6),
      date_a = as.Date(c("2017-03-21","2017-01-08",NA,"2015-12-01",
                    "2019-03-09",NA,"2018-09-06","2017-06-01",
                    "2018-05-30")))
library(dplyr)
df <- df %>%
    mutate(flag = if_else(diff > 21 | is.na(date_a),1,0))

When I chunk the data down to a smaller size, the process takes:

  • 20k rows = 1.9 secs
  • 200k rows = 1.7 mins
  • 500k rows = 10 mins
  • 1m rows = 34 mins

And anything bigger than that takes well beyond what I'm prepared to wait for this...

TimL
  • One quick suggestion would be to try using `dtplyr`, which uses a `data.table` backend – erocoar Sep 05 '19 at 13:03
  • or just use `data.table` directly... – Cath Sep 05 '19 at 13:06
  • Things to try: (1) `+(diff > 21 | is.na(date_a))`. (2) data.table directly, not using any other packages -- examples I have seen indicate that using data.table as a lower layer does not run as fast as if you use it directly, and (3) note that data.table has an `IDate` class. – G. Grothendieck Sep 05 '19 at 13:06
  • @G.Grothendieck worth submitting as an answer? – zx8754 Sep 05 '19 at 13:12
  • Try `as.numeric(diff > 21 | is.na(date_a))` – Clemsang Sep 05 '19 at 13:13
  • @Clemsang it's basically what G. Grothendieck put in his comment, though `as.numeric` is more "acceptable" than `+` to convert logical to 0/1 – Cath Sep 05 '19 at 13:20
  • @Cath `as.numeric` is more efficient as you don't have to sum and convert – Clemsang Sep 05 '19 at 13:21
  • @Clemsang not sure sum is used when there is only one item – Cath Sep 05 '19 at 13:25
  • No way this is a dplyr vs data.table issue. On my machine (nothing special) with a 4M data frame like the one you described, the code took <0.5s to complete. Are you sure the R computation takes 15 hours? What sort of computer are you using? – Hugh Sep 05 '19 at 13:31
  • @Hugh If you have the example data and code already, it would be nice to provide a benchmarking answer as a wiki post. – zx8754 Sep 05 '19 at 13:33
  • TimL, these examples should demonstrate how using `+` instead of `ifelse` converts it to numeric: `as.numeric(TRUE)`, `+(TRUE)`, `as.numeric(10 == 10)`, `+(10 == 10)`. – zx8754 Sep 05 '19 at 13:38
  • @zx8754 Thanks a lot for that, as I say I'm fairly new to this and scrabbling around a bit. Am running the suggested answers and will feedback – TimL Sep 05 '19 at 13:48
  • @Hugh I don't think any of the processes I'm running in situ would be that quick! Bear in mind it's 4m x 150 columns if that impacts at all, and those columns are in all sorts of formats. – TimL Sep 05 '19 at 13:48
  • When you say 'processes' (plural), what do you mean? Are you running multiple scripts in parallel? – Hugh Sep 05 '19 at 13:50
  • @Hugh No - I just mean the consecutive commands to get to the point where I run the query in my initial post. So, for example, the previous command I run is a subtraction of one date from another `df$diff <- as.Date(df$date1) - as.Date(df$date2)` to get the 'diff' value, which takes under a second. So not sure why this mutate takes so long... – TimL Sep 05 '19 at 14:31
  • There is absolutely no chance that an `if_else` on 4MM rows will run 15+ hours (or even seconds), not even on 4B rows. You have some software/hardware issue. Maybe you need to restart, or buy a new computer. The only other thing I can think of is if you forgot to `ungroup()` or such. I don't use dplyr myself, but I remember one of my co-workers had issues with that. – David Arenburg Sep 05 '19 at 14:52
  • David's idea is a good one: perhaps the actual data has been `group_by`'d at some point, possibly at every row. Does putting `ungroup %>%` before the `mutate` help? – Hugh Sep 05 '19 at 14:58
  • @DavidArenburg that's it! Might have been blindingly obvious, but if I add an `ungroup()` to the pipe then it works fine. I don't think I'd read anywhere that 'grouped' variables stay grouped within the dataframe until they are 'ungrouped', but thanks! – TimL Sep 05 '19 at 14:58
  • @Hugh and thanks to you for pursuing this line of enquiry! – TimL Sep 05 '19 at 14:59
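
As the comments above found, the slowdown came from a data frame that was still grouped. Below is a minimal sketch of that effect; the `id` grouping column and the data are invented for illustration and are not taken from the asker's dataset.

library(dplyr)

## Invented example: one group per row is the worst case for a grouped mutate
df_demo <- data.frame(id     = seq_len(1e6),
                      diff   = rep_len(c(35, 432, -278, 6556, 90, 10, 76, -24, 6), 1e6),
                      date_a = as.Date(rep_len(c("2017-03-21", NA, "2018-09-06"), 1e6)))

grouped <- df_demo %>% group_by(id)   # the grouping persists until ungroup() is called

## Slow: with 1m groups, the flag expression is evaluated once per group
## (left commented out so the sketch runs quickly)
# grouped %>% mutate(flag = if_else(diff > 21 | is.na(date_a), 1, 0))

## Fast: drop the grouping first so mutate() makes a single vectorised call
flagged <- grouped %>%
  ungroup() %>%
  mutate(flag = if_else(diff > 21 | is.na(date_a), 1, 0))

## group_vars() shows whether grouping is still attached
group_vars(grouped)   # "id"
group_vars(flagged)   # character(0)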

3 Answers


For reference, I cannot reproduce these timings even remotely. I suspect that something outside dplyr, and probably outside R, such as contention for the CPU or some other process in the operating system, is behind the slow speeds you encountered.

hh_ss <- function (form = "%H:%M:%S") {
  cat(format(Sys.time(), format = form), "\n")
}
hh_ss()
#> 23:36:22
df <- data.frame(diff = rep_len(c(35,432,-278,6556,90,10,76,-24,6), 4e6),
                 date_a = as.Date(c("2017-03-21","2017-01-08",NA,"2015-12-01",
                                    "2019-03-09",NA,"2018-09-06","2017-06-01",
                                    "2018-05-30", 
                                    "2018-05-30")))
hh_ss()
#> 23:36:22
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- df %>%
  mutate(flag = if_else(diff > 21 | is.na(date_a),1,0))
hh_ss()
#> 23:36:23

Created on 2019-09-05 by the reprex package (v0.3.0)

Adding 150 columns changes little:

hh_ss <- function (form = "%H:%M:%S") {
  cat(format(Sys.time(), format = form), "\n")
}
hh_ss()
#> 23:40:16
df <- data.frame(diff = rep_len(c(35,432,-278,6556,90,10,76,-24,6), 4e6),
                 date_a = as.Date(c("2017-03-21","2017-01-08",NA,"2015-12-01",
                                    "2019-03-09",NA,"2018-09-06","2017-06-01",
                                    "2018-05-30", 
                                    "2018-05-30")))

# Add 150 columns
for (j in paste0(c(letters, LETTERS)[1:50], 1:150)) {
  df[[j]] <- 0L
}


hh_ss()
#> 23:40:17
suppressPackageStartupMessages(library(dplyr))
df <- df %>%
  mutate(flag = if_else(diff > 21 | is.na(date_a),1,0))
hh_ss()
#> 23:40:18

Created on 2019-09-05 by the reprex package (v0.3.0)
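
As an aside (not part of the original answer), if you want a single elapsed-time figure rather than wall-clock stamps, base R's system.time() works here too:

## Times one evaluation of the pipeline; on the 4M-row frame built above,
## the elapsed time should come out well under a second on typical hardware.
system.time(
  df <- df %>%
    mutate(flag = if_else(diff > 21 | is.na(date_a), 1, 0))
)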

Hugh
  • Might want to use [microbenchmark package](https://cran.r-project.org/web/packages/microbenchmark/). – zx8754 Sep 05 '19 at 13:53
  • microbenchmark is used for code over very short times (i.e. < 1s). The fact that we would consider it basically proves that the code would not take an hour! – Hugh Sep 05 '19 at 13:58

You can use `as.numeric` to convert a logical (boolean) vector to 0s and 1s:

df <- df %>%
    mutate(flag = as.numeric(diff > 21 | is.na(date_a)))
Clemsang

For this particular problem you can use the fact that, when coerced to numeric, the boolean values map onto exactly the values your if_else statement is generating.

df <- df %>%
    mutate(flag = as.numeric(diff > 21 | is.na(date_a)))

if_else and ifelse, while vectorised, do need to create all of true/yes and all of false/no before subsetting between them, and this can be expensive if they are large, especially relative to your available memory.
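
To see the difference for yourself, here is a rough timing sketch (it assumes the microbenchmark package is installed; the test vectors are made up stand-ins for your columns):

library(microbenchmark)

x <- runif(4e6) * 100                                         # stand-in for diff
d <- sample(c(Sys.Date(), as.Date(NA)), 4e6, replace = TRUE)  # stand-in for date_a

microbenchmark(
  ifelse     = ifelse(x > 21 | is.na(d), 1, 0),
  as.numeric = as.numeric(x > 21 | is.na(d)),
  times = 10
)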

If you want something other than 0 and 1 you can use the following trick:

c("no","yes")[as.numeric(diff > 21 | is.na(date_a))+1]
James
  • I wouldn't call it a hack—it fits into the definition of a boolean – camille Sep 05 '19 at 13:27
  • @camille Yes, it was the application of it that was the hack, not the property. Changed to "fact", thanks for pointing out. – James Sep 05 '19 at 13:33