Indexing data frame takes too long

Question

I've got some code looking like this:

library(dplyr)    # tibble(), rowwise(), mutate()
library(stringi)  # stri_rand_strings()

df_values <- data.frame(value = stri_rand_strings(n = 500,
                                                  length = 30))

df_keys <- tibble(key = sample(x = 1:500,
                               size = 25000,
                               replace = TRUE))

# start timer
start_time <- Sys.time()

df_keys |>
 rowwise() |>
 mutate(value = df_values$value[key])

# end timer
end_time <- Sys.time()

end_time - start_time

My real code has the same structure but takes a very long time to run, and I can't figure out why. The example above needs only 0.3003931 seconds. For my real code, I subset the tibble with head(n) and got the following times:

n	time in secs
50	1.993536
100	3.731
200	6.550074
300	9.500864
500	15.68515
1,000	32.19306
...	seems to be linear
20,000	maybe 10 minutes

Does anyone have an idea what could be wrong with my code? I suspect the indexing part, df_values$value[key], but my original df_values is also a data.frame with 500 observations.

`rowwise()` is slow! It loops through and processes one row at a time, so this is basically a giant for loop over rows. — Adam, Jun 15 '22 at 18:29
@Adam ok, but why does my MRE work fine? Or better: what can I do to speed up my real code? — user1, Jun 15 '22 at 18:35
If you need to index by rows, you are much better off working with matrices than dataframes or tibbles. Those are good for indexing by columns, but not by rows. The disadvantage is that all entries need to be the same type, you can't mix strings with numbers, etc. — user2554330, Jun 15 '22 at 18:37
A likely reason your real data is so much slower is that you have more than one column. — user2554330, Jun 15 '22 at 18:38
@user2554330 I got mixed types, but thanks for your comment! — user1, Jun 15 '22 at 18:40
Just remove the `rowwise()`. You don't need it for this operation, unless I am missing something. I just tried the reprex, and the results are identical with and without it. — Adam, Jun 15 '22 at 18:40
@Adam I need `rowwise()` later on in my mutate. But a solution would be to use two `mutate`s and apply `rowwise()` between them — user1, Jun 15 '22 at 18:45
Yeah, that's right. Sometimes you need to switch it on and off with `rowwise()` and `ungroup()` for computational efficiency. You nailed it, you just have to break it up. — Adam, Jun 15 '22 at 18:47
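
The pattern agreed on in the comments above can be sketched roughly as follows. This is a minimal sketch, assuming that only a later step really needs row-by-row evaluation; the toy data, the column name `n_chars_plus_key`, and the computation inside the second `mutate()` are hypothetical stand-ins, not the asker's real code.

```r
library(dplyr)

# Small stand-in data (hypothetical; the real data has more rows and columns)
df_values <- data.frame(value = c("alpha", "beta", "gamma"))
df_keys   <- tibble(key = c(3L, 1L, 2L, 3L))

result <- df_keys |>
  # Vectorised lookup: a single indexing call for the whole column
  mutate(value = df_values$value[key]) |>
  # Switch to row-by-row evaluation only for the step that needs it
  rowwise() |>
  # Hypothetical stand-in for a computation that must run per row
  mutate(n_chars_plus_key = sum(nchar(value), key)) |>
  # Drop the rowwise grouping so that later verbs are vectorised again
  ungroup()
```

Keeping the lookup in its own `mutate()` before `rowwise()` means the indexing happens once for the whole column instead of once per row.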

PaulS · Accepted Answer · 2022-06-15T18:48:42.920

A possible solution in base R. As the timings below show, it takes only about 1% of the time of your dplyr approach (0.0022 seconds versus 0.19 seconds). Even with rowwise() removed, the base R approach is considerably faster.

library(tidyverse)
library(stringi)

# df_values and df_keys are created as in the question

# start timer
start_time <- Sys.time()

df_keys |>
  rowwise() |>
  mutate(value = df_values$value[key])
#> # A tibble: 25,000 × 2
#> # Rowwise: 
#>      key value                         
#>    <int> <chr>                         
#>  1   287 BeFLZsuRxlKJAJLgOnH1SO2f6kjpPH
#>  2   292 yG1JoxKRzSDnBlk4fJKDcKwzAUGwOy
#>  3   334 38pJ1h3RaTTSDgcf7gyCuW2NqFyncZ
#>  4   120 LqqCmTiMQV50hV0c0yYzk94AtpV7I6
#>  5   233 62BsX6NAEQqYx5wjm5ienCYgDmvJDb
#>  6   413 OB2MqTt1SOTb3irKlLEBtr4MfvuWW5
#>  7   123 4IKKUTli7c1l8GwU8TTpWHLHirGCy8
#>  8   400 aDnB9PwIKQkdfAW5kwzM215vU9aCNk
#>  9   214 aOsJkVENbncaHESiU2rwmfXqY5yVsK
#> 10   332 v4DfYVOr9kedtIwnWFlefDfFhHJ25R
#> # … with 24,990 more rows

# end timer
end_time <- Sys.time()

end_time - start_time

#> Time difference of 0.1876147 secs

start_time <- Sys.time()
df_keys$value <- df_values$value[df_keys$key]
end_time <- Sys.time()

end_time - start_time

#> Time difference of 0.002212286 secs
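
To check the comparison on your own machine, the three variants can be timed side by side. The following is a rough sketch, assuming the same simulated data as in the question; the fixed seed and the `res_*` and `t_*` names are my own additions, and the absolute timings will differ between machines.

```r
library(dplyr)
library(stringi)

set.seed(1)  # assumption: fixed seed so that the run is repeatable
df_values <- data.frame(value = stri_rand_strings(n = 500, length = 30))
df_keys   <- tibble(key = sample(x = 1:500, size = 25000, replace = TRUE))

# 1. dplyr with rowwise(): the lookup is evaluated once per row
t_rowwise <- system.time(
  res_rowwise <- df_keys |>
    rowwise() |>
    mutate(value = df_values$value[key]) |>
    ungroup()
)

# 2. dplyr without rowwise(): one vectorised lookup
t_mutate <- system.time(
  res_mutate <- df_keys |>
    mutate(value = df_values$value[key])
)

# 3. base R: one vectorised lookup, no dplyr overhead
t_base <- system.time({
  res_base <- df_keys
  res_base$value <- df_values$value[df_keys$key]
})

# All three produce the same column; only the run time differs
stopifnot(identical(res_rowwise$value, res_mutate$value),
          identical(res_mutate$value, res_base$value))

# Elapsed seconds per variant
c(rowwise = t_rowwise[["elapsed"]],
  mutate  = t_mutate[["elapsed"]],
  base    = t_base[["elapsed"]])
```

For more stable numbers than a single `system.time()` call gives, repeat each variant several times and compare the medians.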

edited Jun 15 '22 at 18:48

answered Jun 15 '22 at 18:35


`rowwise()` is indeed the problem. Using base R, my original data takes only 0.0856111 seconds. – user1 Jun 15 '22 at 18:41
Yes, @user1: `rowwise` is a problem, but even with `rowwise` removed, the `base R` approach is much faster. – PaulS Jun 15 '22 at 18:44
Is it possible to do this within a pipe? – user1 Jun 15 '22 at 18:47
With a pipe, the approach will no longer be a `base R` one, I guess -- and the speed will be lower. – PaulS Jun 15 '22 at 18:51
Isn't `|>` the base R pipe? https://stackoverflow.com/questions/65329335/how-to-pipe-purely-in-base-r-base-pipe – user1 Jun 15 '22 at 18:53
@user1 `df_keys <- df_keys |> transform(value = df_values$value[df_keys$key])`. You will lose some speed with `transform()`. But `|>` itself shouldn't cost you any speed, because the parser rewrites it into an ordinary function call. Really though, `mutate()` should be fast enough: just `ungroup()` before that line and then add `rowwise()` again afterwards if you still need it. – Adam Jun 15 '22 at 18:57
@Adam is right: there is a decrease in speed, but only a small one! – PaulS Jun 15 '22 at 19:02
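
As a quick check of the piped variant from the comments, the sketch below compares it with the direct assignment from the answer. It assumes R 4.1 or later for `|>`; the small data frames are hypothetical stand-ins for the real data.

```r
# Small stand-in data (hypothetical)
df_values <- data.frame(value = c("alpha", "beta", "gamma"))
df_keys   <- data.frame(key = c(3L, 1L, 2L, 3L))

# Direct assignment, as in the answer
direct <- df_keys
direct$value <- df_values$value[direct$key]

# Same lookup inside a base R pipe; within transform(), `key`
# refers to the column of the piped-in data frame
piped <- df_keys |>
  transform(value = df_values$value[key])

# Both approaches yield the same looked-up column
stopifnot(identical(direct$value, piped$value))
```

Note that `transform()` returns a plain data.frame, so a tibble passed through it may lose its tibble class.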
