I want to speed up a function for creating a pairwise matrix that describes the number of times an object is selected before and after all other objects, within a set of locations.
Here is an example df
:
df <- data.frame(Shop = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
Fruit = c("apple", "orange", "pear",
"orange", "pear",
"pear", "apple",
"pear", "apple", "orange",
"pear", "apple", "orange"),
Order = c(1, 2, 3,
1, 2,
1, 2,
1, 2, 3,
1, 1, 1))
In each Shop
, Fruit
is picked by a customer in a given Order
.
The following function creates an m x n
pairwise matrix:
loop.function <- function(df){
fruits <- unique(df$Fruit)
nt <- length(fruits)
mat <- array(dim=c(nt,nt))
for(m in 1:nt){
for(n in 1:nt){
## filter df for each pair of fruit
xm <- df[df$Fruit == fruits[m],]
xn <- df[df$Fruit == fruits[n],]
## index instances when a pair of fruit are picked in same shop
mm <- match(xm$Shop, xn$Shop)
## filter xm and xn based on mm
xm <- xm[! is.na(mm),]
xn <- xn[mm[! is.na(mm)],]
## assign number of times fruit[m] is picked after fruit[n] to mat[m,n]
mat[m,n] <- sum(xn$Order < xm$Order)
}
}
row.names(mat) <- fruits
colnames(mat) <- fruits
return(mat)
}
Where mat[m,n]
is the number of times fruits[m]
is picked after fruits[n]
. And mat[n,m]
is the number of times fruits[m]
is picked before fruits[n]
. It is not recorded if pairs of fruit are picked at the same time (e.g. in Shop
E
).
See expected output:
>loop.function(df)
apple orange pear
apple 0 0 2
orange 2 0 1
pear 1 2 0
You can see here that pear
is chosen twice before apple
(in Shop
C
and D
), and apple
is chosen once before pear
(in Shop
A
).
I am trying to improve my knowledge of vectorization, especially in place of loops, so I want to know how this loop can be vectorized.
(I have a feeling there may be a solution using outer()
, but my knowledge of vectorizing functions is still very limited.)
Update
See benchmarking with real data times = 10000
for loop.function()
, tidyverse.function()
, loop.function2()
, datatable.function()
and loop.function.TMS()
:
Unit: milliseconds
expr min lq mean median uq max neval cld
loop.function(dat) 186.588600 202.78350 225.724249 215.56575 234.035750 999.8234 10000 e
tidyverse.function(dat) 21.523400 22.93695 26.795815 23.67290 26.862700 295.7456 10000 c
loop.function2(dat) 119.695400 126.48825 142.568758 135.23555 148.876100 929.0066 10000 d
datatable.function(dat) 8.517600 9.28085 10.644163 9.97835 10.766749 215.3245 10000 b
loop.function.TMS(dat) 4.482001 5.08030 5.916408 5.38215 5.833699 77.1935 10000 a
Probably the most interesting result for me is the performance of tidyverse.function()
on the real data. I will have to try add Rccp
solutions at a later date - I'm having trouble making them work on the real data.
I appreciate all the interest and answers given to this post - my intention was to learn and improve performance, and there is certainly a lot to learn from all the comments and solutions given. Thanks!