
I want to speed up a function for creating a pairwise matrix that describes the number of times an object is selected before and after all other objects, within a set of locations.

Here is an example df:

  df <- data.frame(Shop = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
                   Fruit = c("apple", "orange", "pear",
                             "orange", "pear",
                             "pear", "apple",
                             "pear", "apple", "orange",
                             "pear", "apple", "orange"),
                   Order = c(1, 2, 3,
                            1, 2,
                            1, 2, 
                            1, 2, 3,
                            1, 1, 1))

In each Shop, Fruit is picked by a customer in a given Order.

The following function creates an m x n pairwise matrix:

loop.function <- function(df){
  
  fruits <- unique(df$Fruit)
  nt <- length(fruits)
  mat <- array(dim=c(nt,nt))
  
  for(m in 1:nt){
    
    for(n in 1:nt){
      
      ## filter df for each pair of fruit
      xm <- df[df$Fruit == fruits[m],]
      xn <- df[df$Fruit == fruits[n],]
      
      ## index instances when a pair of fruit are picked in same shop
      mm <- match(xm$Shop, xn$Shop)
      
      ## filter xm and xn based on mm
      xm <- xm[! is.na(mm),]
      xn <- xn[mm[! is.na(mm)],]
      
      ## assign number of times fruit[m] is picked after fruit[n] to mat[m,n]
      mat[m,n] <- sum(xn$Order < xm$Order)
    }
  }
  
  row.names(mat) <- fruits
  colnames(mat) <- fruits
  
  return(mat)
}

Here mat[m,n] is the number of times fruits[m] is picked after fruits[n], and mat[n,m] is the number of times fruits[m] is picked before fruits[n]. Pairs of fruit picked at the same time (e.g. in Shop E) are not recorded.

See expected output:

> loop.function(df)
       apple orange pear
apple      0      0    2
orange     2      0    1
pear       1      2    0

You can see here that pear is chosen twice before apple (in Shop C and D), and apple is chosen once before pear (in Shop A).
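
Reading the matrix by name makes this convention explicit (a quick check on the example output):

mat <- loop.function(df)
mat["apple", "pear"]   # 2: apple is picked after pear (Shops C and D)
mat["pear", "apple"]   # 1: pear is picked after apple (Shop A)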

I am trying to improve my knowledge of vectorization, especially in place of loops, so I want to know how this loop can be vectorized.

(I have a feeling there may be a solution using outer(), but my knowledge of vectorizing functions is still very limited.)

Update

See benchmarking with real data, times = 10000, for loop.function(), tidyverse.function(), loop.function2(), datatable.function() and loop.function.TMS():

Unit: milliseconds
                    expr        min        lq       mean    median         uq      max neval   cld
      loop.function(dat) 186.588600 202.78350 225.724249 215.56575 234.035750 999.8234 10000     e
 tidyverse.function(dat)  21.523400  22.93695  26.795815  23.67290  26.862700 295.7456 10000   c
     loop.function2(dat) 119.695400 126.48825 142.568758 135.23555 148.876100 929.0066 10000    d
 datatable.function(dat)   8.517600   9.28085  10.644163   9.97835  10.766749 215.3245 10000  b
  loop.function.TMS(dat)   4.482001   5.08030   5.916408   5.38215   5.833699  77.1935 10000 a
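
For reference, a minimal sketch of the benchmarking call behind this table (assuming dat holds the real dataset and the five functions are defined as in the answers below):

library(microbenchmark)
microbenchmark(
  loop.function(dat),
  tidyverse.function(dat),
  loop.function2(dat),
  datatable.function(dat),
  loop.function.TMS(dat),
  times = 10000
)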

Probably the most interesting result for me is the performance of tidyverse.function() on the real data. I will have to try adding the Rcpp solutions at a later date - I'm having trouble making them work on the real data.

I appreciate all the interest and answers given to this post - my intention was to learn and improve performance, and there is certainly a lot to learn from all the comments and solutions given. Thanks!

jayb
  • hi, what are the dimensions of your actual dataset and how many times will you be calling this function? – chinsoon12 Jul 08 '20 at 23:18
  • and also should Order be 1,2,3 instead of 1,1,1 for Shop E? – chinsoon12 Jul 08 '20 at 23:36
  • On size: typically, df might contain ~15 fruits ordered in ~100 shops. It is called ~1K times in a single run; however, with bootstrapping there are 10k runs. On Shop E: no, this was not a mistake, I wanted the example to include a case where all fruits were picked simultaneously, as it is important that the function ignores these cases – jayb Jul 09 '20 at 09:26
  • @chinsoon12 There is some similarity with this question, but the ordering in my problem adds an extra layer of complexity: – jayb Jul 09 '20 at 13:53
  • Is it safe to assume that the shops are always sorted? If not, is it safe to sort them? – Cole Jul 11 '20 at 18:27
  • @jayb Thanks for posting a small toy data set for people to try out their code on. However, because your question is about speed and performance, can you please also provide data set(s) of size and complexity relevant for benchmarking in your question. Without such data it will be hard/impossible to evaluate the answers. Also describe the improvements you expect. Thank you. – Henrik Jul 11 '20 at 19:04
  • As loops perform well, I added an Rcpp solution: very nice performance – Waldi Jul 12 '20 at 19:22
  • Will the order ever be e.g., `c(1, 1, 2, 3)` or will it always be `c(1, 1, 1)` or in sequence? – Andrew Jul 14 '20 at 13:40
  • @jayb, your question generated many answers, could you update the benchmarking on real dataset? Thanks for your feedback – Waldi Jul 15 '20 at 23:17
  • @jayb I was working on a problem which reminded of your question: I think the `arulesSequence` package may be relevant to you. See e.g. this tutorial: [Sequential Pattern Mining in R](https://blog.revolutionanalytics.com/2019/02/sequential-pattern-mining-in-r.html) – Henrik Aug 20 '20 at 09:10

4 Answers


A data.table solution :

library(data.table)
setDT(df)
setkey(df, Shop)
dcast(df[df, on = .(Shop = Shop), allow.cartesian = TRUE][
        , .(cnt = sum(i.Order < Order & i.Fruit != Fruit)), by = .(Fruit, i.Fruit)],
      Fruit ~ i.Fruit, value.var = 'cnt')

    Fruit apple orange pear
1:  apple     0      0    2
2: orange     2      0    1
3:   pear     1      2    0

The Shop index isn't necessary for this example, but will probably improve performance on a larger dataset.
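
For benchmarking against the other answers, the three lines above can be wrapped in a function; the datatable.function() used in the question's benchmark is assumed to look something like this:

datatable.function <- function(df){
  dt <- as.data.table(df)
  setkey(dt, Shop)
  dcast(dt[dt, on = .(Shop = Shop), allow.cartesian = TRUE][
          , .(cnt = sum(i.Order < Order & i.Fruit != Fruit)), by = .(Fruit, i.Fruit)],
        Fruit ~ i.Fruit, value.var = 'cnt')
}

As discussed in the comments below, the result is a data.table rather than a matrix; it can be converted with as.matrix() if matrix-style indexing is needed.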

As the question raised many comments on performance, I decided to check what Rcpp could bring:

library(Rcpp)
cppFunction('NumericMatrix rcppPair(DataFrame df) {

std::vector<std::string> Shop = Rcpp::as<std::vector<std::string> >(df["Shop"]);
Rcpp::NumericVector Order = df["Order"];
Rcpp::StringVector Fruit = df["Fruit"];
StringVector FruitLevels = sort_unique(Fruit);
IntegerVector FruitInt = match(Fruit, FruitLevels);
int n  = FruitLevels.length();

std::string currentShop = "";
int order, fruit, i, f;

NumericMatrix result(n,n);
NumericVector fruitOrder(n);

for (i = 0; i < Fruit.length(); i++){
    if (currentShop != Shop[i]) {
       // Init counters for each new shop
       currentShop = Shop[i];
       std::fill(fruitOrder.begin(), fruitOrder.end(), 0);
    }
    order = Order[i];
    fruit = FruitInt[i];
    fruitOrder[fruit - 1] = order;
    for (f = 0; f < n; f++) {
       if (order > fruitOrder[f] && fruitOrder[f] > 0) {
          result(fruit - 1, f) = result(fruit - 1, f) + 1;
       }
    }
}
rownames(result) = FruitLevels;
colnames(result) = FruitLevels;
return(result);
}
')

rcppPair(df)

       apple orange pear
apple      0      0    2
orange     2      0    1
pear       1      2    0

On the example dataset, this runs >500 times faster than the data.table solution, probably because it avoids the cartesian product problem. It isn't meant to be robust against wrong input, and it expects shops / orders to be sorted in ascending order.
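
A minimal sketch of that comparison on the example data (assuming the data.table code is wrapped in a function such as the datatable.function() sketched above):

library(microbenchmark)
microbenchmark(
  rcppPair(df),
  datatable.function(df),
  times = 100
)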

Considering the few minutes spent finding the 3 lines of code for the data.table solution, compared to the much longer Rcpp writing / debugging process, I wouldn't recommend going for Rcpp here unless there's a real performance bottleneck.

It is, however, interesting to remember that if performance is a must, Rcpp might be worth the effort.

Waldi
  • Code-golf (which eventually equates into speed): I don't know that `&i.Fruit!=Fruit` is changing anything with this sample data. Are you certain it's necessary logically? – r2evans Jul 10 '20 at 22:24
  • @r2evans, I thought this would be useful not to count for example an apple taken first and then third, with an orange in between. From the description I understand that we only want to count different fruits. – Waldi Jul 10 '20 at 22:30
  • I was thinking that, too, and that's definitely the safer approach. If the OP certifies that the input is already unique per-`Shop` then I think my suggestion holds. (It's not hard to ensure that before `matrix`ification.) – r2evans Jul 10 '20 at 22:38
  • FYI, on my machine, your solution (functionized) takes just over half the time of the `loop.function` (since OP wants to *"speed up a function"*). – r2evans Jul 10 '20 at 22:41
  • @r2evans, thanks for the benchmarking. It will be interesting to see the results with the real dataset (100 shops / 15 fruits) – Waldi Jul 10 '20 at 22:51
  • Maybe add a non-equi condition to the first join, to include only relevant 'Order' combinations (and avoid the `allow.cartesian` "explosion"). Then calculations in the second step can be simplified to count rows. `d = df[df, on = .(Shop, Order < Order), .(before = Fruit, after = i.Fruit)]`; `dcast(d[!is.na(before), .N, by = .(before, after)], after ~ before, value.var = "N", fill = 0)`. – Henrik Jul 10 '20 at 23:21
  • @Henrik, thanks for your suggestion. I get your point, and it also made me think about it. I came to the conclusion that the cartesian explosion would anyway be limited by the Shop join. In terms of microbenchmarking, for the example dataset, your solution is a bit slower, but this of course has to be tested on a real dataset where its advantages should be more visible. – Waldi Jul 11 '20 at 06:54
  • With @Henriks suggestion, you can get the same output with ```table(i.Fruit, Fruit)``` as the ```j``` for the non-equi approach. So ```df[df, on = .(Shop, Order < Order), table(i.Fruit, Fruit), allow.cartesian = T, nomatch = 0L]```. I am very interested in the real-world performance of your answer, too - benchmarking in milliseconds is rarely telling. – Cole Jul 11 '20 at 12:45
  • @Waldi Thanks very much for this solution - I'm not familiar with data.table, but probably I should be, given the performance. One problem in using this solution in my real algorithm though is that the output is a list, which I can't index in the same way as a matrix – jayb Jul 16 '20 at 16:24
  • @jayb, if you're looking for [performance](https://h2oai.github.io/db-benchmark/), data.table is worth a try ;). conversion to matrix is as simple as : `as.matrix(dt)` – Waldi Jul 16 '20 at 17:14

Here is an approach with a few simple modifications that make it about 5x faster.

loop.function2 <- function(df){

    spl_df = split(df[, c(1L, 3L)], df[[2L]])
    
    mat <- array(0L,
                 dim=c(length(spl_df), length(spl_df)),
                 dimnames = list(names(spl_df), names(spl_df)))
    
    for (m in 1:(length(spl_df) - 1L)) {
        xm = spl_df[[m]]
        mShop = xm$Shop
        for (n in ((1+m):length(spl_df))) {
            xn = spl_df[[n]]
            mm = match(mShop, xn$Shop)
            inds = which(!is.na(mm))
            mOrder = xm[inds, "Order"]
            nOrder = xn[mm[inds], "Order"]

            mat[m, n] <- sum(nOrder < mOrder)
            mat[n, m] <- sum(mOrder < nOrder)
        }
    }
    mat
}

There are 3 main concepts:

  1. The original df[df$Fruit == fruits[m], ] lines were inefficient, as you would be making the same comparison length(fruits)^2 times. Instead, we can use split(), which means we only scan Fruit once (see the sketch after this list).
  2. There was a lot of use of df$var, which extracts the vector during each loop iteration. Here, we place the assignment of xm outside of the inner loop and try to minimize what we need to subset / extract.
  3. I changed it to be closer to combn, as we can re-use our match() result by computing both sum(nOrder < mOrder) and sum(mOrder < nOrder) in the same pass.

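As a quick illustration of concept 1, split() partitions the data once into one small data frame per fruit, so each pair of fruits is only ever compared via these pre-filtered pieces:

spl_df <- split(df[, c(1L, 3L)], df[[2L]])
spl_df[["apple"]]
#    Shop Order
# 1     A     1
# 7     C     2
# 9     D     2
# 12    E     1
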
Performance:

bench::mark(loop.function(df), loop.function2(df))

# A tibble: 2 x 13
##  expression              min median
##  <bch:expr>         <bch:tm> <bch:>
##1 loop.function(df)    3.57ms 4.34ms
##2 loop.function2(df)  677.2us 858.6us

My hunch is that for your larger dataset, @Waldi's solution will be faster. But for smaller datasets, this should be pretty performant.

Finally, here's yet another approach that seems to be slower than @Waldi:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerMatrix loop_function_cpp(List x) {
    int x_size = x.size();
    IntegerMatrix ans(x_size, x_size);
    
    for (int m = 0; m < x_size - 1; m++) {
        DataFrame xm = x[m];
        CharacterVector mShop = xm[0];
        IntegerVector mOrder = xm[1];
        int nrows = mShop.size();
        for (int n = m + 1; n < x_size; n++) {
            DataFrame xn = x[n];
            CharacterVector nShop = xn[0];
            IntegerVector nOrder = xn[1];
            int nrows_n = nShop.size();      // xn may have a different number of rows than xm
            for (int i = 0; i < nrows; i++) {
                for (int j = 0; j < nrows_n; j++) {
                    if (mShop[i] == nShop[j]) {
                        if (mOrder[i] > nOrder[j])
                            ans(m, n)++;
                        else if (mOrder[i] < nOrder[j])
                            ans(n, m)++;
                        // ties (same Order) are ignored
                        break;
                    }
                }
            }
        }
    }
    return(ans);
}
loop_wrapper = function(df) {
  loop_function_cpp(split(df[, c(1L, 3L)], df[[2L]]))
}
loop_wrapper(df)
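
Note that the C++ chunk above needs to be compiled before loop_wrapper() can call it, e.g. with Rcpp::sourceCpp() (the file name below is only illustrative); it should then reproduce the expected counts, just without dimnames:

library(Rcpp)
sourceCpp("loop_function_cpp.cpp")  # file containing the C++ code above

loop_wrapper(df)
#      [,1] [,2] [,3]
# [1,]    0    0    2
# [2,]    2    0    1
# [3,]    1    2    0

The rows and columns follow the alphabetical fruit order produced by split() (apple, orange, pear); dimnames could be added from the names of that split list if needed.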
Cole
  • Thanks very much for the solutions - there's definitely a lot to learn in terms of speeding up the existing loop here, especially on only keeping absolutely necessary code within each loop. – jayb Jul 16 '20 at 16:14

It doesn't seem possible to vectorize over the original data frame df. But if you transform it using reshape2::dcast() to have one row per shop:

require(reshape2)

df$Fruit <- as.character(df$Fruit)

by_shop <- dcast(df, Shop ~ Fruit, value.var = "Order")

#   Shop apple orange pear
# 1    A     1      2    3
# 2    B    NA      1    2
# 3    C     2     NA    1
# 4    D     2      3    1
# 5    E     1      1    1

..., then you can easily vectorize, at least over each combination of [m, n]:

fruits <- unique(df$Fruit)
outer(fruits, fruits, 
    Vectorize(
        function (m, n, by_shop) sum(by_shop[,m] > by_shop[,n], na.rm = TRUE), 
        c("m", "n")
    ), 
    by_shop)
#      [,1] [,2] [,3]
# [1,]    0    0    2
# [2,]    2    0    1
# [3,]    1    2    0

This is probably the solution you had in mind with outer(). A much faster solution would be true vectorization over all combinations of fruits [m, n], but I've been thinking about it and I don't see any way to do it. So I had to use the Vectorize() function, which of course is much slower than true vectorization.
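
To see why this is not "true" vectorization: Vectorize() is essentially a wrapper around mapply(), so the outer() call above boils down to an R-level loop over all fruit pairs, roughly equivalent to this sketch:

grid <- expand.grid(m = fruits, n = fruits, stringsAsFactors = FALSE)
counts <- mapply(
  function(m, n) sum(by_shop[, m] > by_shop[, n], na.rm = TRUE),
  grid$m, grid$n
)
matrix(counts, nrow = length(fruits), dimnames = list(fruits, fruits))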

Benchmark comparison with your original function:

Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval
     loop.function(df) 3.788794 3.926851 4.157606 4.002502 4.090898 9.529923   100
 loop.function.TMS(df) 1.582858 1.625566 1.804140 1.670095 1.756671 8.569813   100

Function & benchmark code (also added the preservation of the dimnames):

require(reshape2)   
loop.function.TMS <- function(df) { 
    df$Fruit <- as.character(df$Fruit)
    by_shop <- dcast(df, Shop ~ Fruit, value.var = "Order")
    fruits <- unique(df$Fruit)
    o <- outer(fruits, fruits, Vectorize(function (m, n, by_shop) sum(by_shop[,m] > by_shop[,n], na.rm = TRUE), c("m", "n")), by_shop)
    colnames(o) <- rownames(o) <- fruits
    o
}

require(microbenchmark)
microbenchmark(loop.function(df), loop.function.TMS(df))
Tomas
  • Thanks for this - it is a really interesting use of `outer()` - I've never seen it used with `Vectorize()` in this way. – jayb Jul 16 '20 at 16:16

OK, here is a solution:

library(tidyverse)

# a dataframe with all fruit combinations
df_compare <-  expand.grid(row_fruit = unique(df$Fruit)
                           , column_fruit = unique(df$Fruit)
                           , stringsAsFactors = FALSE)

df_compare %>%
    left_join(df, by = c("row_fruit" = "Fruit")) %>%
    left_join(df, by = c("column_fruit" = "Fruit")) %>%
    filter(Shop.x == Shop.y &
               Order.x < Order.y) %>%
    group_by(row_fruit, column_fruit) %>%
    summarise(obs = n()) %>%
    pivot_wider(names_from = row_fruit, values_from = obs) %>%
    arrange(column_fruit) %>%
    mutate_if(is.numeric, function(x) replace_na(x, 0)) %>%
    column_to_rownames("column_fruit") %>%
    as.matrix()

       apple orange pear
apple      0      0    2
orange     2      0    1
pear       1      2    0

If you don't know what is going on in the second code part (df_compare %>% ...), read the "pipe" (%>%) as 'then'. Run the code from df_compare to just before any of the pipes to see the intermediate results.
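
For example, to look at the intermediate result just after the joins and the filter (cut the pipeline at whichever step you want to inspect):

df_compare %>%
    left_join(df, by = c("row_fruit" = "Fruit")) %>%
    left_join(df, by = c("column_fruit" = "Fruit")) %>%
    filter(Shop.x == Shop.y &
               Order.x < Order.y)
# one row per within-shop pair where row_fruit is picked before column_fruit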

Georgery
  • Thanks for the suggested solution. I should have clarified that it is important that the structure (and order) of the output is preserved according to the output of loop.function(). – jayb Jul 08 '20 at 15:03
  • I made some changes to your code to return the same output as `loop.function()`. Following `column_to_rownames("row_fruit") %>%` I added `select(all_of(unique(df$Fruit))) %>%`, and following `as.matrix()` I added `%>% replace_na(replace = 0)`. The code now returns the same output, but it does not improve speed - the reason I'm interested in vectorization is performance related. I've added benchmarking based on your code (with amendments). – jayb Jul 08 '20 at 16:04
  • So, I edited the answer. Wanted to do it yesterday, but stackoverflow was down. I tested and the `loop.function()` is faster - also for bigger datasets. However, it is a little weird to change the question in that way. You asked about vectorization, not about performance. My answer was an answer to the original question. – Georgery Jul 09 '20 at 07:55
  • Hi, I only changed the original post to clarify some aspects of the question, as you requested, and to add benchmarking for your code. The post was always tagged with performance and the first line always stated "I want to speed up a function..." – jayb Jul 09 '20 at 09:19