26

I've begun to believe that data frames hold no advantages over matrices, except for notational convenience. However, I noticed this oddity when running unique on matrices and data frames: it seems to run faster on a data frame.

a   = matrix(sample(2,10^6,replace = TRUE), ncol = 10)
b   = as.data.frame(a)

system.time({
    u1 = unique(a)
})
 user  system elapsed
1.840   0.000   1.846


system.time({
    u2 = unique(b)
})
 user  system elapsed
0.380   0.000   0.379

The timing results diverge even more substantially as the number of rows is increased. So, there are two parts to this question.

  1. Why is this slower for a matrix? It seems faster to convert to a data frame, run unique, and then convert back.

  2. Is there any reason not to just wrap unique in myUnique, which does the conversions in part #1?
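
For concreteness, here is a minimal sketch of the wrapper I have in mind in part #2 (the name and the round-trip are illustrative; it assumes a type-homogeneous matrix, e.g. all numeric, so the conversions are lossless):

myUnique <- function(m) {
    # round-trip through a data frame to use the faster unique.data.frame,
    # then convert back to a matrix
    as.matrix(unique(as.data.frame(m)))
}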


Note 1. Given that a matrix is atomic, it seems that unique should be faster for a matrix, rather than slower. Being able to iterate over fixed-size, contiguous blocks of memory should generally be faster than walking a list of separate column vectors (which, as I understand it, is how data frames are implemented).

Note 2. As demonstrated by the performance of data.table, running unique on a data frame or a matrix is a comparatively bad idea - see the answer by Matthew Dowle and the comments for relative timings. I've migrated a lot of objects to data tables, and this performance is another reason to do so. So although users would be well served to adopt data tables, for pedagogical / community reasons I'll leave the question open for now, regarding why this takes longer on matrix objects. The answers below address where the time goes and how else we can get better performance (i.e. data tables). The answer to why is close at hand - the code can be found via unique.data.frame and unique.matrix. :) An English explanation of what it's doing and why is all that is lacking.
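
For anyone who wants to read along, both implementations can be printed directly from an R session:

getS3method("unique", "matrix")       # same code as unique.array (see the answers)
getS3method("unique", "data.frame")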

Iterator

3 Answers

13
  1. Not sure, but I'd guess that because a matrix is one contiguous vector, R copies it into column vectors first (like a data.frame), since paste needs a list of vectors. Note that both are slow because both use paste.

  2. Perhaps because unique.data.table is already many times faster. Please upgrade to v1.6.7 by downloading it from the R-Forge repository because that has the fix to unique you raised in this question. data.table doesn't use paste to do unique.

library(data.table)   # for as.data.table(); v1.6.7+ from R-Forge at the time
a = matrix(sample(2,10^6,replace = TRUE), ncol = 10)
b = as.data.frame(a)
system.time(u1 <- unique(a))
   user  system elapsed 
   2.98    0.00    2.99 
system.time(u2 <- unique(b))
   user  system elapsed 
   0.99    0.00    0.99 
c = as.data.table(b)
system.time(u3 <- unique(c))
   user  system elapsed 
   0.03    0.02    0.05  # 60 times faster than u1, 20 times faster than u2
identical(as.data.table(u2), u3)
[1] TRUE
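
As the comments below note, keying the table first gives a further speedup. A sketch of that variant (the column names V1..V10 come from the as.data.frame step; timings are in the comments, not reproduced here):

setkey(c, V1, V2, V3, V4, V5, V6, V7, V8, V9, V10)   # sort and key by every column
system.time(u4 <- unique(c))                          # unique on the keyed table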
Matt Dowle
  • +1 Nice! I get even larger speedups on the 5X dataset (60X for the original set, and about 600X if `setkey` is used first; relative to `u1`), so this is all that much more compelling to me. In the R chat room, it was mentioned that the matrix version uses `paste` on every row. In any case, avoiding `paste` is a good idea for a computation as primitive as "unique", and `data.table()` certainly does very well. (I think switching to data.table is probably the best way to do this, but I'll leave the question open - if someone pinpoints the issue, they'll have answered the original question.) – Iterator Oct 18 '11 at 18:16
  • @Iterator Ok, I like your logic. – Matt Dowle Oct 18 '11 at 19:26
  • unique.data.table is a very different function from unique.matrix and unique.data.frame: (1) it only checks uniqueness of the key, not the whole row, and (2) it checks for identity of the variable, not of the character representation of the variables (which is why the others need all those paste calls; see the documentation). – Allan Engelhardt Oct 19 '11 at 06:14
  • @Allan Engelhardt (1) Was true in 1.6.6 on CRAN but 1.6.7 on R-Forge checks the whole row when there is no key (and now works, thanks to another Iterator question, see link above). Otherwise `identical()` wouldn't have returned `TRUE` above. (2) That's only relevant for numeric columns. `paste` is a slow solution otherwise (i.e. integer, factor and character columns). `unique.data.table` could be changed to take account of `.Machine$double.eps` (and is now [FR#1626](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1626&group_id=240&atid=978), thanks) – Matt Dowle Oct 19 '11 at 09:13
  • `unique.data.table` now respects tolerance on double precision columns, in v1.7.2. Similar fix to `duplicated.data.table`, and both now work on unkeyed tables as well as keyed tables. – Matt Dowle Oct 31 '11 at 23:54
13
  1. In this implementation, unique.matrix is the same as unique.array

    > identical(unique.array, unique.matrix)

    [1] TRUE

  2. unique.array has to handle multi-dimensional arrays, which requires additional processing to ‘collapse’ the extra dimensions (those extra calls to paste()) that is not needed in the 2-dimensional case; a standalone sketch of this collapse step follows the list. The key section of code is:

    collapse <- (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)

    temp <- if (collapse) apply(x, MARGIN, function(x) paste(x, collapse = "\r"))

  3. unique.data.frame is optimised for the 2D case; unique.matrix is not. It could be, as you suggest - it just isn't in the current implementation.
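
In the 2D row-wise case, that collapse amounts to pasting each row into a single string and running duplicated on the strings. A minimal standalone sketch (simplified from the base R source, not the verbatim implementation):

x <- matrix(c(1, 2, 1, 3, 4, 3), nrow = 3)   # rows 1 and 3 are identical
keys <- apply(x, 1L, function(row) paste(row, collapse = "\r"))
x[!duplicated(keys), , drop = FALSE]         # keeps rows 1 and 2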

Note that in all cases (unique.{array,matrix,data.frame}) where there is more than one dimension it is the string representation that is compared for uniqueness. For floating-point numbers this means 15 decimal digits, so

NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 2), nrow = 2)))

is 1 while

NROW(unique(a <- matrix(rep(c(1, 1+5e-15), 2), nrow = 2)))

and

NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 1), nrow = 2)))

are both 2. Are you sure unique is what you want?

Allan Engelhardt
  • +1 Thanks for the correct answer! Sorry for a slow accept. I added a bit of code for point #2 to make the answer more precise. As for whether it's what I want - I want unique rows, and this is the way to get it. However, the version for data.table is very fast, so I will be using that. In most of my cases, the implementation of the data structure isn't particularly interesting, but speed and ease of use do matter. I'll switch to whatever satisfies my need for speed. :) – Iterator Oct 22 '11 at 04:10
9

In attempting to answer my own question, especially part 1, we can see where the time is spent by looking at the results of Rprof. I ran this again, with 5M elements.
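
For reference, here is a sketch of how such profiles can be generated; the object construction is my assumption, mirroring the original example scaled to 5M elements:

a <- matrix(sample(2, 5*10^6, replace = TRUE), ncol = 10)
b <- as.data.frame(a)
Rprof("u1.txt"); u1 <- unique(a); Rprof(NULL)   # profile the matrix version
Rprof("u2.txt"); u2 <- unique(b); Rprof(NULL)   # profile the data frame version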

Here are the results for the first unique operation (for the matrix):

> summaryRprof("u1.txt")
$by.self
                     self.time self.pct total.time total.pct
"paste"                   5.70    52.58       5.96     54.98
"apply"                   2.70    24.91      10.68     98.52
"FUN"                     0.86     7.93       6.82     62.92
"lapply"                  0.82     7.56       1.00      9.23
"list"                    0.30     2.77       0.30      2.77
"!"                       0.14     1.29       0.14      1.29
"c"                       0.10     0.92       0.10      0.92
"unlist"                  0.08     0.74       1.08      9.96
"aperm.default"           0.06     0.55       0.06      0.55
"is.null"                 0.06     0.55       0.06      0.55
"duplicated.default"      0.02     0.18       0.02      0.18

$by.total
                     total.time total.pct self.time self.pct
"unique"                  10.84    100.00      0.00     0.00
"unique.matrix"           10.84    100.00      0.00     0.00
"apply"                   10.68     98.52      2.70    24.91
"FUN"                      6.82     62.92      0.86     7.93
"paste"                    5.96     54.98      5.70    52.58
"unlist"                   1.08      9.96      0.08     0.74
"lapply"                   1.00      9.23      0.82     7.56
"list"                     0.30      2.77      0.30     2.77
"!"                        0.14      1.29      0.14     1.29
"do.call"                  0.14      1.29      0.00     0.00
"c"                        0.10      0.92      0.10     0.92
"aperm.default"            0.06      0.55      0.06     0.55
"is.null"                  0.06      0.55      0.06     0.55
"aperm"                    0.06      0.55      0.00     0.00
"duplicated.default"       0.02      0.18      0.02     0.18

$sample.interval
[1] 0.02

$sampling.time
[1] 10.84

And for the data frame:

> summaryRprof("u2.txt")
$by.self
                     self.time self.pct total.time total.pct
"paste"                   1.72    94.51       1.72     94.51
"[.data.frame"            0.06     3.30       1.82    100.00
"duplicated.default"      0.04     2.20       0.04      2.20

$by.total
                        total.time total.pct self.time self.pct
"[.data.frame"                1.82    100.00      0.06     3.30
"["                           1.82    100.00      0.00     0.00
"unique"                      1.82    100.00      0.00     0.00
"unique.data.frame"           1.82    100.00      0.00     0.00
"duplicated"                  1.76     96.70      0.00     0.00
"duplicated.data.frame"       1.76     96.70      0.00     0.00
"paste"                       1.72     94.51      1.72    94.51
"do.call"                     1.72     94.51      0.00     0.00
"duplicated.default"          0.04      2.20      0.04     2.20

$sample.interval
[1] 0.02

$sampling.time
[1] 1.82

What we notice is that the matrix version spends a lot of time in apply, paste, and lapply. In contrast, the data frame version simply runs duplicated.data.frame, and most of its time is spent in paste, presumably collapsing each row into a single string for comparison.

Although this explains where the time is going, it doesn't explain why the two methods have different implementations, nor why simply changing from one object type to another has such an effect.

Iterator
  • Notice that with 5X as many rows (500K), the matrix version takes 6X as long as before, while the DF version takes 4.7 times as long. 8 seconds is a pretty considerable difference, and yet this is all I was willing to wait for this example. I noticed it when I ran an operation that took 20+ minutes - a much larger matrix. In an earlier version of the code, the same object was a data frame, so I knew the code used to be faster. – Iterator Oct 18 '11 at 15:44
  • So that would indicate that there is no "logical" (read: principled) reason for the difference between matrix and data.frame; it's just the state of the software: one method was optimized properly (i.e. written in C, most likely), while the other one was just written in R. – Tomas Oct 18 '11 at 23:00
  • @TomasT. Actually, there is a difference in the code at the R level. I've been waiting for someone else to look at the code & write an answer, to give them credit for the exposition. When I asked the question, I was not yet alert. With the timings in this answer, the reason the code is slow makes sense. I'll wait a bit longer and answer this myself if someone else doesn't. I think it's a nice little bit of R pedagogy to describe how to find out what's going on and then explain what the code is doing. In practice, we should use data tables instead. It is now much faster. – Iterator Oct 18 '11 at 23:05
  • @Iterator I'm interested in a related question: why `unique` is so much faster when applied to `data.table` vs `data.frame`, which in one of my tests was so stark that the latter could not even successfully complete whereas the former did in less than a second. Have you explored this topic in-depth enough to be able to say why the implementation for `data.frame` seems *comparatively* so inefficient? – Rookatu Nov 08 '18 at 02:07