Why were pandas merges in python faster than data.table merges in R in 2012?

Question

I recently came across the pandas library for python, which according to this benchmark performs very fast in-memory merges. It's even faster than the data.table package in R (my language of choice for analysis).

Why is pandas so much faster than data.table? Is it because of an inherent speed advantage python has over R, or is there some tradeoff I'm not aware of? Is there a way to perform inner and outer joins in data.table without resorting to merge(X, Y, all=FALSE) and merge(X, Y, all=TRUE)?

Comparison

Here's the R code and the Python code used to benchmark the various packages.

My hypothesis: because data.table is based on data.frame and data.frames are slow. And I think most of the pandas merge code is in Cython. — Joshua Ulrich, Jan 24 '12 at 18:15
@JoshuaUlrich: IIRC `data.table` just inherits from `data.frame`, but it relies on C-code under the hood. — digEmAll, Jan 24 '12 at 18:25
@digEmAll: data.frames are slow even if you manipulate them in C, but I've never looked at the data.table source. — Joshua Ulrich, Jan 24 '12 at 18:36
@JoshuaUlrich: it implements basically a binary search on a sorted data.frame. But yes, probably is slower than pandas because of the access to data.frame's (I wouldn't call them "slow" though, just "slower") — digEmAll, Jan 24 '12 at 20:53
@Joshua What do you mean by "data.frames are slow even if you manipulate them in C"? Is that relative to something else? And slow at what? — Matt Dowle, Jan 25 '12 at 05:40
@digEmAll `data.frame`'s are a vector of pointers to column vectors. What can be better than that as a storage mechanism? I know that row storage (as in SQL) is much worse for the types of ordered data a lot of us deal with. How does pandas store a DataFrame? I'm not saying `data.frame` is fast, I'm just saying there isn't anything fundamentally wrong with how it is stored in memory, afaik. — Matt Dowle, Jan 25 '12 at 05:43
@MatthewDowle: I agree, but having no direct experience in manipulating data.frame's in C I couldn't prove JoshuaUlrich opinion was wrong. Anyway, as you can notice, I didn't fully accept the label of "slow" ;) — digEmAll, Jan 25 '12 at 07:38
@MatthewDowle: relative to matrices; slower at accessing elements of the object. — Joshua Ulrich, Jan 25 '12 at 11:06
@Joshua That's simply not true at C level. Do you mean at R level? If so, have you seen `:=` in `data.table` which, for one particular easily reproducible 'element access' test, appears to be [500 times faster](http://stackoverflow.com/questions/7029944/when-should-i-use-the-operator-in-data-table) than `data.frame`? Is that what you mean? `data.frame` can be fast when manipulated in C, as `data.table` does, iiuc. — Matt Dowle, Jan 25 '12 at 13:25
@MatthewDowle: in the example you link to, the `:=` operator still appears to be several times slower than matrix access (`system.time(for (i in 1:1000) m[i,1] <- i)`). I.e. you're still slower than matrix access even after all the awesome stuff you've done at the C level. — Joshua Ulrich, Jan 25 '12 at 14:09
@Joshua Ah ok. I guess I was content at the time to reduce 10 mins to 1 sec ;) What do you see 1 sec reduce to with matrix? That sub 1 sec difference (?) may be down to checks at the top of [.data.table. I could take a look. The test is a little unrealistic as we can usually vectorize of course, but worth looking into. I don't think it's anything at C level, though. — Matt Dowle, Jan 25 '12 at 14:56
@MatthewDowle: data.table took ~0.5s and matrix is 0.02s (at most). We can talk more about this in chat or over email (my address is easy to find). — Joshua Ulrich, Jan 25 '12 at 15:05
@JoshuaUlrich I just noticed this comment trail was never put to bed. So to clear it up: `set()` was added to `data.table` soon after this discussion. Very similar to `:=` but avoids the small overhead of `[.data.table` when looped and is consequently as fast as `matrix`. Therefore, `data.frame` _can_ be manipulated just as fast as matrix. Benchmark is [here](http://stackoverflow.com/a/7030140/403310). — Matt Dowle, Oct 02 '13 at 20:20
Can we get an updated version of this benchmark, it is pretty clear that this bench was actually an edge case and that this is fixed by now. Given that all benchmarks I have seen show that data.table is faster I'd like to see what the merge number are ? — statquant, Apr 02 '16 at 07:57
@statquant I didn't run the original benchmark, but I'd really love to see Wes update the benchmark. — Zach, Apr 04 '16 at 18:03
Related: I built a [cross reference guide](https://github.com/ben519/DataWrangling) for `pandas` and `data.table`, including some `make_data()` functions so you can easily compare the runtime of just about any operation in data.table to pandas using realistic data of any size. — Ben, Sep 30 '16 at 17:15

score 191 · Answer 1 · edited Mar 24 '16 at 11:30

191

The reason pandas is faster is because I came up with a better algorithm, which is implemented very carefully using a fast hash table implementation - klib and in C/Cython to avoid the Python interpreter overhead for the non-vectorizable parts. The algorithm is described in some detail in my presentation: A look inside pandas design and development.

The comparison with data.table is actually a bit interesting because the whole point of R's data.table is that it contains pre-computed indexes for various columns to accelerate operations like data selection and merges. In this case (database joins) pandas' DataFrame contains no pre-computed information that is being used for the merge, so to speak it's a "cold" merge. If I had stored the factorized versions of the join keys, the join would be significantly faster - as factorizing is the biggest bottleneck for this algorithm.

I should also add that the internal design of pandas' DataFrame is much more amenable to these kinds of operations than R's data.frame (which is just a list of arrays internally).

edited Mar 24 '16 at 11:30

Alex Riley

169,130
45
262
238

answered Jan 24 '12 at 19:17

Wes McKinney

101,437
32
142
108

79

Of course, now that you've figured it all out in python, it should be easy to translate into R ;) – hadley Jan 24 '12 at 22:35
37

But why would anyone ever want to? :) – ely Jul 13 '12 at 03:07
9

Umm ... maybe because they would want data operations to be faster in R? Just guessing :)) – lebatsnok Sep 28 '13 at 15:24
4

I am the 100 ! A new benchmark would be usefull, did I win this gift fot the 100th vote ? – statquant Oct 20 '13 at 14:18
33

Hi Wes-- it seems that your results for `data.table` were primary driven by a bug that has since been fixed. Any chance you could re-run your benchmark and write an updated blog post? – Zach Apr 21 '14 at 19:42
9

Zach make sure you check this out: https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping – Merik Sep 09 '15 at 22:07
The link for "A look inside..." is broken, you might want to fix it to http://wesmckinney.com/blog/nycpython-1102012-a-look-inside-pandas-design-and-development/ . – lmsasu Jan 13 '16 at 07:50
Is it possible to direct pandas to create an index on sortable/comparable data columns in two dataframes before joining? Specifically asking for inner joins where having sorted indexed join columns can give me merge joins which might be faster – faizan Sep 18 '17 at 08:42
4

Updated benchmarks are here, which show 1) data.table is faster even for "cold" grouping 2) pandas doesn't have the memory efficiency necessary to do grouping with a 50gb table in 125gb of ram https://h2oai.github.io/db-benchmark/ – Michael Feb 12 '21 at 20:25

Matt Dowle · Accepted Answer · 2013-09-26T09:46:45.943

It looks like Wes may have discovered a known issue in data.table when the number of unique strings (levels) is large: 10,000.

Does Rprof() reveal most of the time spent in the call sortedmatch(levels(i[[lc]]), levels(x[[rc]])? This isn't really the join itself (the algorithm), but a preliminary step.

Recent efforts have gone into allowing character columns in keys, which should resolve that issue by integrating more closely with R's own global string hash table. Some benchmark results are already reported by test.data.table() but that code isn't hooked up yet to replace the levels to levels match.

Are pandas merges faster than data.table for regular integer columns? That should be a way to isolate the algorithm itself vs factor issues.

Also, data.table has time series merge in mind. Two aspects to that: i) multi column ordered keys such as (id,datetime) ii) fast prevailing join (roll=TRUE) a.k.a. last observation carried forward.

I'll need some time to confirm as it's the first I've seen of the comparison to data.table as presented.

UPDATE from data.table v1.8.0 released July 2012

Internal function sortedmatch() removed and replaced with chmatch() when matching i levels to x levels for columns of type 'factor'. This preliminary step was causing a (known) significant slowdown when the number of levels of a factor column was large (e.g. >10,000). Exacerbated in tests of joining four such columns, as demonstrated by Wes McKinney (author of Python package Pandas). Matching 1 million strings of which of which 600,000 are unique is now reduced from 16s to 0.5s, for example.

also in that release was :

character columns are now allowed in keys and are preferred to factor. data.table() and setkey() no longer coerce character to factor. Factors are still supported. Implements FR#1493, FR#1224 and (partially) FR#951.
New functions chmatch() and %chin%, faster versions of match() and %in% for character vectors. R's internal string cache is utilised (no hash table is built). They are about 4 times faster than match() on the example in ?chmatch.

As of Sep 2013 data.table is v1.8.10 on CRAN and we're working on v1.9.0. NEWS is updated live.

But as I wrote originally, above :

data.table has time series merge in mind. Two aspects to that: i) multi column ordered keys such as (id,datetime) ii) fast prevailing join (roll=TRUE) a.k.a. last observation carried forward.

So the Pandas equi join of two character columns is probably still faster than data.table. Since it sounds like it hashes the combined two columns. data.table doesn't hash the key because it has prevailing ordered joins in mind. A "key" in data.table is literally just the sort order (similar to a clustered index in SQL; i.e., that's how the data is ordered in RAM). On the list is to add secondary keys, for example.

In summary, the glaring speed difference highlighted by this particular two-character-column test with over 10,000 unique strings shouldn't be as bad now, since the known problem has been fixed.

If you supply a test case for a reasonably large, realistic data set, I'll be happy to run the benchmarks. You're more than welcome to, also. I actually have not yet optimized the code for the integer join key case (put that on my todo list!), but you can expect significantly better performance than the string case given the hash table study in the linked presentation. — Wes McKinney, Jan 25 '12 at 15:08
I don't use either of these libraries but pleased to see a constructive response from the R side in the shape of Matthew Dowle. — SlowLearner, Jan 26 '12 at 00:06
Here's some Rprof results http://pastie.org/3258362. It looks like 20-40% of the time is spent in sortedmatch depending on the join type. Will have to look into integer columns another time-- I made a pandas GitHub issue to remind me to optimize that case (https://github.com/wesm/pandas/issues/682) — Wes McKinney, Jan 26 '12 at 19:36
@AndyHayden Improvements were made some time ago. I'll edit in the NEWS items. Wes picked on one specific test (equi joining two character columns) which played on that known problem. If he'd picked integer columns it would have been different. And if he'd given me a heads up before presenting the benchmark at the conference then I could have told him more about the known problem. — Matt Dowle, Sep 26 '13 at 08:52

score 39 · Answer 3 · edited Apr 19 '17 at 22:55

This topic is two years old but seems like a probable place for people to land when they search for comparisons of Pandas and data.table

Since both of these have evolved over time, I want to post a relatively newer comparison (from 2014) here for the interested users: https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping

It would be interesting to know if Wes and/or Matt (who, by the way, are creators of Pandas and data.table respectively and have both commented above) have any news to add here as well.

-- UPDATE --

A comment posted below by jangorecki contains a link that I think is very useful: https://github.com/szilard/benchm-databases

This graph depicts the average times of aggregation and join operations for different technologies (lower = faster; comparison last updated in Sept 2016). It was really educational for me.

Going back to the question, R DT key and R DT refer to the keyed/unkeyed flavors of R's data.table and happen to be faster in this benchmark than Python's Pandas (Py pandas).

@Zach see this: https://github.com/szilard/benchm-databases and that is nice too: https://speakerdeck.com/szilard/r-stories-from-the-trenches-budapest-r-meetup-aug-2015 — jangorecki, Sep 13 '15 at 21:14
@Zach a four years later new benchmark results finally came up, see my answer below. — jangorecki, Nov 15 '19 at 04:17

jangorecki · Answer 4 · 2019-11-15T04:13:22.843

There are great answers, notably made by authors of both tools that question asks about. Matt's answer explain the case reported in the question, that it was caused by a bug, and not an merge algorithm. Bug was fixed on the next day, more than a 7 years ago already.

In my answer I will provide some up-to-date timings of merging operation for data.table and pandas. Note that plyr and base R merge are not included.

Timings I am presenting are coming from db-benchmark project, a continuously run reproducible benchmark. It upgrades tools to recent versions and re-run benchmark scripts. It runs many other software solutions. If you are interested in Spark, Dask and few others be sure to check the link.

As of now... (still to be implemented: one more data size and 5 more questions)

We tests 2 different data sizes of LHS table.
For each of those data sizes we run 5 different merge questions.

q1: LHS inner join RHS-small on integer
q2: LHS inner join RHS-medium on integer
q3: LHS outer join RHS-medium on integer
q4: LHS inner join RHS-medium on factor (categorical)
q5: LHS inner join RHS-big on integer

RHS table is of 3 various sizes

small translates to size of LHS/1e6
medium translates to size of LHS/1e3
big translates to size of LHS

In all cases there are around 90% of matching rows between LHS and RHS, and no duplicates in RHS joining column (no cartesian product).

As of now (run on 2nd Nov 2019)

pandas 0.25.3 released on 1st Nov 2019
data.table 0.12.7 (92abb70) released on 2nd Nov 2019

Below timings are in seconds, for two different data sizes of LHS. Column pd2dt is added field storing ratio of how many times pandas is slower than data.table.

0.5 GB LHS data

+-----------+--------------+----------+--------+
| question  |  data.table  |  pandas  |  pd2dt |
+-----------+--------------+----------+--------+
| q1        |        0.51  |    3.60  |      7 |
| q2        |        0.50  |    7.37  |     14 |
| q3        |        0.90  |    4.82  |      5 |
| q4        |        0.47  |    5.86  |     12 |
| q5        |        2.55  |   54.10  |     21 |
+-----------+--------------+----------+--------+

5 GB LHS data

+-----------+--------------+----------+--------+
| question  |  data.table  |  pandas  |  pd2dt |
+-----------+--------------+----------+--------+
| q1        |        6.32  |    89.0  |     14 |
| q2        |        5.72  |   108.0  |     18 |
| q3        |       11.00  |    56.9  |      5 |
| q4        |        5.57  |    90.1  |     16 |
| q5        |       30.70  |   731.0  |     23 |
+-----------+--------------+----------+--------+

Thank you for the update from the future! Could you add a column for the R vs python implementation of data.table? — Zach, Jan 03 '20 at 19:20
I think it is good to just go to website and check it, even for looking at R dt vs pandas. And pyDT was not part of original question really. — jangorecki, Jan 04 '20 at 07:16

Why were pandas merges in python faster than data.table merges in R in 2012?

4 Answers4

Linked