
I have searched the Internet and SO for an introduction to, or analysis of, what makes data.table so fast, but I've only found a lot of (very helpful) manuals and no breakdown of what goes into the programming. (I am more or less completely floored that I can't locate a published paper for data.table, not even something from JStatSoft.)

I've had an algorithms class so I know about sorts and linked lists and binary trees and such, but I don't want to make any amateur guesses (especially when I go to explain to academic people why it's a good idea to use it). Can anyone offer a short, topical summary with references? This question references a slide presentation which is cool, but the info comes in pieces (and even the documentation for, say, setkey() doesn't cite a data.table reference, but goes to Wikipedia).

What I am looking for is neither the source code nor a list of Wikipedia topics, but ideally an "official", sourced answer (thus making it canonical, which could help a lot with all the questions orbiting around this topic).

(It would be great if there were a technical paper out there I could cite for this. The citation() for data.table is just the manual, but of course that's not directly relevant to the question as far as SO is concerned.)

bright-star
  • Darn, I thought this was an okay fit for the site. What did I miss? – bright-star May 20 '14 at 12:47
  • Not my downvote, but "why is `data.table` so great" is pretty subjective, also "Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow" [FAQ](https://stackoverflow.com/help/on-topic) – DNA May 20 '14 at 12:51
  • Definitely, the technical paper is just my thing. I'll change the wording of the question so it doesn't gush so much. – bright-star May 20 '14 at 12:52
  • +1 I hope Arun or Matt will write a nice canonical answer. I expect that it will include assignment by reference / overallocation, fast sorting and binary search. – Roland May 20 '14 at 13:05
  • useful, but still doesn't seem to be within the scope of the site to me ("specific programming questions") – Ben Bolker May 20 '14 at 13:39
  • Do you think it would be specific if I asked for an answer that linked each optimization to a named use case? The docs for `data.table` are already sort of written that way. – bright-star May 20 '14 at 13:43
  • I don't think it's off-topic per se, since it's clearly about "software algorithms" in a sense. The problem I see is that it asks for external sources, which seems somewhat off-topic. Maybe if you reformulate your question to be a bit more technical and ask for the answer directly, rather than a reference to the answer. It would also help to restrict yourself to a specific operation, because the question as such is really very broad – Niklas B. May 20 '14 at 13:47
  • Well, yes, my personal benefit is the sources, but the site's benefit is that the answer is sourced, right? – bright-star May 20 '14 at 13:49
  • Yes, exactly. SO wants to be self-contained. And good answers often make marvelous sources in themselves. Of course a good answer could contain references as well, for further reading, but it should contain all the relevant information to answer the specific question – Niklas B. May 20 '14 at 13:52

0 Answers