
I often need to build a data.frame in R without knowing beforehand how many rows it will need.

When the number of rows is known

nbrows = 10^5
d = data.frame(
  A=vector("numeric",nbrows),
  B=vector("numeric",nbrows)
)
for (i in seq_len(nbrows))
{
  ####
  # Do Stuff (produces `newline`, one row with values for A and B)
  ####
  d[i,] = newline
}

When the number of rows is unknown: Method 1

Here is what I usually do:

d = data.frame(
  A=numeric(),
  B=numeric()
)

while (TRUE)
{
  ###
  # Do Stuff
  ###
  d = rbind(d, newlines) # Note that several lines might be added
  if (condition) break
}

However, I suppose rbind is costly in time, since new memory is allocated at every iteration.

When the number of rows is unknown: Method 2

It would be handy to allocate some memory beforehand (without initializing the values, if possible) and to grow that allocation only when the rows written exceed what was previously allocated.

I suppose the following would be relatively efficient compared to Method 1:

expected_nbrows = 10^5
d = data.frame(
  A=vector("numeric",expected_nbrows),
  B=vector("numeric",expected_nbrows)
)
i = 1
while (TRUE)
{
  ####
  # Do Stuff
  ####
  nblines = nrow(newlines)
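  # If more memory is needed, double the memory allocated to `d`
  # (repeated in case a single doubling is not enough)
  while (i + nblines - 1 > nrow(d))
  {
    d = rbind(
      d,
      data.frame(
        A=vector("numeric",nrow(d)),
        B=vector("numeric",nrow(d))
      )
    )
  }
  d[i:(i+nblines-1),] = newlines # rows i to i+nblines-1, i.e. exactly nblines rows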
  i = i + nblines
  if (condition) break
}
d = d[seq_len(i-1),] # drop the preallocated rows that were never filled
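To make the comparison concrete, here is a minimal, self-contained sketch that times both methods. It is only a sketch under stated assumptions: `make_newlines`, `grow_rbind` and `grow_doubling` are hypothetical names introduced for illustration, `make_newlines` stands in for "Do Stuff", and a fixed number of iterations replaces `condition`. The timings will depend on the machine and on the chunk size, so no result is claimed here.

```r
# Hypothetical stand-in for "Do Stuff": returns k new rows.
make_newlines <- function(k) {
  data.frame(A = as.numeric(seq_len(k)), B = as.numeric(seq_len(k)) * 2)
}

# Method 1: grow the data.frame with rbind at every iteration.
grow_rbind <- function(n_iter, k = 3) {
  d <- data.frame(A = numeric(), B = numeric())
  for (j in seq_len(n_iter)) {
    d <- rbind(d, make_newlines(k))
  }
  d
}

# Method 2: preallocate, double the capacity when full, trim at the end.
grow_doubling <- function(n_iter, k = 3, expected_nbrows = 16) {
  stopifnot(expected_nbrows >= 1)  # doubling zero rows would never grow
  d <- data.frame(A = numeric(expected_nbrows), B = numeric(expected_nbrows))
  i <- 1  # index of the next free row
  for (j in seq_len(n_iter)) {
    newlines <- make_newlines(k)
    nblines <- nrow(newlines)
    # Double until rows i .. i + nblines - 1 fit.
    while (i + nblines - 1 > nrow(d)) {
      d <- rbind(d, data.frame(A = numeric(nrow(d)), B = numeric(nrow(d))))
    }
    d[i:(i + nblines - 1), ] <- newlines
    i <- i + nblines
  }
  # Drop the preallocated rows that were never filled.
  d[seq_len(i - 1), , drop = FALSE]
}

n_iter <- 500
t1 <- system.time(d1 <- grow_rbind(n_iter))
t2 <- system.time(d2 <- grow_doubling(n_iter))
print(c(method1 = t1[["elapsed"]], method2 = t2[["elapsed"]]))
```

Both functions should return the same rows, so `all.equal(d1$A, d2$A)` is a quick sanity check before trusting the timings.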

Question

Is Method 2 really more performant (in terms of CPU time) than Method 1?

Is there yet another better method?
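One candidate I have not benchmarked is to collect the new rows in a list and call rbind only once at the end. The sketch below assumes a hypothetical `make_newlines` helper standing in for "Do Stuff", a hypothetical `grow_list` wrapper, and a fixed iteration count instead of `condition`; whether it beats Methods 1 and 2 is exactly what I would like to know.

```r
# Hypothetical stand-in for "Do Stuff": returns k new rows.
make_newlines <- function(k) {
  data.frame(A = as.numeric(seq_len(k)), B = as.numeric(seq_len(k)) * 2)
}

# Collect each chunk in a list, then bind all chunks in a single call.
# Assumes n_iter >= 1 (do.call(rbind, list()) would return NULL).
grow_list <- function(n_iter, k = 3) {
  chunks <- list()
  for (j in seq_len(n_iter)) {
    chunks[[length(chunks) + 1]] <- make_newlines(k)
  }
  do.call(rbind, chunks)
}

d3 <- grow_list(100)
```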

Remi.b
  • duplicate? http://stackoverflow.com/questions/11486369/growing-a-data-frame-in-a-memory-efficient-manner – dww May 30 '16 at 02:26
  • I suppose, but the specific methods you suggest aren't compared there. So yes, use data table if you want the fastest way. But I'm more curious about your existing question, because I was pretty sure I'd read somewhere that just assigning to the rows you want, without explicitly creating first (which does work) does the doubling thing behind the scenes. Haven't found any references about that yet, but if I'm still curious, I guess I'll have to write my own question. – Aaron left Stack Overflow May 30 '16 at 02:52
  • @Aaron, or maybe Remi.b could benchmark his/her methods against those in the other question and add them as an answer there. Would that satisfy your curiosity, or is it more than benchmarking that you're after? – dww May 30 '16 at 03:06
  • I'm rereading those answers now and they all seem to assume preallocation is possible and focus on methods for accessing and replacing elements, instead of growing the output to an undetermined size. Am I missing something? – Aaron left Stack Overflow May 30 '16 at 03:23
  • FWIW, just assigning to the rows doesn't double as it goes. – Aaron left Stack Overflow May 30 '16 at 03:43

0 Answers