I often need to build a data.frame in R
without knowing beforehand how many rows I will need.
When the number of rows is known
nbrows = 10^5
d = data.frame(
    A=vector("numeric",nbrows),
    B=vector("numeric",nbrows)
)
for (i in 1:nbrows)
{
    ####
    # Do Stuff
    ####
    d[i,] = newline
}
When the number of rows is unknown: Method 1
Here is what I usually do:
d = data.frame(
    A=numeric(),
    B=numeric()
)
while (TRUE)
{
    ###
    # Do Stuff
    ###
    d = rbind(d, newlines) # Note that several lines might be added
    if (condition) break
}
However, I suppose rbind is costly in CPU time,
since a new data.frame is allocated and the existing rows are copied at every iteration.
When the number of rows is unknown: Method 2
It would be handy to allocate some memory beforehand (without initializing values, if possible) and to grow it only when the rows to be stored exceed what was previously allocated.
I suppose the following would be relatively efficient in comparison to Method 1:
expected_nbrows = 10^5
d = data.frame(
    A=vector("numeric",expected_nbrows),
    B=vector("numeric",expected_nbrows)
)
i = 1 # index of the next free row in `d`
while (TRUE)
{
    ####
    # Do Stuff
    ####
    nblines = nrow(newlines) # assumed to be at least 1
    while (i + nblines - 1 > nrow(d)) # if more memory is needed, double the memory allocated to `d` until `newlines` fits
    {
        d = rbind(
            d,
            data.frame(
                A=vector("numeric",nrow(d)),
                B=vector("numeric",nrow(d))
            )
        )
    }
    d[i:(i+nblines-1),] = newlines # fill exactly `nblines` rows
    i = i + nblines
    if (condition) break
}
d = d[seq_len(i-1),] # drop the allocated rows that were never filled
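One way to compare the two methods empirically is to wrap each in a function and time it with `system.time`. The following is only a minimal sketch under stated assumptions: `make_newlines()` is a hypothetical stand-in for the "Do Stuff" step (it returns 1 to 3 random rows), the loop runs a fixed `niter` times instead of testing `condition`, and the initial capacity of 16 rows is arbitrary. The row indices in Method 2 are written so that exactly `nblines` rows are filled per iteration, and the unused rows are trimmed at the end.

```r
# Hypothetical stand-in for "Do Stuff": returns 1 to 3 new rows per call.
make_newlines = function() {
    k = sample(1:3, 1)
    data.frame(A = runif(k), B = runif(k))
}

# Method 1: grow the data.frame with rbind at every iteration.
method1 = function(niter) {
    d = data.frame(A = numeric(), B = numeric())
    for (iter in seq_len(niter)) {
        d = rbind(d, make_newlines())
    }
    d
}

# Method 2: preallocate, double the capacity when full, trim at the end.
method2 = function(niter, expected_nbrows = 16) {
    d = data.frame(A = numeric(expected_nbrows), B = numeric(expected_nbrows))
    i = 1 # index of the next free row in `d`
    for (iter in seq_len(niter)) {
        newlines = make_newlines()
        nblines = nrow(newlines)
        while (i + nblines - 1 > nrow(d)) {
            d = rbind(d, data.frame(A = numeric(nrow(d)), B = numeric(nrow(d))))
        }
        d[i:(i + nblines - 1), ] = newlines
        i = i + nblines
    }
    d[seq_len(i - 1), ] # keep only the rows that were actually filled
}

niter = 2000
# Same seed for both runs, so both methods receive identical rows.
set.seed(1); t1 = system.time(d1 <- method1(niter))
set.seed(1); t2 = system.time(d2 <- method2(niter))
print(c(method1 = t1[["elapsed"]], method2 = t2[["elapsed"]]))
```

The timings depend on the machine, on `niter`, and on how many rows each iteration adds, so it is worth running this with values close to the real workload before drawing a conclusion.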
Question
Is Method 2 really more performant (in terms of CPU time) than Method 1?
Is there yet another better method?