
Unfortunately I got stuck and need your help.

I am initializing a data frame and trying to fill it with new rows in a loop. It almost works as it should, except that the first row gets an "NA" as its row.names value. Can anyone propose a solution and/or explain why this happens?

I am using the f3 approach from the answer to this question: "How to append rows to an R data frame"

Example:

df <- data.frame( "Type" = character(), 
                  "AvgError" = numeric(), 
                  "StandardDeviation"= numeric (), 
                  stringsAsFactors=FALSE)

for (i in 1:3){
  df[nrow(df) + 1, ]$Type           <- paste("Test", as.character(format(round(i, 2), nsmall = 2)))
  df[nrow(df), ]$AvgError           <- i/10
  df[nrow(df), ]$StandardDeviation  <- i/100
}

df
        Type AvgError StandardDeviation
NA Test 1.00      0.1              0.01
2  Test 2.00      0.2              0.02
3  Test 3.00      0.3              0.03

If I can provide any more information, please comment and I will try to provide what I can. Thanks for the help.

Edit: OK, thanks for the discussion so far. I understand (and already knew) that this is not the best way to do this, because it is much slower than a functional approach, but execution time is not important in this case. A workaround was provided in the comments by @MrFlick: just rename the row names at the end with rownames(df) <- 1:nrow(df). This helps, but it still feels unsatisfying to me, since it doesn't treat the cause but only deals with the symptom.
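
For completeness, this is the workaround applied to the example above. rownames(df) <- NULL should work just as well, since that resets a data frame to automatic row names:

# Workaround from the comments: rebuild the row names after the loop
rownames(df) <- 1:nrow(df)   # or: rownames(df) <- NULL

df
#        Type AvgError StandardDeviation
# 1 Test 1.00      0.1              0.01
# 2 Test 2.00      0.2              0.02
# 3 Test 3.00      0.3              0.03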

  • That question doesn't address the issue that it's generally a bad idea to build a data.frame row-by-row. It's better to build columns (vectors) of data and then combine them into a data.frame when you're all done. `i<-1:3; df<-data.frame(Type=paste("Test", as.character(format(round(i, 2), nsmall = 2))), AvgError=i/10, StandardDeviation=i/100)` – MrFlick May 16 '15 at 22:10
  • Thanks. Well, I am doing a whole bunch of other calculations in that loop, and the values I am putting in there are the results of multiple lines of code. This is just a simplified example. It won't affect running time much, because the rest takes a lot longer to run. Are there concerns other than running time here? Or why is it a bad idea? – cowhi May 16 '15 at 22:17
  • There are almost always better, more "R-like" ways to do stuff like this. This looks like code you would write for a procedural language, not a functional language like R. But if you don't care, you can just fix the `rownames()` at the end with `rownames(df)<-1:nrow(df)`. – MrFlick May 16 '15 at 22:34
  • @cowhi Check out my answer below -- I've shown that you could be wasting more than a minute on data frames as small as 20,000 rows from the reallocation that takes place when you append to a data frame. Unless you have a short data frame, you will likely take a performance hit from appending one row at a time. – josliber May 16 '15 at 22:52
  • Why are you using `f3` and not `f4` from the options provided? – A5C1D2H2I1M1N2O1R2T1 May 17 '15 at 06:54
  • @Ananda: As I said before, this approach is fast enough for me; it's not important that it is inefficient in this case. I am not interested in finding a more efficient solution, I just would like to know why this strange behavior is happening. I understand there are better options, but none of them helps me understand what is going on in my code. Thanks. – cowhi May 17 '15 at 16:54

1 Answer


Growing a data frame by appending one row at a time is inefficient because R has to reallocate the entire data frame at each iteration. Especially as you grow to large object sizes, this can make your code quite slow. You can read all about this issue in Circle 2 of The R Inferno.

As an example, consider your code versus similar code that computes each row of the data frame separately and then combines the rows at the end with do.call and rbind:

OP <- function(vals) {
  df <- data.frame( "Type" = character(), 
                    "AvgError" = numeric(), 
                    "StandardDeviation"= numeric (), 
                    stringsAsFactors=FALSE)
  for (i in vals){
    df[nrow(df) + 1, ]$Type           <- paste("Test", as.character(format(round(i, 2), nsmall = 2)))
    df[nrow(df), ]$AvgError           <- i/10
    df[nrow(df), ]$StandardDeviation  <- i/100
  }
  row.names(df) <- vals  # fix up the row names so the results compare equal
  df
}

josilber <- function(vals) {
  ret <- do.call(rbind, lapply(vals, function(x) {
    data.frame(Type=paste("Test", as.character(format(round(x, 2), nsmall = 2))),
               AvgError = x/10,
               StandardDeviation = x/100,
               stringsAsFactors=FALSE)
  }))
  ret
}

all.equal(OP(1:10000), josilber(1:10000))
# [1] TRUE
system.time(OP(1:10000))
#    user  system elapsed 
#  17.849   1.325  19.147 
system.time(josilber(1:10000))
#    user  system elapsed 
#   4.685   0.027   4.713 

For a data frame with 10,000 rows, the code that waits until the end to combine the rows is about 4 times faster than the code that continuously appends to the data frame. Basically, you've introduced about 15 seconds of delay from memory reallocation that had nothing to do with the per-row computation, and that's only at 10,000 rows. The wasted computation grows to about 64 seconds for data frames with 20,000 rows:

system.time(OP(1:20000))
#    user  system elapsed 
#  70.755   7.065  77.717 
system.time(josilber(1:20000))
#    user  system elapsed 
#  12.502   0.968  13.470 

As noted in the comments, there are much quicker ways to build these particular data frames (computing each variable in one shot with vectorized functions), but I've limited my function josilber to code that computes each row one-by-one to demonstrate that appending can still have significant performance implications.
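
As a footnote, here is a sketch of that fully vectorized construction -- essentially MrFlick's one-liner from the comments, wrapped in a function, with stringsAsFactors = FALSE added to match the output of the other functions. One caveat: format() applied to a whole vector pads every element to a common width, so for inputs spanning more digits than this example you may want sprintf("Test %.2f", vals) instead to match the per-row formatting:

vectorized <- function(vals) {
  # Build each column in one shot -- no loop, no reallocation
  data.frame(Type = paste("Test", format(round(vals, 2), nsmall = 2)),
             AvgError = vals / 10,
             StandardDeviation = vals / 100,
             stringsAsFactors = FALSE)
}

vectorized(1:3)
#        Type AvgError StandardDeviation
# 1 Test 1.00      0.1              0.01
# 2 Test 2.00      0.2              0.02
# 3 Test 3.00      0.3              0.03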

– josliber