
My current project is to take a dataset with aggregated data (e.g., time intervals like '2002-2006') and turn it into de-aggregated rows (one row per year: one each for 2002, 2003, 2004, 2005, and 2006).

To create the new rows I have to parse each aggregated row, do some calculations, and compile (coalesce?) the results into the new rows. I've already written the code to do all of that.
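For reference, here is a minimal sketch of the kind of expansion I mean; the `period` and `value` columns are just placeholders, not my actual data, and the way the value is split across the years is only illustrative:

# one aggregated row with placeholder columns
oneRow  <- data.frame(period = "2002-2006", value = 100, stringsAsFactors = FALSE)

# parse the interval and build one new row per year
bounds  <- as.integer(strsplit(oneRow$period, "-")[[1]])
years   <- seq(bounds[1], bounds[2])
newRows <- data.frame(year = years, value = oneRow$value / length(years))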

The current issue is that I'm looking for a computationally fast way to append rows to the end of a dataframe and then check the number of rows in the dataframe. I've tried a couple of things; below is some pseudocode to do the row appending.

mydf <- rbind(mydf, newRow1, newRow2)
if (nrow(mydf) %% 200 == 0) {
    print(nrow(mydf))
}

but that gets slow pretty fast.

I also tried using a counter and writing the data directly into the proper rows, like this:

mydf[2*counter - 1, ] <- newRow1
mydf[2*counter, ] <- newRow2
if (nrow(mydf) %% 200 == 0) {
    print(nrow(mydf))
}

but that slows down rather quickly also.

Is there a fast way to do this? I have about 200,000 rows, so even the simple example above would produce a 1,000,000-row output and a whole lot of row appending. Is the slowness simply a function of the computing resources available on my computer? Would it go faster if I didn't ask it to print out the progress? Should I simply accept that this will take a really long time (and just start it and let it run overnight)?

  • The best way to incrementally add rows to a `data.frame` is to not do it that way. Come up with your new rows and then do a one-time rbind with `do.call(rbind, list_of_new_frames)` (see the sketch after these comments). See https://stackoverflow.com/a/24376207/3358272 for a good discussion on lists-of-frames (though some of it may not directly apply to you). – r2evans Jul 22 '20 at 18:48
    The "why" of that is this: every time you add a row to a frame, it makes a complete copy of that frame in memory. This means that if you start with 500 rows, then at some point in time you have 500 rows and 501 rows in memory, with the 500-row version likely being garbage-collected. If you do this 100 times, then you can see how copying 500 rows 100 times seems unnecessary. If you have many more rows, this starts to get expensive, and code blocks that rely on this will see incremental slow-downs as more processing is done. – r2evans Jul 22 '20 at 18:50
  • I recognize that it's inefficient. That's why I asked for help. Your answer appears to assume that I can get all of the processing done without storing the new rows at intermediate steps, get all of the rows stored up in a list, and then just append the list with `do.call`. Is that what you mean? I should append the new rows to a list instead of the initial blank dataframe and then just append the entire list to the dataframe at once? – THill3 Jul 22 '20 at 18:56
  • Yes, that's what I mean. Are you able to continue the calculation without having added the previous iteration's new row to the frame? – r2evans Jul 22 '20 at 19:08
  • I would think that I have to process a single row into the 5 rows, append all 5 of those to the df of new data, process the next row (into 5 rows), append, and so forth for each of the 200,000 rows. Is appending to a list faster and more efficient than appending to a df? Should I do the following: get the row to process, process it into 5 new rows, append all five rows to a list, iterate over all of the original rows, and then add the new (1,000,000-element) list to the dataframe? – THill3 Jul 22 '20 at 19:23
  • Can you provide a miniature, reproducible example for the type of task you are doing? It would help people give you more specific feedback (and understand the problem better). – Andrew Jul 22 '20 at 19:31
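For illustration, here is a minimal sketch of the pattern suggested in the comments: build each batch of new rows, store it in a pre-sized list, and do a single rbind at the end. `aggdf` and `processRow()` are hypothetical placeholders (using the same `period`/`value` columns as the sketch above), not the actual data or code:

# hypothetical stand-in for the real parse/calculate step:
# takes one aggregated row and returns a small data.frame of new rows
processRow <- function(row) {
    bounds <- as.integer(strsplit(as.character(row$period), "-")[[1]])
    years  <- seq(bounds[1], bounds[2])
    data.frame(year = years, value = row$value / length(years))
}

# pre-size a list with one slot per input row; filling a slot by index
# does not copy everything the way growing a data.frame does
pieces <- vector("list", nrow(aggdf))
for (i in seq_len(nrow(aggdf))) {
    pieces[[i]] <- processRow(aggdf[i, ])
    if (i %% 200 == 0) print(i)   # progress without touching a growing frame
}

# one rbind at the end instead of hundreds of thousands of incremental ones
result <- do.call(rbind, pieces)

Filling list slots by index and binding once avoids the repeated full copies that make the incremental rbind approach slow.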

0 Answers