
First question here, so I'll try to go straight to the point.

I'm currently working with tables and I've chosen R because it has no hard limit on data frame size and can perform many operations over the data in those tables. I'm happy with that, as I can manipulate the data at will: merges, concatenations, and row/column manipulation all work fine. But I recently had to run a loop, at roughly 0.00001 sec per instruction, over a 6-million-row table, and it took over an hour.

Maybe my approach in R was wrong to begin with. I've tried to look for the most efficient ways to run some operations (for example, pre-allocating a list and assigning into it instead of growing it with `c(list, new_element)`), but as far as I can tell this is not something you can optimize with an algorithm such as graphs or heaps (it's just tables; you have to iterate through all of it). So I was wondering whether there are other instructions or basic ways of working with tables that I don't know about (assign, extract, ...) that take less time, or some RStudio configuration that would improve performance.

This is the loop, just so if it helps to understand the question:

# %>% and pull() come from dplyr; the loop fills one element per row
my_list <- vector("list", nrow(table[, "Date_of_count"]))
for (i in 1:nrow(table[, "Date_of_count"])) {
  my_list[[i]] <- format(
    as.POSIXct(strptime(table[i, "Date_of_count"] %>% pull(1), "%Y-%m-%d")),
    format = "%Y-%m-%d"
  )
}

The table, as mentioned above, has over 6 million rows and 25 variables. Once the list is filled, I want to append it to the table as a new column.

Please let me know if the question lacks specificity or detail, or if it just does not belong here.

    Please provide an excerpt of your data frame. Do not paste it from the console, use `dput`. Help: https://stackoverflow.com/a/5963610/6574038 – jay.sf Oct 07 '20 at 10:14
  • If you are iterating over 6 million rows and the iterations are independent of each other, you are definitely doing it wrong. So, no, you are not using the most efficient R instructions. – Roland Oct 07 '20 at 10:24
  • A large part of the efficiency of R lies in its vectorization. But if you do a `for` loop, you will probably lose this property, because you calculate with the elements of the vectors. – jay.sf Oct 07 '20 at 10:26
  • `my_list <- format(as.POSIXct(strptime(table[, "Date_of_count"],"%Y-%m-%d")), format = "%Y-%m-%d")` you could do this with one line, but what's the point, converting character to POSIXlt then to POSIXct and then back to character? – minem Oct 07 '20 at 12:14

1 Answer


In order to improve performance (and to work with R and tables properly), the answer was a mixture of the first comments:

  • use vectors
  • avoid repeated conversions
  • if possible, avoid loops and apply functions directly to the whole vector/list
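
Putting those three points together, the vectorized version can be sketched as below. This is a minimal, self-contained sketch: the column name `Date_of_count` comes from the question, but the two-row sample data frame is made up here for illustration, since the real 6-million-row table is not available.

```r
# Tiny stand-in for the real table; only the relevant column is shown
df <- data.frame(Date_of_count = c("2020-10-07", "2020-10-08"),
                 stringsAsFactors = FALSE)

# as.Date() and format() are vectorized, so the whole column is
# converted in one call -- no per-row loop, no intermediate list
df$Date_formatted <- format(as.Date(df$Date_of_count, "%Y-%m-%d"), "%Y-%m-%d")
```

The result is appended directly as a column, which also removes the separate "fill a list, then bind it to the table" step from the question.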

I just converted the table (which, I realized, had some tibbles inside) into a data frame and followed the points above.

df <- as.data.frame(table)

In this case, after the conversion the dates were already plain character strings, so I did not have to apply any further conversions.

New execution time over the same 6 million rows: 25.25 sec.

  • If you want to further improve execution time and process your tables in a database-like way, I suggest you look at the `data.table` package. It is very efficient, as it uses indexes to accelerate scans and reduces memory usage by preventing useless copies. – Billy34 Oct 09 '20 at 12:29
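
For reference, the `data.table` approach mentioned in the comment could look like the sketch below. It assumes the `data.table` package is installed; the sample rows are made up, and `Date_formatted` is a hypothetical name for the new column.

```r
library(data.table)

# Tiny stand-in for the real table
dt <- data.table(Date_of_count = c("2020-10-07", "2020-10-08"))

# := adds the column by reference (no copy of the whole table);
# as.IDate() is data.table's integer-backed date class
dt[, Date_formatted := format(as.IDate(Date_of_count), "%Y-%m-%d")]
```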