
I'm trying to wrangle around 7,000 USR files into a long data format.

I've written the code below, but it takes over 2 hours to run (hence the progress printer).

Does anyone have any idea how I can speed up this code? Are there specific lines that are slowing it down?

Thanks in advance!

for(i in D_flows){
  flow <- read.table(i, header = F, fill = T, sep = "|") 
  for(j in flow){  
    Flow_name <- i
    Timestamp <- ymd_hms(flow[flow$V1 == "ZHV",8])
    Date <- ymd(flow[flow$V1 == "ZPD",2])
    SR <- as.vector(flow[flow$V1 == "ZPD",3])
    SP <- as.integer(as.vector(flow[flow$V1 == "SE1",2]))
    EV <- as.numeric(as.character(flow[flow$V1 == "SE1" , 4]))
    
    Flow_data <- tibble(Flow_name, Timestamp, Date, SR, SP, EV)
    Flow_data <- Flow_data[complete.cases(Flow_data),]
    Flow_data <- Flow_data %>% 
      group_by(SP) %>% 
      mutate(MEV = sum(EV)) %>%
      select(Flow_name, Timestamp, Date, SR, SP, MEV) %>%
      unique() %>%
      ungroup()
  } 
  #Append the flow data to the D Flow data file
  D_flow_data <- bind_rows(D_flow_data, Flow_data)
  #Shows the progress of the for loop
  progress <- D_flow_data %>% 
    select(-Timestamp, -Date, -SR, -SP, -MEV) %>% 
    unique()
  print(nrow(progress))  
}
– Grgwizard
  • for loops are slow! So is row binding. So I'd convert your outer for loop to an lapply and then row-bind the results of that (so you rbind once rather than once per element of D_flows). I'm not sure what a for loop over a data frame does, but I don't see a reference to j inside the loop, so you seem to be repeating the same code, unchanged, a number of times that depends on the size of the table you've just read. – Limey Mar 08 '21 at 14:50
  • If you provide some sample data and the desired output we will be able to help you with some code that should be more efficient than what you are showing... The part @Limey pointed out about not using j in the second loop is a bit confusing, so having some input and desired output will greatly help to frame your problem and supply a solution – DPH Mar 08 '21 at 15:10
  • besides the for loop itself and the binding (growing a vector, which is known to be slow in R), possibly read.table is your critical bottleneck as it is known not to be fast. A faster way would be the data.table::fread() function for example (have a read here: https://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes) – DPH Mar 08 '21 at 15:17
  • last but not least, you could do parallel processing since all files seem to be independent (you need no info from prior files to process the next one) - this is a bit more complex programming-wise but could yield a significant reduction in processing time depending on the size of your files and the parallel processing setup itself – DPH Mar 08 '21 at 15:36
  • @Limey `for` loops are not slow! What you do inside the loop is often slow. There is a pretty large overhead in calling a closure and usually many such calls are made with a `for` loop. – Roland Mar 08 '21 at 15:56
  • I suggest you first import all files and rbind the data.frames (I would use `lapply` with `data.table::fread` followed by `data.table::rbindlist`). You could then do everything else in the loop at once (possibly by adding another grouping level according to the file). – Roland Mar 08 '21 at 16:00
  • Relevant concerning speed of for-loops: https://stackoverflow.com/a/7144801/6574038 – jay.sf Mar 08 '21 at 16:28
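
A minimal sketch of the read-once-per-file, bind-once approach the comments above suggest, combining Limey's lapply idea with DPH's data.table::fread() for the reading. The helper name read_flow is made up here, and the sketch assumes D_flows is a character vector of file paths and that every file contains the ZHV, ZPD and SE1 rows and columns used below:

library(dplyr)
library(lubridate)
library(data.table)

# Hypothetical helper: reshape one USR file into a long-format tibble
read_flow <- function(path) {
  flow <- fread(path, header = FALSE, sep = "|", fill = TRUE)
  out <- tibble(
    Flow_name = path,
    Timestamp = ymd_hms(flow[V1 == "ZHV"]$V8),
    Date      = ymd(flow[V1 == "ZPD"]$V2),
    SR        = flow[V1 == "ZPD"]$V3,
    SP        = as.integer(flow[V1 == "SE1"]$V2),
    EV        = as.numeric(flow[V1 == "SE1"]$V4)
  )
  out <- out[complete.cases(out), ]
  # Grouping by all identifier columns replaces the mutate(MEV) + unique() pattern
  out %>%
    group_by(Flow_name, Timestamp, Date, SR, SP) %>%
    summarise(MEV = sum(EV), .groups = "drop")
}

# Read all ~7,000 files, then bind the results once instead of once per file
D_flow_data <- bind_rows(lapply(D_flows, read_flow))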
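
A sketch along the lines of Roland's second comment: read everything first, stack it once with rbindlist (keeping the file as an id column), then do the reshaping in one grouped pass per record type. The intermediate names (flow_list, raw, zhv, zpd, se1) are illustrative only:

library(data.table)
library(lubridate)

# Read every file once and stack the results, recording which file each row came from
flow_list <- lapply(D_flows, fread, header = FALSE, sep = "|", fill = TRUE)
names(flow_list) <- D_flows
raw <- rbindlist(flow_list, fill = TRUE, idcol = "Flow_name")

# One grouped pass per record type instead of one pass per file
zhv <- raw[V1 == "ZHV", .(Timestamp = ymd_hms(V8)), by = Flow_name]
zpd <- raw[V1 == "ZPD", .(Date = ymd(V2), SR = V3), by = Flow_name]
se1 <- raw[V1 == "SE1", .(MEV = sum(as.numeric(V4))),
           by = .(Flow_name, SP = as.integer(V2))]

# Join the per-file header information onto the SE1 summaries
D_flow_data <- zhv[zpd, on = "Flow_name"][se1, on = "Flow_name"]

The joins assume one ZHV row and one ZPD row per file, as the question's code already does.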
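
Finally, a sketch of DPH's parallel suggestion layered on top of the first helper, using the base parallel package. read_flow is the hypothetical helper from the first sketch; mclapply relies on forking, so on Windows you would set up a cluster and use parLapply instead:

library(parallel)
library(dplyr)

# Each file is independent, so files can be processed on separate cores
n_cores <- max(1L, detectCores() - 1L)
flow_list <- mclapply(D_flows, read_flow, mc.cores = n_cores)
D_flow_data <- bind_rows(flow_list)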

0 Answers