
I have a working loop that reads text files (unknown names) from different folders (known locations), updates columns in those files, and saves them again under the same name in the same folder:

folders <- c(1, 2, 3)

for (i in seq_along(folders)) {
  dt <- df[df$id == folders[i], ]
  dt$id <- NULL
  loc <- file.path("data", folders[i])
  # List files with their full paths instead of calling setwd():
  # setwd() with a relative path breaks on the second iteration,
  # since "data/2" is resolved from inside "data/1".
  file.names <- list.files(loc, pattern = "\\.txt$", all.files = FALSE,
                           full.names = TRUE, recursive = FALSE,
                           ignore.case = FALSE)

  for (j in seq_along(file.names)) {
    # header = TRUE so the "matched" column is named and available
    # for the merge (with header = FALSE the columns are V1, V2, ...)
    text <- read.csv(file.names[j], header = TRUE, stringsAsFactors = FALSE)

    text2 <- merge(text, dt, by = "matched", all.x = TRUE)
    write.table(text2, file.names[j], sep = ",", na = "",
                row.names = FALSE, quote = TRUE, col.names = TRUE)
    rm(text, text2)
    print(j)
  }
}

There are two problems I'm facing: first, it is very slow; second, it uses too much RAM/memory. I tried to fix this myself, but I don't know much about R. I have read that it is possible to increase speed by writing functions, and that "if [we] simply initialize a vector (with NAs, zeros or any other value) with the total length and then run the loop, we can drastically increase the speed of our algorithm". I wish I could do something like that myself.

Janjua

1 Answer


Addressing speed

I would recommend adding some print (or timing) statements throughout your loops to see where the time is actually being spent.

Addressing memory (and speed)

You're using base R functions. Try using `dplyr::left_join()` instead of `merge()` (dplyr runs C++ under the hood and can be up to ~100 times faster than the equivalent base R functions). You could also try some data.table functions.
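As a sketch of the swap, here is `left_join()` applied to toy data standing in for one text file and the lookup table `dt` (the data frames and column values are made up for illustration; only the `"matched"` join column is taken from the question):

```r
library(dplyr)

# Toy stand-ins for one file's contents and the per-folder lookup table
text <- data.frame(matched = c("a", "b", "c"), value = 1:3,
                   stringsAsFactors = FALSE)
dt   <- data.frame(matched = c("a", "c"), extra = c(10, 30),
                   stringsAsFactors = FALSE)

# left_join() keeps every row of `text`, like merge(text, dt, all.x = TRUE),
# filling NA where there is no match in `dt`
text2 <- left_join(text, dt, by = "matched")
```

Rows of `text` with no match in `dt` (here `"b"`) get `NA` in the joined columns, matching the `all.x = TRUE` behaviour of the original `merge()` call.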

rm() removes the objects, but doesn't actually free up the memory. Calling gc() after rm() will free up most of the memory that the now removed objects occupied. So try placing gc() after calling rm().
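The pattern looks something like this (a minimal sketch with throwaway data; the loop body stands in for the per-file work in the question):

```r
n <- 4
for (j in 1:n) {
  x <- rnorm(1e5)  # stand-in for the per-file data frames
  rm(x)            # drop the reference to the object
  gc()             # ask R to actually reclaim the memory now
}
```

If calling `gc()` on every iteration proves too slow, it can be gated, e.g. `if (j %% 20 == 0) gc()`.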

Regarding vector preallocation

Vector preallocation helps because it reduces the need for R to make copies of vectors when combining them. That doesn't appear to be a problem here (as your code doesn't combine vectors).
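For completeness, here is a minimal illustration of the preallocation idea the question quotes (toy values; both loops compute the same result, but the second avoids repeatedly copying a growing vector):

```r
n <- 1e4

# Growing a vector inside a loop forces R to copy it repeatedly
grow <- c()
for (i in 1:n) grow[i] <- i * 2

# Preallocating to the final length avoids those copies
prealloc <- numeric(n)
for (i in 1:n) prealloc[i] <- i * 2
```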

stevec
  • tried gc() after rm(); it's taking more time as it pauses to clear up after every text file. I think I should try to find some other way – Janjua Jan 29 '20 at 18:45
  • @Janjua use judgement as to how often to call `gc()`. How about every 20th iteration? You can use some code like `for(i in 1:100) { if(i %% 20 == 0) gc() }` – stevec Jan 30 '20 at 01:55