I have run some analyses on my simulated data and generated around 100,000 datasets (dataSize). What I want to do is extract two data items (dat1 & dat2) from file1 and one data item (dat3) from file2 for each dataset, and then combine all of them into a single data frame tab_out.

Each dataset has a different sample size, but the estimated total sample size across the 100,000 datasets is somewhere below 10,000,000 (subjectCountTotal).

Below is some sample code as a reproducible example:

path <- "*REDACTED*"

dataSize <- 100
subjectCountTotal <- 10200
tab_out <- data.frame(dataID=integer(subjectCountTotal),
                      ID=integer(subjectCountTotal),
                      dat1=double(subjectCountTotal),
                      dat2=double(subjectCountTotal),
                      dat3=double(subjectCountTotal))
count <- 0

for(dataID in 1:dataSize) {
  #subdir name determination
  if((dataID-1)%%100==0) {
    subdir <- paste(sprintf("%06d", dataID), "-", sprintf("%06d", dataID+99), sep="")
    setwd(paste(path, subdir, sep = "/"))
  }

  #file names
  file1_name <- sprintf("file1_%06d", dataID)
  file2_name <- sprintf("file2_%06d", dataID)

  #Read files
  file1 <- read.table(file1_name, skip=1, header=TRUE)
  file2 <- read.table(file2_name, skip=1, header=TRUE)

  sample_size <- max(file2$ID) #Find sample size of the dataset

  #Extracting dat1 & dat2
  dat12 <- data.frame(dataID=integer(sample_size),
                      ID=integer(sample_size),
                      dat1=double(sample_size),
                      dat2=double(sample_size)
                      )
  for(i in 1:sample_size) {
    dat12[i, "dataID"] <- dataID
    dat12[i, "ID"] <- i
    dat12[i, "dat1"] <- file1[2*i-1, "DAT"]
    dat12[i, "dat2"] <- file1[2*i, "DAT"]
  }

  #Extracting dat3
  dat3 <- double(sample_size)
  for(i in 1:sample_size) {
    dat3[i] <- file2[which(file2$ID==i)[1], "DAT3"]
  }

  #Combining dat into output data frame
  tab_out[(count+1):(count+sample_size), 1:4] <- dat12[1:sample_size, 1:4]
  tab_out[(count+1):(count+sample_size), 5] <- dat3

  #Assigning indices for next dataset
  count <- count + sample_size

  #Progress prompt
  if(dataID%%100==0 || dataID==dataSize) {
    cat(paste("\n", dataID, "/", dataSize, sep=""))
  }
}

Here is a package for replicating the process: reproducible example with source code

I am new to R and have just escaped from the 2nd circle of Hell (growing objects, if I have learnt The R Inferno correctly...). The data extraction process no longer slows down over time, but the code above is still estimated to take about 5 hours to finish on my PC.

I am wondering whether there are still ways to speed it up.
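
For reference, here is the direction I have been experimenting with: a sketch that builds one small data frame per dataset with vectorized indexing and combines everything once at the end, instead of writing row by row into the pre-allocated tab_out. It assumes, as in my real files, that file1 always has exactly two DAT rows per subject (dat1 on the odd rows, dat2 on the even rows) and that the first row per ID in file2 carries DAT3:

path <- "*REDACTED*"
dataSize <- 100
results <- vector("list", dataSize)   #one small data frame per dataset

for(dataID in 1:dataSize) {
  #subdir name determination, as before
  if((dataID-1)%%100==0) {
    subdir <- sprintf("%06d-%06d", dataID, dataID+99)
    setwd(paste(path, subdir, sep = "/"))
  }

  #Read files
  file1 <- read.table(sprintf("file1_%06d", dataID), skip=1, header=TRUE)
  file2 <- read.table(sprintf("file2_%06d", dataID), skip=1, header=TRUE)

  sample_size <- max(file2$ID)
  ids <- seq_len(sample_size)

  #Vectorized extraction: no per-subject loop
  results[[dataID]] <- data.frame(
    dataID = dataID,
    ID     = ids,
    dat1   = file1$DAT[2*ids - 1],             #odd rows of file1
    dat2   = file1$DAT[2*ids],                 #even rows of file1
    dat3   = file2$DAT3[match(ids, file2$ID)]  #first row per ID in file2
  )
}

tab_out <- do.call(rbind, results)  #single combine at the end

The idea is that the arithmetic indexing and match() replace the two inner loops (match(ids, file2$ID) gives the first matching row for each ID, just like which(file2$ID==i)[1]), and do.call(rbind, results) replaces the incremental row writes into tab_out. This is only a sketch and may need adjusting to the real file layout.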

Thanks!

Matthew Hui
  • A [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) would help, providing us a sample of `file1` and `file2` – Jake Kaupp Nov 01 '17 at 11:07
  • I have a reproducible example attached above. I have just noticed that the pre-defined size of the data frame does affect the speed of each loop... But what can be the workaround? – Matthew Hui Nov 01 '17 at 16:37
