I have run some analyses on my simulated data and generated around 100,000 datasets (dataSize). For each dataset I want to extract two data items (dat1 & dat2) from file1 and one data item (dat3) from file2, and then combine all of them into a single data frame tab_out. In file1, each subject occupies two consecutive rows of the DAT column (dat1, then dat2); in file2, dat3 is taken from the first row matching each subject ID.
Each dataset has a different sample size, but the estimated total sample size across the 100,000 datasets is somewhere below 10,000,000 (subjectCountTotal).
Below is sample code as a reproducible example:
path <- "*REDACTED*"
dataSize <- 100
subjectCountTotal <- 10200
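#Preallocate the full output once; its rows are filled in place inside the loop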
tab_out <- data.frame(dataID=integer(subjectCountTotal),
                      ID=integer(subjectCountTotal),
                      dat1=double(subjectCountTotal),
                      dat2=double(subjectCountTotal),
                      dat3=double(subjectCountTotal))
count <- 0
for(dataID in 1:dataSize) {
  #subdir name determination (datasets are stored 100 per subdirectory, e.g. "000001-000100")
  if((dataID-1)%%100==0) {
    subdir <- paste(sprintf("%06d", dataID), "-", sprintf("%06d", dataID+99), sep="")
    setwd(paste(path, subdir, sep = "/"))
  }
  #file names
  file1_name <- paste("file1_", sprintf("%06d", dataID), sep="")
  file2_name <- paste("file2_", sprintf("%06d", dataID), sep="")
  #Read files
  file1 <- read.table(file1_name, skip=1, header=TRUE)
  file2 <- read.table(file2_name, skip=1, header=TRUE)
  sample_size <- max(file2$ID) #Find sample size of the dataset
  #Extracting dat1 & dat2
  dat12 <- data.frame(dataID=integer(sample_size),
                      ID=integer(sample_size),
                      dat1=double(sample_size),
                      dat2=double(sample_size))
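  #file1 holds two consecutive DAT rows per subject: row 2*i-1 is dat1, row 2*i is dat2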
  for(i in 1:sample_size) {
    dat12[i, "dataID"] <- dataID
    dat12[i, "ID"] <- i
    dat12[i, "dat1"] <- file1[2*i-1, "DAT"]
    dat12[i, "dat2"] <- file1[2*i, "DAT"]
  }
  #Extracting dat3
  dat3 <- double(sample_size)
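  #take DAT3 from the first row of file2 whose ID matches i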
  for(i in 1:sample_size) {
    dat3[i] <- file2[which(file2$ID==i)[1], "DAT3"]
  }
  #Combining dat into output data frame
  tab_out[(count+1):(count+sample_size), 1:4] <- dat12[1:sample_size, 1:4]
  tab_out[(count+1):(count+sample_size), 5] <- dat3
  #Advance the row offset for the next dataset
  count <- count + sample_size
  #Progress prompt
  if(dataID%%100==0 || dataID==dataSize) {
    cat(paste("\n", dataID, "/", dataSize, sep=""))
  }
}
Here is a package for replicating the process: reproducible example with source code
I am new to R and have just escaped from the 2nd circle of Hell (if I learned it correctly...). Since preallocating tab_out, the extraction no longer slows down over time, but the code above is still estimated to take about 5 hours to finish on my PC.
I am wondering if there are still ways to speed it up.
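One direction I have been considering is to drop dat12 and the two inner loops entirely and write into tab_out with vectorized indexing instead; an untested sketch, assuming the same layout as above (two consecutive DAT rows per subject in file1, first matching row per ID in file2):
  #Inside the dataID loop, replacing both inner loops (untested sketch)
  rows <- (count+1):(count+sample_size)
  tab_out[rows, "dataID"] <- dataID
  tab_out[rows, "ID"] <- 1:sample_size
  tab_out[rows, "dat1"] <- file1[seq(1, 2*sample_size, by=2), "DAT"] #odd rows
  tab_out[rows, "dat2"] <- file1[seq(2, 2*sample_size, by=2), "DAT"] #even rows
  #match() gives the first position of each ID, same as which(file2$ID==i)[1]
  tab_out[rows, "dat3"] <- file2[match(1:sample_size, file2$ID), "DAT3"]
If something like that is correct, I would expect read.table to dominate the remaining run time; I have read that supplying colClasses to read.table can speed it up further, but I have not tried that yet.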
Thanks!