I have 15 million CSV files, each with two columns (integer and float), and between 5 and 500 rows. Each file looks something like:
3453,0.034
31,0.031
567,0.456
...
Currently, I am iterating over all the files and using read.csv() to import each one into a big list. Here's a simplified version of my code:
allFileNames = Sys.glob(sprintf("%s/*/*/results/*/*", dir))

s = list()
s$scores = list()

for (i in 1:length(allFileNames)){
  if ((i %% 1000) == 0){
    cat(sprintf("%d of %d\n", i, length(allFileNames)))  # progress report
  }
  fileName = allFileNames[i]
  approachID = getApproachID(fileName)  # my own helpers, defined elsewhere
  bugID = getBugID(fileName)
  size = file.info(fileName)$size
  if (!is.na(size) && size > 0){  # make sure the file exists and is not empty
    tmp = read.csv(fileName, header=F, colClasses=c("integer", "numeric"))
    colnames(tmp) = c("fileCode", "score")
    s$scores[[approachID]][[bugID]] = tmp
  } else {
    # File does not exist, or is empty.
    s$scores[[approachID]][[bugID]] = matrix(-1, ncol=2, nrow=1)
  }
}
Later in my code, I go back through each matrix in the list, and calculate some metrics.
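For context, that later pass looks roughly like this; the loop variables and mean() are just stand-ins for the actual metrics I compute:

# hypothetical sketch of the later pass; mean() stands in for the real metrics
for (a in seq_along(s$scores)){
  for (b in seq_along(s$scores[[a]])){
    tmp = s$scores[[a]][[b]]
    metric = mean(tmp[, 2])  # column 2 holds the score
    # ... the real code computes and stores several metrics here ...
  }
}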
After starting this import process, it looks like it will take on the order of 3 to 5 days to complete (roughly 35-60 files per second over 15 million files). Is there a faster way to do this?
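For what it's worth, one variation I have been wondering about (untested; just a sketch) is swapping read.csv() for scan() inside the loop, on the assumption that scan() has less per-call overhead for tiny files:

tmp = scan(fileName, what=list(fileCode=integer(), score=numeric()),
           sep=",", quiet=TRUE)
# scan() returns a named list of two vectors; wrap it back into a data frame
s$scores[[approachID]][[bugID]] = as.data.frame(tmp)

I don't know whether that actually helps at this scale, which is part of what I'm asking.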
EDIT: I added more details about my code.