Appending a column means writing to the middle of a `csv` file
Appending columns makes sense in a format that understands columns, such as a SQL database or Arrow. The suggestion in the comments to use SQL is a good one, and here is an answer using RSQLite which does this.
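To give a flavour of that idea, here is a minimal sketch (not the linked answer; the `row_id` key and the two-table query are illustrative assumptions): each input file becomes its own SQLite table, and "appending a column" becomes a join rather than a file rewrite.

```r
library(DBI) # assumes the DBI and RSQLite packages are installed

con <- dbConnect(RSQLite::SQLite(), "example.sqlite")

for (i in seq_along(paths)) {
    chunk <- read.csv(paths[i])          # one input file at a time
    chunk$row_id <- seq_len(nrow(chunk)) # shared key to join on
    dbWriteTable(con, paste0("chunk", i), chunk, overwrite = TRUE)
}

# "Appending a column" is now a join, not a rewrite of the whole file
res <- dbGetQuery(con, "SELECT * FROM chunk1 JOIN chunk2 USING (row_id)")
dbDisconnect(con)
```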
In a row-based format like `csv`, appending a column is essentially writing to the middle of a plain text file. This means loading the file from (at best) somewhere on the first line, and re-writing everything after that point.
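To see why, here is a sketch of the rewrite this forces (the file name and `new_col` values are made up for illustration): adding a single column means reading every line and writing every line back out.

```r
lines <- readLines("existing.csv")              # the whole existing file
new_col <- c("X11", seq_len(length(lines) - 1)) # illustrative new column
writeLines(paste(lines, new_col, sep = ","), "existing.csv") # rewrite every line
```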
Append rows to a `csv` file instead
The best way to write a large plain text file, without loading the entire contents into RAM, is one line at a time. In your case, that means:
- Read one row at a time from each of the input files.
- `cbind()` each piece of the row to make a complete row.
- Write the row out: for the first row, as the first line of a new text file; for subsequent rows, as a line appended to the existing file.
This would simply be:
```r
outfile <- "./example/big_out_file.csv"
N_ROWS <- 11 # 10 data rows plus the header

for (i in seq(0, N_ROWS - 1)) {
    # Read line i of every input file and bind the pieces into one row
    line <- do.call(
        cbind,
        lapply(paths, \(f) read.table(f, sep = ",", skip = i, nrows = 1))
    )
    # Append the completed row to the output file (created on the first pass)
    write.table(line, outfile, col.names = FALSE, append = TRUE, sep = ",", row.names = FALSE)
}
```
The key is passing `append = TRUE` to `write.table()`, to append one line at a time to the existing text file (or to create it where it does not exist).
I set `N_ROWS` to `11` here as we know the input had `10` rows and a header. If you do not know how many lines there are, you can check without reading the file into memory by using `fpeek::peek_count_lines()` to count the lines in the first chunk, e.g.:
```r
N_ROWS <- fpeek::peek_count_lines(paths[1])
```
Or, if you prefer, there are other approaches in the answers to this question.
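For example, one base-R alternative (a sketch; `count_lines()` is a hypothetical helper, not from any package) is to count lines in fixed-size batches, so the file is never fully in memory:

```r
count_lines <- function(path, batch_size = 10000L) {
    con <- file(path, open = "r")
    on.exit(close(con))
    n <- 0L
    repeat {
        batch <- readLines(con, n = batch_size)
        if (length(batch) == 0L) break
        n <- n + length(batch)
    }
    n
}

N_ROWS <- count_lines(paths[1])
```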
This approach has the advantage of never loading an entire input file into memory, which you indicated in the comments was your preference. It reads one line of each input file at a time. This means many more read operations, so it will be much slower than reading each file once, but that's the trade-off. Depending on your memory constraints, you may be able to speed things up by reading more than one line at a time. If you do this, make sure to read and write the header separately, to avoid the rows after it being coerced to `character`; see the sketch below.
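Here is a sketch of a chunked version (`CHUNK_SIZE` is illustrative): the header line is read and written first, then the data rows in blocks, so `read.table()` never sees the character header mixed in with the numeric rows.

```r
CHUNK_SIZE <- 5 # illustrative; tune to your memory budget

# Header first: the first line of each file, bound and written once
header <- do.call(
    cbind,
    lapply(paths, \(f) read.table(f, sep = ",", nrows = 1))
)
write.table(header, outfile, col.names = FALSE, row.names = FALSE, sep = ",")

# Then the data rows, CHUNK_SIZE lines at a time, skipping the header
for (i in seq(1, N_ROWS - 1, by = CHUNK_SIZE)) {
    chunk <- do.call(
        cbind,
        lapply(paths, \(f) read.table(f, sep = ",", skip = i, nrows = CHUNK_SIZE))
    )
    write.table(chunk, outfile, col.names = FALSE, append = TRUE, sep = ",", row.names = FALSE)
}
```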
Output
We can check that, when read back in, the output is the same as `dat`, the original matrix:
```r
m <- as.matrix(read.csv(outfile))
head(m, 2)
#      X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# [1,]  1 11 21 31 41 51 61 71 81  91
# [2,]  2 12 22 32 42 52 62 72 82  92

identical(dat, m)
# [1] TRUE
```