
Imagine I have a huge matrix (toy example):

dat <- matrix(1:100,ncol = 10) 
colnames(dat) <- paste0("X",1:ncol(dat))
dat
      X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
 [1,]  1 11 21 31 41 51 61 71 81  91
 [2,]  2 12 22 32 42 52 62 72 82  92
 [3,]  3 13 23 33 43 53 63 73 83  93
 [4,]  4 14 24 34 44 54 64 74 84  94
 [5,]  5 15 25 35 45 55 65 75 85  95
 [6,]  6 16 26 36 46 56 66 76 86  96
 [7,]  7 17 27 37 47 57 67 77 87  97
 [8,]  8 18 28 38 48 58 68 78 88  98
 [9,]  9 19 29 39 49 59 69 79 89  99
[10,] 10 20 30 40 50 60 70 80 90 100

This matrix is split into groups of columns and written to several files:

path <- paste0(getwd(),"/example/")
dir.create(path = path)
colum_id <- matrix(1:10,ncol = 2,byrow = T)
for(i in 1:nrow(colum_id)) write.csv(dat[,colum_id[i,]], file = paste0(path,i,".csv"), row.names = F)

The files in the folder, dir(path):

[1] "1.csv" "2.csv" "3.csv" "4.csv" "5.csv"

... and read back in:

paths <- paste0(path,dir(path))
lapply(paths, \(Path) head(read.csv(Path), 2))

[[1]]
  X1 X2
1  1 11
2  2 12

[[2]]
  X3 X4
1 21 31
2 22 32

[[3]]
  X5 X6
1 41 51
2 42 52

[[4]]
  X7 X8
1 61 71
2 62 72

[[5]]
  X9 X10
1 81  91
2 82  92

The question is: how can I combine all these files column by column in R, without loading them completely into memory (because the files are very large), so that the result is the whole combined file on disk?

mr.T
  • a couple of questions: 1. are you wanting the output to be a csv file, and 2. if yes, then what is the purpose of a csv file you can't load? – Mark Aug 11 '23 at 10:17
  • This does seem like a case where a small (SQL?) database might do the job. – Paul Stafford Allen Aug 11 '23 at 10:24
  • @Mark 1) desirable but not required 2) I can load it with `arrow` for example – mr.T Aug 11 '23 at 10:29
  • Without R, if you have access to bash try: `paste -d "," *.csv > newFile.csv` – zx8754 Aug 11 '23 at 10:36
  • @Paul Stafford Allen can you show an example here please? – mr.T Aug 11 '23 at 10:41
  • @zx8754 I don't have access to `bash`; I'm interested in a solution from `R` – mr.T Aug 11 '23 at 10:43
  • @zx8754 you could probably rig together something with system2() if you wanted? :-) would just have to check OS – Mark Aug 12 '23 at 01:30
  • @Mark yes, task is better suited outside R. Or within R, detect system and use "if else" to run within system2 that is compatible with OS. – zx8754 Aug 14 '23 at 08:10
  • @mr.T are you asking, 1) how to do this without loading **all** individual input files (e.g. `1.csv`, `2.csv`) into memory all at the same time, or 2) how to do this without loading **any** individual input file into memory at one time? – SamR Aug 14 '23 at 13:29
  • @SamR I prefer option 2) – mr.T Aug 14 '23 at 14:06

1 Answer


Appending a column means writing to the middle of a csv file

Appending columns makes sense in a columnar data format such as SQL or arrow. The suggestion in the comments to use SQL is a good one and here is an answer using RSQLite which does this.
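
As a rough illustration of that route (a sketch only, not the linked answer itself): RSQLite's dbWriteTable() can import a csv file by path, so each file can go into its own table without first building a full data frame in R, and the tables can then be stitched together column-wise on their implicit rowid. The table names and the combined.sqlite file below are made up for this example.

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "./example/combined.sqlite")

# Import each csv into its own table, straight from disk
for (i in seq_along(paths)) {
    dbWriteTable(con, paste0("t", i), paths[i], sep = ",", header = TRUE, overwrite = TRUE)
}

# Combine column-wise, matching rows on their implicit rowid
dbExecute(con, "
    CREATE TABLE combined AS
    SELECT t1.*, t2.*, t3.*, t4.*, t5.*
    FROM t1
    JOIN t2 ON t2.rowid = t1.rowid
    JOIN t3 ON t3.rowid = t1.rowid
    JOIN t4 ON t4.rowid = t1.rowid
    JOIN t5 ON t5.rowid = t1.rowid
    ORDER BY t1.rowid
")

dbGetQuery(con, "SELECT * FROM combined LIMIT 2")
dbDisconnect(con)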

In a row-based format like csv, by contrast, appending a column essentially means writing to the middle of a plain text file: you have to read the file from (at best) somewhere on the first line onwards, and re-write everything after that point.

Append rows to a csv file instead

The best way to write a large plain text file, without loading the entire contents into RAM, is one line at a time. In your case, that means:

  1. Read one row at a time from each of the input files.
  2. cbind() each piece of the row to make a complete row.
  3. Write the row out as either (in the case of the first row) the first line of a new text file, or (for subsequent rows) a line appended to an existing text file.

This would simply be:

outfile <- "./example/big_out_file.csv"
N_ROWS <- 11  # 10 data rows plus the header
for (i in seq(0, N_ROWS - 1)) {
    # Read line i from each file (the header line when i = 0) and bind the
    # pieces into one complete row
    line <- do.call(
        cbind,
        lapply(paths, \(f) read.table(f, sep = ",", skip = i, nrows = 1))
    )
    # Append the row to the output file (created on the first iteration)
    write.table(line, outfile, col.names = FALSE, append = TRUE, sep = ",", row.names = FALSE)
}

The key is setting append = TRUE in write.table(), which appends one line at a time to the existing text file (or creates the file where it does not exist).

I set N_ROWS to 11 here because we know the input had 10 rows plus a header. If you do not know how many lines there are, you can check without reading a file into memory by using fpeek::peek_count_lines() on the first file, i.e.:

N_ROWS  <- fpeek::peek_count_lines(paths[1])

Or, if you prefer, there are other approaches in the answers to this question.

This approach has the advantage of not loading any of the input files into memory at once, which you indicated in the comments was your preference. It reads one line of each input file at a time. This means there are a lot more read operations, so it will be much slower than reading each file once, but that's the trade-off. Depending on your memory constraints, you may be able to speed things up by reading more than one line at a time. If you do this, make sure to read and write the header separately to avoid the following rows being coerced into character.
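
For illustration, here is a sketch of that chunked variant (not part of the answer above; CHUNK, outfile2 and the other names are made up for the example). The header is read and written once, then blocks of data rows are appended:

outfile2 <- "./example/big_out_file_chunked.csv"
CHUNK    <- 4                                      # rows per read; tune to available memory
N_DATA   <- fpeek::peek_count_lines(paths[1]) - 1  # data rows, excluding the header

# Read and write the header once, so the data rows below are not coerced to character
header <- do.call(cbind, lapply(paths, \(f) read.table(f, sep = ",", nrows = 1)))
write.table(header, outfile2, col.names = FALSE, row.names = FALSE, sep = ",")

# Append up to CHUNK data rows at a time
for (start in seq(1, N_DATA, by = CHUNK)) {
    n     <- min(CHUNK, N_DATA - start + 1)
    block <- do.call(
        cbind,
        lapply(paths, \(f) read.table(f, sep = ",", skip = start, nrows = n))
    )
    write.table(block, outfile2, col.names = FALSE, append = TRUE, sep = ",", row.names = FALSE)
}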

Output

We can check that, when read back in, the output is the same as dat, the original matrix:

m <- as.matrix(read.csv(outfile))
head(m, 2)
#      X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# [1,]  1 11 21 31 41 51 61 71 81  91
# [2,]  2 12 22 32 42 52 62 72 82  92
identical(dat, m) 
# [1] TRUE
SamR
  • Great answer, thanks! can you take a look at a similar question about `arrow` https://stackoverflow.com/questions/76868729/how-to-read-out-of-memory-a-large-matrix-with-an-arrow-package-r – mr.T Aug 14 '23 at 16:06
  • @mr.T why not just `read_csv_arrow(outfile, as_data_frame = FALSE)` on the file we have created here? – SamR Aug 15 '23 at 12:55
  • I was wondering if it is possible to do this directly from the arrow – mr.T Aug 15 '23 at 18:46