
I have a large (~18 GB) CSV file that I would like to read in chunks. Each chunk is processed (filtered) separately and the results are concatenated. Since I'm iterating through several chunks, I'm using the `skip` parameter of the `read_csv` function.

Here is an example of how one chunk is read:

library(readr)

# Size of each chunk to be read and processed
chunk_size <- 2000000
# In the full code, row_skip increases by chunk_size on each loop iteration.
row_skip <- 2000000  # initial value
# Column names of the CSV (placeholders)
col_names <- c("..", "..")
chunk <- read_csv("G:/../data.csv", skip = row_skip,
                  n_max = chunk_size, col_names = col_names)
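
For context, the full loop looks roughly like this (a minimal sketch: `process_chunk` is a hypothetical stand-in for the actual filtering, and the stopping condition is simplified):

library(readr)

results <- list()
row_skip <- 1  # skip the header row of the CSV
repeat {
  chunk <- read_csv("G:/../data.csv", skip = row_skip,
                    n_max = chunk_size, col_names = col_names)
  if (nrow(chunk) == 0) break                             # nothing left to read
  results[[length(results) + 1]] <- process_chunk(chunk)  # hypothetical filter
  row_skip <- row_skip + chunk_size                       # advance past the rows just read
}
final <- do.call(rbind, results)  # concatenate the filtered chunks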

My problem is that if the `skip` parameter is sufficiently large (roughly 500,000 or more), I get the following error:

Error: The size of the connection buffer (131072) was not large enough to fit a complete line: Increase it by setting Sys.setenv("VROOM_CONNECTION_SIZE")

I have already tried increasing it (`Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 10000)`), but the problem persists. Since the `skip` parameter has to increase as I iterate through the chunks, it will eventually cross the threshold that produces the error, regardless of the initial chunk size.

I thought it could be caused by insufficient RAM, but I can read a much larger initial chunk than the value of `skip` at which the error appears (e.g. `chunk_size = 5000000` works, while `row_skip = 2000000` already generates the error).

  • Okay, a GitHub [issue thread](https://github.com/r-lib/vroom/issues/364) helped with this. In short, `Sys.setenv("VROOM_CONNECTION_SIZE")` can't handle scientific notation, so choosing an appropriate chunk size (e.g. 2000001) and connection size (e.g. 5000001) solves the issue. – Littlefatkid Sep 17 '21 at 14:37
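
Per the linked issue, the environment variable's value is stored as a string, so if the number ends up rendered in scientific notation, vroom cannot parse it. A defensive workaround (the value below is illustrative) is to hand `Sys.setenv` a string, or to format the number explicitly:

# Pass a plain integer string so no numeric-to-string coercion can occur
Sys.setenv(VROOM_CONNECTION_SIZE = "1310720000")

# Equivalently, format a computed value without scientific notation
Sys.setenv(VROOM_CONNECTION_SIZE = format(131072 * 10000, scientific = FALSE))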

0 Answers