37

I have a huge CSV file. Its size is around 9 GB and I have 16 GB of RAM. I followed the advice from that page and implemented it, as shown below.

If you get the error that R cannot allocate a vector of length x, close out of R and add the following line to the "Target" field: 
--max-vsize=500M 

I am still getting the error and warnings below. How can I read this 9 GB file into R? I am running 64-bit R 3.3.1, and the command below was run in RStudio 0.99.903 on Windows Server 2012 R2 Standard (64-bit OS).

> memory.limit()
[1] 16383
> answer=read.csv("C:/Users/a-vs/results_20160291.csv")
Error: cannot allocate vector of size 500.0 Mb
In addition: There were 12 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
2: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
3: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
4: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
5: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
6: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
7: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
8: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
9: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
10: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
11: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
12: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)

------------------- Update1

My first try, based on the suggested answer:

> thefile=fread("C:/Users/a-vs/results_20160291.csv", header = T)
Read 44099243 rows and 36 (of 36) columns from 9.399 GB file in 00:13:34
Warning messages:
1: In fread("C:/Users/a-vsingh/results_tendo_20160201_20160215.csv",  :
  Reached total allocation of 16383Mb: see help(memory.size)
2: In fread("C:/Users/a-vsingh/results_tendo_20160201_20160215.csv",  :
  Reached total allocation of 16383Mb: see help(memory.size)

------------------- Update2

My second try, based on the suggested answer, is below:

thefile2 <- read.csv.ffdf(file="C:/Users/a-vs/results_20160291.csv", header=TRUE, VERBOSE=TRUE, 
+                    first.rows=-1, next.rows=50000, colClasses=NA)
read.table.ffdf 1..
Error: cannot allocate vector of size 125.0 Mb
In addition: There were 14 warnings (use warnings() to see them)

How can I read this file into a single object so that I can analyze the entire dataset in one go?

------------------- Update3

We bought an expensive machine with 10 cores and 256 GB of RAM. That is not the most efficient solution, but it will work at least for the near future. I looked at the answers below and I don't think they solve my problem :( I do appreciate them. I want to perform market basket analysis, and I don't think there is any way around keeping my data in RAM.

  • Possible duplicate of [Trimming a huge (3.5 GB) csv file to read into R](http://stackoverflow.com/questions/3094866/trimming-a-huge-3-5-gb-csv-file-to-read-into-r) – HFBrowning Jul 22 '16 at 22:46
  • Can you specify what you intend to do with the data; in particular, whether your first step would be to aggregate it or to use only some of the variables? `ff` is one solution, but its relevance depends on what you plan to do. Another option is to combine `ff` to read the data and then store it in a database -- in that regard you may be interested in MonetDB, available through the `MonetDBLite` package. – Eric Lecoutre Jul 26 '16 at 08:46
  • Please let us know the number of rows and columns in your file. – user1436187 Jul 27 '16 at 00:08
  • @EricLecoutre I am planning to explore the data. Once I plot it and understand it better, I might delete some rows and/or columns – user2543622 Jul 27 '16 at 16:46
  • @user1436187 36 columns and 47,368,186 rows... – user2543622 Jul 27 '16 at 16:48
  • Related: https://stackoverflow.com/questions/7327851/read-csv-is-extremely-slow-in-reading-csv-files-with-large-numbers-of-columns – Tung Dec 19 '19 at 15:43

5 Answers

32

Make sure you're using 64-bit R, not just 64-bit Windows, so that you can increase your RAM allocation to all 16 GB.
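A quick way to confirm from within the session that you really are on a 64-bit build:

R.version$arch            # "x86_64" for a 64-bit build of R on Windows
.Machine$sizeof.pointer   # 8 on 64-bit R, 4 on 32-bit R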

In addition, you can read in the file in chunks:

file_in    <- file("in.csv","r")
chunk_size <- 100000 # choose the best size for you
x          <- readLines(file_in, n=chunk_size)
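To walk through the entire file this way, wrap the read in a loop. A minimal sketch, where `process_chunk()` stands in for whatever you do with each piece (it is a hypothetical placeholder, not a real function):

file_in <- file("in.csv", "r")
header  <- readLines(file_in, n = 1)          # keep the header line separately
repeat {
  lines <- readLines(file_in, n = chunk_size)
  if (length(lines) == 0) break               # end of file reached
  chunk <- read.csv(text = c(header, lines))  # parse this chunk into a data.frame
  # process_chunk(chunk)                      # hypothetical: aggregate, filter, or write out results
}
close(file_in)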

You can use data.table to handle reading and manipulating large files more efficiently:

library(data.table)
DT <- fread("in.csv", header = TRUE)  # fread is much faster and more memory-efficient than read.csv

If needed, you can spill the data to disk with ff:

library("ff")
x <- read.csv.ffdf(file="file.csv", header=TRUE, VERBOSE=TRUE, 
                   first.rows=10000, next.rows=50000, colClasses=NA)
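Once the data is in an ffdf, most of it stays on disk; subsetting with `[` pulls just that slice into RAM as an ordinary data.frame. A small sketch, assuming default ffdf indexing behaviour:

first_rows <- x[1:100000, ]   # materialise the first 100,000 rows in RAM
summary(first_rows)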
Hack-R
  • I tried `thefile=fread("C:/Users/a-vs/results_20160291.csv", header = T)` and got a message that `Warning messages: 1: In fread("C:/Users/a-vs/results_20160291.csv", : Reached total allocation of 16383Mb: see help(memory.size)`. How could I read this file into a single object so that I can analyze the entire data in one go? – user2543622 Jul 22 '16 at 23:47
  • @user2543622 Use `ff`. But just for the record, chunking large files into pieces is standard practice in big data. Another option is to preprocess the data a bit in SQL first. Perhaps once you get it into R you can send some of it to a sparse matrix as well. – Hack-R Jul 23 '16 at 00:55
18

You might want to consider leveraging some on-disk processing instead of holding the entire object in R's memory. One option would be to store the data in a proper database and then have R access it. dplyr is able to deal with a remote source (it actually writes the SQL statements to query the database). I've just tested this with a small example (a mere 17,500 rows), but hopefully it scales up to your requirements.

Install SQLite

https://www.sqlite.org/download.html

Enter the data into a new SQLite database

  • Save the following in a new file named import.sql

CREATE TABLE tableName (COL1, COL2, COL3, COL4);
.separator ,
.import YOURDATA.csv tableName

Yes, you'll need to specify the column names yourself (I believe) but you can specify their types here too if you wish. This won't work if you have commas anywhere in your names/data, of course.

  • Import the data into the SQLite database via the command line

sqlite3.exe BIGDATA.sqlite3 < import.sql
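If you would rather stay inside R for the import, here is a rough sketch using DBI and RSQLite to load the CSV into the same database in chunks; it creates the table from the first chunk, so it replaces the import.sql step above. The 50,000-row chunk size and the column handling are assumptions, not part of the original recipe:

library(DBI)                                   # assumes the DBI and RSQLite packages are installed
con <- dbConnect(RSQLite::SQLite(), "BIGDATA.sqlite3")

csv    <- file("YOURDATA.csv", "r")
header <- scan(csv, what = character(), sep = ",", nlines = 1, quiet = TRUE)
first  <- TRUE

repeat {
  chunk <- tryCatch(
    read.csv(csv, header = FALSE, nrows = 50000, col.names = header),
    error = function(e) NULL)                  # read.csv errors once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  dbWriteTable(con, "tableName", chunk,
               append = !first, overwrite = first)  # first chunk creates the table
  first <- FALSE
}

close(csv)
dbDisconnect(con)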

Point dplyr to the SQLite database

As we're using SQLite, all of the dependencies are handled by dplyr already.

library(dplyr)
my_db  <- src_sqlite("/PATH/TO/YOUR/DB/BIGDATA.sqlite3", create = FALSE)
my_tbl <- tbl(my_db, "tableName")

Do your exploratory analysis

dplyr will write the SQLite commands needed to query this data source. It will otherwise behave like a local table. The big exception is that you can't query the number of rows (for example, nrow() on a remote table returns NA).

my_tbl %>% group_by(COL2) %>% summarise(meanVal = mean(COL3))

#>  Source:   query [?? x 2]
#>  Database: sqlite 3.8.6 [/PATH/TO/YOUR/DB/BIGDATA.sqlite3]
#>  
#>         COL2    meanVal
#>        <chr>      <dbl>
#>  1      1979   15.26476
#>  2      1980   16.09677
#>  3      1981   15.83936
#>  4      1982   14.47380
#>  5      1983   15.36479
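When you want a (much smaller) summarised result fully back in R, finish the pipeline with collect(), which runs the SQL and returns an ordinary local table; adding n() also gives a per-group row count. A short sketch based on the same hypothetical columns:

my_summary <- my_tbl %>%
  group_by(COL2) %>%
  summarise(n = n(), meanVal = mean(COL3)) %>%
  collect()                                   # executes the query and pulls the result into R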
Jonathan Carroll
  • This is an excellent suggestion! Thank you so much! I have been working with a very large csv file. I'll now follow your suggestion and try sqlite. Thanks! – Michel Mesquita Sep 19 '20 at 16:39
12

This may not be possible on your computer. In certain cases, data.table takes up more space than its .csv counterpart.

library(data.table)
DT <- data.table(x = sample(1:2, 10000000, replace = TRUE))
write.csv(DT, "test.csv")   # 29 MB file
DT <- fread("test.csv")
object.size(DT)
# 40001072 bytes (40 MB)

Two orders of magnitude larger:

DT <- data.table(x = sample(1:2, 1000000000, replace = TRUE))
write.csv(DT, "test.csv")   # 2.92 GB file
DT <- fread("test.csv")
object.size(DT)
# 4000001072 bytes (4.00 GB)

There is natural overhead to storing an object in R. Based on these numbers, there is roughly a 1.33x factor (R size relative to csv size) when reading files. However, this varies with the data. For example, using

  • x = sample(1:10000000,10000000,replace = T) gives a factor of roughly 2x (R:csv).

  • x = sample(c("foofoofoo","barbarbar"),10000000,replace = T) gives a factor of 0.5x (R:csv).

Based on the worst case, your 9 GB file could take 18 GB of memory or more to store in R. Judging from your error message, it is far more likely that you are hitting hard memory constraints rather than an allocation issue. Therefore, just reading your file in chunks and consolidating would not work; you would also need to partition your analysis and workflow. Another alternative is to move the data out of memory into a tool such as a SQL database.
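A rough way to gauge the likely in-R footprint before committing to a strategy is to read only a sample of rows and scale up; the 100,000-row sample here is an arbitrary choice and assumes the early rows are representative:

library(data.table)
sample_dt <- fread("C:/Users/a-vs/results_20160291.csv", nrows = 100000)
est_bytes <- as.numeric(object.size(sample_dt)) / 100000 * 44099243   # full file has ~44.1M rows
est_bytes / 1024^3                                                    # estimated size in GB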

Chris
2

This would be horrible practice, but depending on how you need to process this data, it shouldn't be too bad. You can change the maximum memory that R is allowed to use by calling memory.limit(new), where new is an integer giving R's new memory limit in MB. When you hit the hardware constraint, Windows will start paging memory onto the hard drive (not the worst thing in the world, but it will severely slow down your processing).

If you are running this on a server version of Windows, paging will possibly (likely) work differently than on regular Windows 10. I believe it should be faster, as the server OS should be optimized for this kind of workload.

Try starting off with something along the lines of 32 GB (or memory.limit(memory.limit()*2)), and if it needs much more than that, I would say the program will end up being too slow once the data is loaded into memory. At that point I would recommend buying more RAM or finding a way to process the data in parts.
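For reference, a minimal sketch of that call (Windows-only; memory.limit() exists in the R 3.3.1 used here, though it was removed in later R versions):

memory.limit()              # current limit in MB (16383 on this machine)
memory.limit(size = 32768)  # raise the cap to ~32 GB; beyond physical RAM, Windows pages to disk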

Adam
2

You could try splitting your processing over the table. Instead of operating on the whole thing, put the whole operation inside a for loop and do it 16, 32, 64, or however many times you need. Any values you need for later computation can be saved. This isn't as fast as the other approaches, but it will definitely finish.

n_chunks <- ceiling(number_of_rows_in_file / CHUNK_SIZE)
con <- file("in.csv", "r")                     # keep the connection open between reads
for (i in seq(from = 1, to = n_chunks, by = 1)) {
    chunk <- read.csv(con, nrows = CHUNK_SIZE, header = (i == 1), ...)
    # process `chunk` here and save any values you need later
}
close(con)

Hope that helps.

Woody1193