
I want to load a 3.96 GB tab-separated value (TSV) file into R, and I have 8 GB of RAM in my system. How can I load this file into R to do some manipulation on it?

I tried library(data.table) to load my data, but I got this error message: (Error: cannot allocate vector of size 965.7 Mb)

I also tried fread with the code below, but it didn't work either: it took a long time and eventually showed an error.

as.data.frame(fread("file name"))
  • I think you're going to have a really hard time dealing with that much data on your system. Several factors: (1) how much data you can just fit in memory depends on what kind of data it is: logical and integer are fairly small, numeric is generally double, character varies depending on the length of the strings. (2) Once you load it, what do you plan to actually *do* with it? Some operations in R are copy-on-write, meaning your memory requirements are much larger. If it's a frame-like object, the `data.table` package tends to be memory-frugal (e.g., `fread`), but still ... – r2evans Mar 23 '19 at 06:00
  • It contains integer and numeric values, but I'm still not sure how to load my data. – xyz Mar 23 '19 at 06:04
  • Ultimately, *we can't know*, since we aren't there. As an example, I have a 584MB csv here (woefully smaller) that I've loaded with `data.table::fread`, and it takes 335MB sitting in memory (I've seen a worse ratio of on-disk to in-memory), not bad. Depending on the functions I'm using, the actual memory required to operate on this data ranges from an *additional* 300MB to well over 600MB more, depending on if I intentionally or accidentally keep copies of the data sitting around. – r2evans Mar 23 '19 at 06:16
  • Additionally, your OS configuration can change things, too. What else is running? What OS? Do you have virtual memory configured? Though it might be feasible to make something work here (I can't tell from what we know), you are close to the line of "big data", where my loose definition includes discussion about *"more data than I can manipulate on this computer"*. My computer at work has 16x the amount of RAM this laptop does, so its practical "big data" limit is much higher. – r2evans Mar 23 '19 at 06:18
  • In your code line you are missing the final parenthesis. – Oka Mar 23 '19 at 10:47
  • Note that `as.data.frame` will make a copy of the dataset, thus doubling the size. Here is where you run out of memory. If you really want to work with a data.frame rather than a data.table, you should use `setDF` instead, as it will convert to a data.frame without the copy. As everyone has mentioned, you will have difficulty doing anything complicated with this data as a whole given the amount of RAM you have. – lmo Mar 23 '19 at 17:19
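
A tiny sketch of the setDF() route mentioned in the last comment; the file name is hypothetical:

    library(data.table)

    DT <- fread("data.tsv")   # hypothetical file name
    setDF(DT)                 # converts to a data.frame by reference, without copying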

3 Answers


If I were you, I probably would

1) try your fread code once more without the typo (closing parenthesis was initially missing):

as.data.frame(fread("file name"))

2) try to read the file in parts by specifying the number of rows to read. This can be done in both read.csv and fread with the nrows argument. By reading a small number of rows you can check and confirm that the file is actually readable before doing anything else (see the sketch after this list). Sometimes files are malformed; there can be special characters, wrong end-of-line characters, escaping issues, or something else that needs to be addressed first.

3) have a look at the bigmemory package, which has a read.big.matrix function. The ff package also has the desired functionality.
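
A minimal sketch of points 1) and 2), assuming a tab-separated file named "data.tsv" (hypothetical name); the chunk size and the placeholder processing are also assumptions:

    library(data.table)

    path <- "data.tsv"        # hypothetical file name; replace with your TSV

    # 1) confirm the file is readable at all by loading only a handful of rows
    head_rows <- fread(path, sep = "\t", nrows = 10)
    str(head_rows)

    # 2) read the file in chunks: grab the column names once, then skip the header
    col_names  <- names(fread(path, sep = "\t", nrows = 0))
    chunk_size <- 1e6
    skip       <- 1                       # skip the header line
    results    <- list()
    repeat {
      chunk <- fread(path, sep = "\t", header = FALSE, col.names = col_names,
                     skip = skip, nrows = chunk_size)
      # ... process 'chunk' here and keep only a small result per chunk ...
      results[[length(results) + 1]] <- nrow(chunk)   # placeholder processing
      skip <- skip + nrow(chunk)
      if (nrow(chunk) < chunk_size) break              # last (partial) chunk reached
    }
    # note: if the number of data rows is an exact multiple of chunk_size, the
    # next fread() call would hit end-of-file; wrap it in tryCatch() if needed.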

Alternatively, I would also try to think "outside the box": do I need all of the data in the file? If not, I could preprocess the file, for example with cut or awk, to remove unnecessary columns. Do I absolutely need to read it as one file and have all the data in memory at once? If not, I could split the file or maybe use readLines (see the sketch below).
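
For the readLines route, here is a rough sketch that streams the file block by block and keeps only some columns; the file name "data.tsv", the block size, and the kept column positions are all assumptions:

    con <- file("data.tsv", open = "r")       # hypothetical file name
    header <- readLines(con, n = 1)           # keep the header line
    block_size <- 100000L
    repeat {
      lines <- readLines(con, n = block_size)
      if (length(lines) == 0) break
      fields <- strsplit(lines, "\t", fixed = TRUE)
      kept   <- lapply(fields, `[`, c(1, 3))  # e.g. keep only columns 1 and 3
      # ... aggregate 'kept', or write it out to a smaller intermediate file ...
    }
    close(con)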

P.S. This topic is covered quite nicely in this post. P.P.S. Thanks to @Yuriy Barvinchenko for the comment on fread.

Oka

You are reading the data (which puts it in memory) and then storing it as a data.frame (which makes another copy). Instead, read it directly into a data.frame with

fread("file name", data.table=FALSE)

Also, it wouldn't hurt to run garbage collection.

gc()
G5W

From my experience, and in addition to @Oka's answer:

  1. fread() has an nrows= argument, so you can start by reading just the first 10 lines.
  2. If you find that you don't need all rows and/or all columns, you can put the row condition and the list of fields right after fread() with [] (see the sketch below).
  3. In many cases you can use a data.table as a data.frame, so try reading without as.data.frame().

This is how I worked with a 5 GB CSV file.
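
A rough illustration of the three points above; the file name and the columns id/value are hypothetical:

    library(data.table)

    # 1. peek at the first 10 rows to check the structure
    peek <- fread("data.tsv", nrows = 10)

    # 2. put the row condition and the list of fields right after fread()
    dt <- fread("data.tsv")[value > 0, .(id, value)]

    # 3. a data.table already behaves like a data.frame for most purposes,
    #    so there is usually no need for as.data.frame()
    summary(dt$value)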

Yuriy Barvinchenko
  • I was able to load the data, but the system was running really slow. – xyz Mar 25 '19 at 05:56
  • It's all about memory. You have two friends: data.table and rm() + gc(). data.table reuses the same address in memory when you modify a table in place, so you need less memory. If you don't need some data in the next steps, remove it from memory with rm() and run garbage collection with gc(). – Yuriy Barvinchenko Mar 25 '19 at 07:47
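
A small sketch of that rm() + gc() workflow with data.table's in-place modification; the file name and the columns col_a, col_b, and group are hypothetical:

    library(data.table)

    dt <- fread("data.tsv")           # hypothetical file name
    dt[, ratio := col_a / col_b]      # := modifies dt in place, no extra copy
    tmp <- dt[group == "A"]           # intermediate object
    # ... use 'tmp' ...
    rm(tmp)                           # drop objects you no longer need
    gc()                              # and trigger garbage collection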