
Using RStudio, I am trying to read in the Gene_expression_matrix.csv file from the Allen Brain Institute, but the file is too large even for computers with large amounts of RAM (I have access to, and have tried it on, a laptop with 64 GB of RAM and a computer with 384 GB of RAM). Has anyone accessed this file, or one of a similar size? Thanks!

I'm using this code:

Gene_expression_matrix <- read.csv("Gene_expression_matrix.csv")

The error message I receive is:

Error: cannot allocate vector of size 3.9 Mb
Val
  • What is the error message you're getting? Also look at the `disk.frame` package, which is intended for on-disk data analysis. It does limit certain actions, such as regressions, so you should probably read through the package documentation on their [webpage](https://diskframe.com/) as well. – Oliver Aug 09 '20 at 18:42
  • @Oliver The error message is: "Error: cannot allocate vector of size 3.9 Mb". I'll look into the disk.frame package. – Val Aug 09 '20 at 19:42
  • 1
    Try reading subsets with `data.table::fread(..., nrows=...)` and see how the `object.size()` scales with the number of rows. Also, search for "High performance task view CRAN" for tools for working with large and out-of-memory data. Is the file compressed on disk? It's not shocking that it should expand to 5 times its size when reading into R ... – Ben Bolker Aug 09 '20 at 19:54
  • Are you on Windows or Linux? What is your `memory.limit()` if on Windows? – Oliver Aug 09 '20 at 20:01
  • @Oliver I'm using Windows, and my memory.limit() is 65276. – Val Aug 09 '20 at 20:11
  • Alright, the problem is that your [memory limit](https://stackoverflow.com/questions/1395229/increasing-or-decreasing-the-memory-available-to-r-processes) is not sufficient. Some measures can be taken to [increase it](https://stackoverflow.com/questions/1395229/increasing-or-decreasing-the-memory-available-to-r-processes) (a short sketch appears after these comments); I would suggest trying this in addition to using `disk.frame`. – Oliver Aug 09 '20 at 20:16
  • @BenBolker I gradually increased the number of rows by reading subsets. Here is my output: nrows=50: 12,008,848 bytes (~12 MB); nrows=1000: 130,109,048 bytes (~130 MB); nrows=10000: 1,248,953,048 bytes (~1.25 GB); nrows=100000: R encountered a fatal error and the session was terminated. – Val Aug 09 '20 at 20:22
  • @BenBolker By setting my memory limit to memory.limit(size = 5e+6), I was able to run fread and read the table into RStudio! However, further analysis (converting the table into a matrix, in this case) crashed my RStudio. In general, I'm wondering how I could work with the table in RStudio given its size. Although I can read it into RStudio, wouldn't it still be too big to do anything with? Thanks! – Val Aug 09 '20 at 22:29
  • This is where the suggestions about checking out the [high-performance task view](https://cran.r-project.org/view=HighPerformanceComputing) (especially the "Large memory and out-of-memory data" section) come into play. Or the `disk.frame` package recommended by @Oliver above. – Ben Bolker Aug 09 '20 at 22:56
  • @Val, adding to @BenBolker, what you are experiencing is that many actions create copies of your data. Your data seems to barely fit into memory, so this causes R to crash. Life gets complicated when data gets too large to fit in memory, and that is why packages like `disk.frame` exist (and the others suggested by @BenBolker). At that point it often becomes necessary to stay within the large-data package's environment, e.g. you will want to stay with a `disk.frame` (or whatever alternative you choose) and use the methods implemented specifically for these objects. – Oliver Aug 10 '20 at 05:53
  • @BenBolker @Oliver Thank you both for the suggestions! I worked through parts of the disk.frame package according to the link, but even the part for large .csv files did not work because of similar errors ("unable to allocate vector of size __ Mb"). I will continue to look into the high-performance task view link to see what options are available. Would using packages (such as `disk.frame`) or databases allow you to use functions from other packages (specifically, monocle3) on the data from the .csv files? Thanks again! – Val Aug 10 '20 at 18:32
  • It depends. `disk.frame` does say somewhere that it allows use of general functions, I think. This is a moderately challenging task to debug remotely. Did you follow https://diskframe.com/articles/ingesting-data.html and use `csv_to_disk.frame()`? (sorry, `csv_to_disk.frame(path, in_chunk_size = 1e6)`)? – Ben Bolker Aug 10 '20 at 19:29
  • @BenBolker I followed `csv_to_disk.frame(path, in_chunk_size = 1e6)`, but I cannot find the directory with the .df file created, so I'm not sure how to access it. Here is the output when I try to access it from my command prompt: `C:\Users\myname\AppData\Local\Temp\Rtmp4gOaRj>file301c29034ba6.df 'file301c29034ba6.df' is not recognized as an internal or external command, operable program or batch file.` How would you access the path given as the output in RStudio? The code did seem to run, though - thanks! – Val Aug 11 '20 at 14:36
  • @BenBolker I also tried "Load the individual files", "Load one large-file (splitting)", and "Load one large-file (no splitting)" sections here https://diskframe.com/articles/more-epic.html, but these all resulted in "Error: cannot allocate vector of size" 4.6 Mb, 4.6 Mb, and 38.1 Mb, respectively. – Val Aug 11 '20 at 14:41
  • I'm sorry, but I don't think it's going to be practical to do all of this remotely/on StackOverflow. I'd strongly recommend that you try to get some more local help ... – Ben Bolker Aug 11 '20 at 14:43
  • @BenBolker Sounds good - I understand that it's hard to troubleshoot remotely. Thanks for all your help! – Val Aug 11 '20 at 15:02
  • Good luck. Quick thought about your comment above (about following `csv_to_disk.frame`): you should save the result of `csv_to_disk.frame` as an object (e.g. `df <- csv_to_disk.frame(...)`), then read the disk.frame docs about how to work with that object -- you shouldn't need to find the files on disk ... – Ben Bolker Aug 11 '20 at 16:36
  • @BenBolker Thank you! I will try that and let you know how it goes. – Val Aug 11 '20 at 20:38
  • @Val `df <- csv_to_disk.frame(path, in_chunk_size = 1e6)` and then typing `df` into the console will show you the path. You should use `df <- csv_to_disk.frame(path, in_chunk_size = 1e6, outdir = "c:/where/i/want/the/diskframe.df/")` so you know exactly where it is. Otherwise, it will be stored in the temp directory. – xiaodai Aug 22 '20 at 05:53
  • @Val is the file publicly accessible? I'd like to try it. – xiaodai Aug 22 '20 at 05:55
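
A minimal sketch of the subset-size check suggested in the comments above (the row counts below are only illustrative; the file name is taken from the question):

library(data.table)

# Read progressively larger subsets and report how much memory each one
# occupies, to extrapolate how big the full table would be in RAM.
for (n in c(50, 1000, 10000)) {
  subset_dt <- fread("Gene_expression_matrix.csv", nrows = n)
  cat("nrows =", n, ": ")
  print(object.size(subset_dt), units = "Mb")
  rm(subset_dt)
  gc()  # release the subset before reading the next, larger one
}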
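
And a short sketch of the memory-limit adjustment discussed above. Note that `memory.limit()` is Windows-only and was removed in R 4.2.0, so it only applies to the R version in use in these comments; the size is given in MB, and raising it mainly lets R spill over into the page file rather than adding physical RAM:

memory.limit()            # report the current limit in MB (65276 in the comments above)
memory.limit(size = 5e6)  # raise the limit before re-running fread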

2 Answers


You can use disk.frame like this

library(disk.frame)
setup_disk.frame()

Gene_expression_matrix.df <- csv_to_disk.frame(
   "Gene_expression_matrix.csv",
   outdir = "c:/this/is/where/the/output/is" # specify a path for where you want to save the file
)

If the above fails, then try to limit the amount you read at a time by specifying `in_chunk_size`, which reads only `in_chunk_size` rows at a time to limit RAM usage. E.g.

Gene_expression_matrix.df <- csv_to_disk.frame(
   "Gene_expression_matrix.csv",
   outdir = "c:/this/is/where/the/output/is", # specify a path for where you want to save the file
   in_chunk_size = 1e7 # read 10 million rows at a time; adjust down if it still runs out of RAM
)

Once the data is loaded, you can use dplyr verbs and some common functions to look at your data; see the quick-start guide at https://diskframe.com/.

For example

head(Gene_expression_matrix.df)
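
Beyond head(), a minimal sketch of the dplyr-style workflow (the column names gene and value are placeholders; substitute columns that actually exist in your file):

library(dplyr)

# Verbs are applied lazily, chunk by chunk, on disk; only the final result
# is brought into RAM by collect().
result <- Gene_expression_matrix.df %>%
  select(gene, value) %>%   # placeholder column names
  filter(value > 0) %>%
  collect()

nrow(result)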

I am sure {disk.frame} can help in this case, as it is designed for exactly this! If you run into issues, please raise a ticket on the disk.frame issue tracker and I will help you.

xiaodai
  • would disk frame outperform datatable with a 50GB file? – Cauder Sep 08 '20 at 20:04
  • @Cauder if the data can fit in RAM, then no. But disk.frame enables what is otherwise not possible, e.g. if your data doesn't even fit in RAM. – xiaodai Sep 09 '20 at 00:42
  • That's sweet. Would you mind taking a glance at this question? https://stackoverflow.com/questions/63782007/how-do-count-unique-entities-with-disk-frame-in-r – Cauder Sep 09 '20 at 00:43

Try this library:

library('data.table')
Gene_expression_matrix <- fread("Gene_expression_matrix.csv")

It is much faster than read.csv.
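
If even fread runs out of memory on the full file, it can also load just part of it; the select and nrows arguments limit what is read (the column names below are placeholders):

library(data.table)

# Read only the columns and rows you actually need, to keep RAM usage down.
subset_dt <- fread(
  "Gene_expression_matrix.csv",
  select = c("gene_symbol", "sample_1"),  # placeholder column names
  nrows  = 100000                         # cap the number of rows read
)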

Reza
  • It did finish running earlier, but I got this error message: Error: cannot allocate vector of size 4.6 Mb – Val Aug 09 '20 at 19:37