
I have a csv dataset of 2M instances and about 1300 variables.

I tried different ways of importing the dataset in R on different platforms, but none worked: read.csv, read_csv and fread, in both RStudio and Visual Studio Code.

My laptop has 7.9 GB of RAM, which is entirely occupied (100%) while the import runs (no other windows are open nor are any programs running, besides Task Manager to keep an eye on the RAM, CPU and disks).

The one that gets the furthest is read_csv (on both platforms): it runs until about 66%, but then it stops without throwing any error, it just stops. It stays stuck like that for a few hours; I tried this multiple times.

[Screenshot: read_csv halting at ~66% in Visual Studio Code]
[Screenshot: RStudio while running (sorry for the quality, couldn't use prt sc)]
[Screenshot: RStudio when the process seems to have halted]

In RStudio I cannot load or open the csv file; if I try, it crashes. In Visual Studio Code I can open the csv file and look at the different variables. I selected the ones I need for my research and tried to select only those during import (I believed the selection happens before importing, so it would take significantly fewer resources and less time, but I am not sure anymore at this moment):

my_data <- read_csv("G:/2005-2019_FinancialData_excl_D&GR_EDIT.csv", col_select = variables_to_use_2)

Why does my laptop stop running? Is my RAM too small? How can I resolve this issue?

Robin
  • The way to do this is to read the file in smaller pieces. `read.csv` has arguments `skip` and `nrows` to implement this. Depending on what you're doing, you may be able to put the pieces together after reading, or you may have to process them separately and combine the results at the end (a chunked-reading sketch follows the comments). – user2554330 Apr 11 '23 at 16:10
  • You could try [awk](https://stackoverflow.com/a/33438149/6574038) on the command line. – jay.sf Apr 11 '23 at 16:14
  • Thank you for your quick answer. If I load the csv in RStudio and View the dataset, I get 1600 instances out of 2M, with all variables. Is there a more efficient way to create a smaller dataset of all 2M instances and only 169 (specified) variables instead of 1300? If I had to split, I would have to iterate about 1250 times (1600 of 2,000,000 rows the first time => 1250 × 1600 instances = 2M). – Robin Apr 11 '23 at 16:16
  • More specifically, can I select certain columns to import instead of rows? `ncols` gives the error 'unused argument'. – Robin Apr 11 '23 at 16:20
    *"and only 169 (specified) variables instead"* this is a huge time saver. `fread` and `read_csv` will both let you specify the columns you want and skip the rest. See the R-FAQ on [Read only selected columns?](https://stackoverflow.com/q/5788117/903061). – Gregor Thomas Apr 11 '23 at 16:20
  • Indeed, as I also mentioned in the Q above, I tried this and even then I got the issue. Would it help to try e.g. 30 vars at a time instead of all 169 immediately? – Robin Apr 11 '23 at 16:23
  • Also check the `vroom` package ... – Ben Bolker Apr 11 '23 at 16:29
  • @jay.sf how do I apply awk? I cannot get a grasp of the code in the Q's where it is suggested. – Robin Apr 11 '23 at 16:29
  • Try the `col_types` argument. You can specify exact column types, which avoids guessing and will read the data faster. You can also set some column types to '-' or '_' to skip them (a sketch follows the comments). Read the doc for `read_csv`. – yuk Apr 11 '23 at 16:39
  • @BenBolker I believe since version 2.0 `readr` uses `vroom` under the hood. – Gregor Thomas Apr 11 '23 at 16:43
  • 169 isn't **that** many columns. I would think limiting rows would be more natural. Does it work if you set `n_max = 100`? How about `n_max = 1000`? How about `n_max = 5e5`? Start with 100 (or even 10) to make sure your column names and syntax all work (a sketch of this probing follows the comments). – Gregor Thomas Apr 11 '23 at 16:46
  • @yuk can I only mention the 169 cols? Or do I need to mention all 1300 and specify the types? – Robin Apr 11 '23 at 17:06
  • @GregorThomas I tried your suggestion; it works up to 1400. `n_max = 1500` ends the task and gives: Warning message: One or more parsing issues, call `problems()` on your data frame for details, e.g.: `dat <- vroom(...); problems(dat)`. The code I used is: `data_firms <- read_csv("G:/2005-2019_FinancialData_excl_D&GR_EDIT.csv", col_select = all_of(var_firms), n_max = 1500, show_col_types = FALSE)` – Robin Apr 11 '23 at 17:10
  • Check this Q: https://stackoverflow.com/q/31150351/163080. – yuk Apr 11 '23 at 17:20
  • Sounds like your data has irregularities between lines 1400 and 1500. I would suggest [using a command line tool to pull those lines into a separate file](https://stackoverflow.com/q/12182910/903061) and examining them manually / focusing on finding a syntax that avoids the problem on that small example (an R sketch of this follows the comments). – Gregor Thomas Apr 11 '23 at 17:43
  • @Robin https://www.howtogeek.com/562941/how-to-use-the-awk-command-on-linux/ There are also versions around for other OS'es. Cheers! – jay.sf Apr 12 '23 at 02:54
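A minimal sketch of user2554330's chunking suggestion with base R's `skip`/`nrows`, assuming the file has a header row and using the path from the question; `chunk_size` and the kept column names (`var1`, `var2`) are hypothetical and would be tuned to your RAM and research variables:

```r
# Read the CSV in pieces and process each piece, keeping only what you need
# (keeping all 2M full rows would defeat the purpose of chunking).
path <- "G:/2005-2019_FinancialData_excl_D&GR_EDIT.csv"
chunk_size <- 100000                        # hypothetical; tune to your RAM
header <- names(read.csv(path, nrows = 1))  # grab the column names once
skip <- 1                                   # start past the header row
results <- list()
repeat {
  chunk <- tryCatch(
    read.csv(path, skip = skip, nrows = chunk_size,
             header = FALSE, col.names = header),
    error = function(e) NULL)               # read.csv errors past end of file
  if (is.null(chunk) || nrow(chunk) == 0) break
  # reduce each chunk before storing it, e.g. keep only the wanted columns
  results[[length(results) + 1]] <- chunk[, c("var1", "var2")]  # hypothetical
  if (nrow(chunk) < chunk_size) break       # last (partial) chunk
  skip <- skip + chunk_size
}
my_data <- do.call(rbind, results)          # combine the reduced pieces
```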
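A minimal sketch of the column-selection suggestion from the comments, assuming `variables_to_use_2` is the character vector of 169 column names from the question:

```r
library(data.table)
library(readr)
path <- "G:/2005-2019_FinancialData_excl_D&GR_EDIT.csv"

# data.table::fread: select= takes column names (or positions) and never
# materializes the other ~1130 columns
dt <- fread(path, select = variables_to_use_2)

# readr::read_csv: col_select= with tidyselect's all_of() does the same
tbl <- read_csv(path, col_select = tidyselect::all_of(variables_to_use_2))
```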
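A minimal sketch of yuk's `col_types` suggestion. In readr's compact spec string there is one character per column ('?' guess, 'd' double, 'c' character, '-' or '_' skip); with ~1300 columns you would build the string from the header rather than by hand, which also answers the follow-up question: you specify all columns, but most of them as skips:

```r
library(readr)
path <- "G:/2005-2019_FinancialData_excl_D&GR_EDIT.csv"

# read only the header row to get the column names in file order
hdr <- names(read_csv(path, n_max = 0, show_col_types = FALSE))

# guess the type of the 169 wanted columns, skip everything else
spec <- paste(ifelse(hdr %in% variables_to_use_2, "?", "-"), collapse = "")
dat <- read_csv(path, col_types = spec)
```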
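A minimal sketch of the stepwise `n_max` probing suggested in the comments, assuming `var_firms` is the vector of wanted column names used there; `problems()` on the result shows where parsing broke:

```r
library(readr)
path <- "G:/2005-2019_FinancialData_excl_D&GR_EDIT.csv"

# increase n_max until the read breaks to localize the bad rows
for (n in c(100, 1000, 1500, 5e5)) {
  dat <- read_csv(path, col_select = tidyselect::all_of(var_firms),
                  n_max = n, show_col_types = FALSE)
  cat("n_max =", n, "-> rows read:", nrow(dat), "\n")
}
problems(dat)   # parsing issues reported for the last attempt
```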
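A minimal sketch of pulling the suspect lines into a small file for manual inspection, done from R rather than the command line; the 1400-1500 range comes from the comments, and `suspect_rows.csv` is a hypothetical output name:

```r
library(readr)
path <- "G:/2005-2019_FinancialData_excl_D&GR_EDIT.csv"

hdr <- read_lines(path, n_max = 1)                 # header line
bad <- read_lines(path, skip = 1400, n_max = 101)  # data rows ~1400-1500
write_lines(c(hdr, bad), "suspect_rows.csv")       # small file to inspect

# rough field count per line (naive: ignores commas inside quoted fields);
# rows whose count differs from the header's are the likely culprits
table(lengths(strsplit(bad, ",", fixed = TRUE)))
```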

0 Answers