
I am a relative beginner to R trying to load and explore a large (7GB) CSV file.

It's from the Open Food Facts database and the file is downloadable here: https://world.openfoodfacts.org/data (the raw csv link).

It's too large to read straight into R and my searching has made me think the sqldf package could be useful. But when I try and read the file in with this code ...

library(sqldf)
library(here)

read.csv.sql(here("02. Data", "en.openfoodfacts.org.products.csv"), sep = "\t")

I get this error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 10 did not have 196 elements

Searching around made me think it's because there are missing values in the data. With read.csv, it looks like you can set fill = TRUE and get around this. But I can't work out how to do this with the read.csv.sql function. I also can't actually open the csv in Excel to inspect it because it's too large.

Does anyone know how to solve this or if there is a better method for reading in this large file? Please keep in mind I don't really know how to use SQL or other database tools, mostly just R (but can try and learn the basics if helpful).

Jay Bee
  • Try data.table's `fread` function. – Edward Dec 23 '22 at 03:25
  • Thank you, I tried this but got the following error: `Error in fread(here("02. Data", "en.openfoodfacts.org.products.csv"), : Opened 6.970GB (7484470456 bytes) file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available.` – Jay Bee Dec 23 '22 at 03:29
  • 1
    With a 7GB CSV file, there is nothing you can do to read it into your R instance. I suggest some alternatives for working with data that large, assuming that you're really only looking at a (significantly smaller) subset of it at any one time: [`arrow`](https://cran.r-project.org/web/packages/arrow), [`disk.frame`](https://cran.r-project.org/web/packages/disk.frame), or load into a database (e.g., monolithic like [`duckdb`](https://cran.r-project.org/web/packages/duckdb) or [`RSQLite`](https://cran.r-project.org/web/packages/RSQLite), or a DBMS such as PostgreSQL, MariaDB, or SQL Server). – r2evans Dec 23 '22 at 16:05
  • 1
    However, since parquet (arrow) files cannot be easily concatenated (without loading into memory), your better bet might be to use one of the DBMS options. (Though once in duckdb, one can [export to parquet](https://duckdb.org/docs/guides/import/parquet_export.html) directly.) – r2evans Dec 23 '22 at 16:08

1 Answer


Based on the error message, it seems unlikely that you can read the CSV file in toto into memory, even once. I suggest that, for analyzing the data within it, you may need to change your data-access approach to something else, such as:

  • A DBMS, whether monolithic/file-based (duckdb or RSQLite, lower cost of entry) or a full client-server DBMS (e.g., PostgreSQL, MariaDB, SQL Server). With this method, you would connect (using DBI) to the database (monolithic or otherwise), query for the subset of data you want/need, and work on that data. It is feasible to do in-database aggregation as well, which might be a necessary step in your analysis. See the duckdb sketch below this list.
  • An Arrow parquet dataset. These are directly supported by dplyr functions in a lazy fashion, meaning that when you call open_dataset("path/to/my.parquet"), it immediately returns an object but does not load data; you then build your dplyr mutate/filter/select/summarize pipeline (with some limitations), and only when you finally call ... %>% collect() is the resulting data loaded into memory. Similar to SQL above in that you work on subsets at a time, but if you're already familiar with dplyr, it is a much smaller step than learning SQL from scratch. See the arrow sketch below this list.

There are ways to get a large CSV file into each of these; rough sketches of both follow.
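First, a minimal duckdb sketch, assuming the CSV sits in your working directory; the database file name (off.duckdb), the table name (products), and the quote = '' setting (i.e., assuming the export does not quote fields) are my assumptions, not something from your post. DuckDB streams the file from disk, so the 7GB never has to fit into R's memory.

library(DBI)
library(duckdb)

# Create (or reopen) an on-disk duckdb database.
con <- dbConnect(duckdb::duckdb(), dbdir = "off.duckdb")

# Let duckdb ingest the tab-separated file itself; adjust or drop quote = ''
# if duckdb reports parse errors.
dbExecute(con, "
  CREATE TABLE products AS
  SELECT * FROM read_csv_auto('en.openfoodfacts.org.products.csv',
                              delim = '\t', quote = '')
")

# Pull only the subset you need into R.
first_rows <- dbGetQuery(con, "SELECT * FROM products LIMIT 10")

dbDisconnect(con, shutdown = TRUE)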
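And a minimal arrow sketch: a one-time conversion of the CSV to a parquet dataset, then a lazy dplyr query against it. The output directory (off_parquet) and the column names used in filter()/select() are assumptions for illustration only; if the same malformed-line problem shows up here, arrow's CSV parse options can be tuned when opening the file.

library(arrow)
library(dplyr)

# open_dataset() only inspects the schema; it does not read the data in.
csv_ds <- open_dataset("en.openfoodfacts.org.products.csv", format = "tsv")

# One-time conversion: write the data back out as a parquet dataset
# without loading the whole file into memory at once.
write_dataset(csv_ds, "off_parquet", format = "parquet")

# From then on, query the parquet dataset lazily and collect() only the
# subset you need into memory.
open_dataset("off_parquet") %>%
  filter(countries == "France") %>%            # assumed column name
  select(code, product_name, countries) %>%    # assumed column names
  collect()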

r2evans