The error is likely because your system does not have sufficient memory (RAM) to process the file. I have 64GB of RAM; it took well over 6 minutes to read in the file, and the resulting object is over 13GB in R (the original file is over 20GB in size, uncompressed). I received no error.
Since it appears you cannot read that entire dataset into RAM at one time, I suggest you would benefit from alternative methods of getting at your data. None of these allows you to read all of the data into R: nothing you can do will allow that. (Technically, you can use select= to load a subset of the columns, but that assumes there are "large memory" columns you know you'll never use.)
Load into a database. One could use a "big server" DBMS such as Postgres, MariaDB, or SQL Server, but doing that involves a lot more than "just R" and is outside the scope of this question. However, one might be able to use RSQLite or DuckDB (single-file databases). For instance, in the shell (not R):
SQLite: https://www.sqlitetutorial.net/sqlite-import-csv/
$ sqlite3 US_TradeMatrix.sqlite3
SQLite version 3.40.1 2022-12-28 14:03:47
Enter ".help" for usage hints.
sqlite> .mode csv
sqlite> .import US_TradeMatrix_Part5_ST202306191712_v1.csv TradeMatrix
DuckDB: https://duckdb.org/docs/data/csv/overview.html
(Note that the leading D is the prompt string, not typed in by the user.)
$ duckdb US_TradeMatrix.duckdb
v0.8.1 6536a77232
Enter ".help" for usage hints.
D CREATE TABLE TradeMatrix (Year INT, YearLabel INT, Economy INT, EconomyLabel VARCHAR, Partner VARCHAR, PartnerLabel VARCHAR, Flow INT, FlowLabel VARCHAR, SitcRev3Product VARCHAR, SitcRev3ProductLabel VARCHAR, USdollars FLOAT, Footnote VARCHAR);
D COPY TradeMatrix FROM 'US_TradeMatrix_Part5_ST202306191712_v1.csv' (AUTO_DETECT TRUE);
I made guesses at the table schema; please verify it before importing.
From here, one would use DBI and either RSQLite or duckdb to fashion SQL statements that reduce the data (and optionally do other things) before attempting to bring it into R.
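As a minimal sketch of that DBI workflow, here it runs against a tiny in-memory stand-in table so it is self-contained; with the real data you would connect to the US_TradeMatrix.sqlite3 file created above instead, and the column names here are my guesses from the schema, so adjust them to yours:

```r
library(DBI)
# stand-in for DBI::dbConnect(RSQLite::SQLite(), "US_TradeMatrix.sqlite3")
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "TradeMatrix",
  data.frame(Year = c(2016L, 2016L), Partner = "World",
             Flow = c(1L, 1L), USdollars = c(100, 250)))
# reduce the data inside the database so only the small result reaches R
res <- DBI::dbGetQuery(con, "
  SELECT Year, Partner, SUM(USdollars) AS total_usd
  FROM TradeMatrix
  GROUP BY Year, Partner")
res
#   Year Partner total_usd
# 1 2016   World       350
DBI::dbDisconnect(con)
```

The point is that the GROUP BY (or a WHERE filter) happens in SQLite, not in R, so R only ever holds the reduced result.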
Lazy operations using duckdb on the CSV file itself, still requiring SQL. (Note that this might not be as fast as importing into a duckdb file as in #1 above. I haven't benchmarked it, but I cannot imagine random access on CSV files comparing well with indexed database efficiency.)
duck <- DBI::dbConnect(duckdb::duckdb())
DBI::dbGetQuery(duck, "select * from 'US_TradeMatrix_Part5_ST202306191712_v1.csv' limit 10")
# Year Year Label Economy Economy Label Partner Partner Label Flow Flow Label SitcRev3Product SitcRev3Product Label US dollars at current prices in thousands US dollars at current prices in thousands Footnote
# 1 2016 2016 0000 World 0000 World 01 Imports 0 Food and live animals 1068138072 <NA>
# 2 2016 2016 0000 World 0000 World 01 Imports 00 Live animals other than animals of division 03 21120296 <NA>
# 3 2016 2016 0000 World 0000 World 01 Imports 001 Live animals other than animals of division 03 21120296 <NA>
# 4 2016 2016 0000 World 0000 World 01 Imports 01 Meat and meat preparations 128581301 <NA>
# 5 2016 2016 0000 World 0000 World 01 Imports 011 Meat of bovine animals, fresh, chilled or frozen 39233046 <NA>
# 6 2016 2016 0000 World 0000 World 01 Imports 012 Other meat and edible meat offal 66138000 <NA>
# 7 2016 2016 0000 World 0000 World 01 Imports 016 Meat, edible meat offal, salted, dried; flours, meals 4135049 <NA>
# 8 2016 2016 0000 World 0000 World 01 Imports 017 Meat, edible meat offal, prepared, preserved, n.e.s. 19075207 <NA>
# 9 2016 2016 0000 World 0000 World 01 Imports 02 Dairy products and birds' eggs 76369447 <NA>
# 10 2016 2016 0000 World 0000 World 01 Imports 022 Milk, cream and milk products (excluding butter, cheese) 37031756 <NA>
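Beyond the select * ... limit 10 sanity check above, the same connection can aggregate directly against the CSV. A self-contained sketch using a small stand-in CSV (substitute your real filename; the column names are illustrative):

```r
library(DBI)
# a tiny stand-in CSV so this runs as-is
csv <- tempfile(fileext = ".csv")
write.csv(data.frame(Year = c(2016L, 2016L, 2017L),
                     USdollars = c(10, 20, 5)),
          csv, row.names = FALSE)
duck <- DBI::dbConnect(duckdb::duckdb())
# duckdb aggregates while scanning the file, so only the summarized
# result is ever materialized in R's memory
res <- DBI::dbGetQuery(duck, sprintf(
  "SELECT Year, SUM(USdollars) AS total_usd FROM '%s'
   GROUP BY Year ORDER BY Year", csv))
res
#   Year total_usd
# 1 2016        30
# 2 2017         5
DBI::dbDisconnect(duck, shutdown = TRUE)
```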
Lazy operations with arrow, using dplyr as the language:
tradematrix <- arrow::open_csv_dataset("US_TradeMatrix_Part5_ST202306191712_v1.csv")
tradematrix %>%
head(n=10) %>%
collect()
# # A tibble: 10 × 12
# Year `Year Label` Economy `Economy Label` Partner `Partner Label` Flow `Flow Label` SitcRev3Product `SitcRev3Product Label` `US dollars at current prices in thousands` US dollars at current prices…¹
# <int> <int> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr> <dbl> <???>
# 1 2016 2016 0 World 0000 World 1 Imports 0 Food and live animals 1068138072. .
# 2 2016 2016 0 World 0000 World 1 Imports 00 Live animals other than animals of division 03 21120296. .
# 3 2016 2016 0 World 0000 World 1 Imports 001 Live animals other than animals of division 03 21120296. .
# 4 2016 2016 0 World 0000 World 1 Imports 01 Meat and meat preparations 128581301. .
# 5 2016 2016 0 World 0000 World 1 Imports 011 Meat of bovine animals, fresh, chilled or frozen 39233046. .
# 6 2016 2016 0 World 0000 World 1 Imports 012 Other meat and edible meat offal 66138000. .
# 7 2016 2016 0 World 0000 World 1 Imports 016 Meat, edible meat offal, salted, dried; flours, meals 4135049. .
# 8 2016 2016 0 World 0000 World 1 Imports 017 Meat, edible meat offal, prepared, preserved, n.e.s. 19075207. .
# 9 2016 2016 0 World 0000 World 1 Imports 02 Dairy products and birds' eggs 76369447. .
# 10 2016 2016 0 World 0000 World 1 Imports 022 Milk, cream and milk products (excluding butter, cheese) 37031756. .
# # ℹ abbreviated name: ¹`US dollars at current prices in thousands Footnote`
This method supports most of dplyr's verbs, and many base R functions within mutate and summarize.
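To illustrate, a self-contained sketch that filters and summarizes before collect(); it writes a small stand-in CSV so it runs as-is (point open_csv_dataset at your real file, and note the column names here are invented for the demo):

```r
library(dplyr)
library(arrow)
# tiny stand-in CSV for the demonstration
csv <- tempfile(fileext = ".csv")
write.csv(data.frame(Flow = c("Imports", "Exports", "Imports"),
                     USdollars = c(10, 20, 5)),
          csv, row.names = FALSE)
ds <- arrow::open_csv_dataset(csv)
# filter()/summarize() are translated to Arrow compute and evaluated
# lazily; collect() materializes only the (already reduced) result in R
res <- ds %>%
  filter(Flow == "Imports") %>%
  summarize(total_usd = sum(USdollars)) %>%
  collect()
res
# # A tibble: 1 × 1
#   total_usd
#       <dbl>
# 1        15
```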