Can not read file in data.table?

Question

i downloaded a file from UNCTAD and unzip then i read using data.table package

https://unctadstat.unctad.org/7zip/US_TradeMatrix_Part5.csv.7z

temp <- data.table::fread("US_TradeMatrix_Part5_ST202209201450_v1.csv", 
      drop = c("US dollars at current prices in thousands Footnote", "Flow"), 
      check.names = TRUE, showProgress = TRUE)

I get the following error R character strings are limited to 2^31-1 bytes please advise.

Could be a separator while reading the csv file . Use sep= "|" (or one of . or , depending on the separator used in your file. Also check the comments here https://stackoverflow.com/questions/75438305/r-character-strings-are-limited-to-231-1-bytes-error-while-trying-to-load-txt for further suggestions — Imitation, Jul 17 '23 at 13:11
it could have something to do with it being a 2.6 GB csv file (just saying) — Mark, Jul 17 '23 at 13:47
@Imitation the delimiter is `,`, but I think `fread` can handle it — Mark, Jul 17 '23 at 14:01
@Mohamed what happens if you set the number of rows read to be smaller than all of them (i.e. running it with `nrows = 100`) ? — Mark, Jul 17 '23 at 14:02
if i use sep = "|", it give me 1 variable only, if i use sep = "," it give me 9 variables — Mohamed, Jul 17 '23 at 19:09
if i pass nrows = 1e4 and if i use sep = "|", it give me 1 variable only, if i use sep = "," it give me 9 variables. however i read the whole file with sep = "," or sep = "|", i get the same error — Mohamed, Jul 17 '23 at 19:25
Mohamed, `fread` works as-is without needing to change `sep=` (in fact, changing it will likely fail in other ways, as you're discovering). The error does not evidence for me (see my answer) which suggests either you have a corrupted `.7z` download (seems less likely) or you need more RAM in your system. Since on my system the read-in object was over 13GB, that suggests to me that your computer needs ***MORE THAN*** 16GB of RAM to read it, since the OS and other running processes on your computer will almost certainly want/need more than a few GBs. — r2evans, Jul 20 '23 at 21:55

r2evans · Answer 1 · 2023-07-19T11:58:15.240

The error is likely because your file does not have sufficient memory (RAM) to process it. I have 64GB of RAM, it took well over 6 minutes to read in the file, and the resulting object is over 13GB in R (the original file is over 20GB in size, uncompressed). I received no error.

Since it appears you cannot read that entire dataset into RAM at one time, I suggest you would benefit from alternative methods of getting at your data. None of these allow you to read all data into R: nothing you can do will allow that. (Technically, you can select= a subset of the columns to load in, but that assumes there are "large memory" columns you know you'll never use.)

Load into a database. One could use a "big server" DBMS such as postgres, mariadb, or sql server. Doing this is a lot more than "just R", outside the scope of this question. However, one might be able to use RSQLite or DuckDB (single-file databases). For instance, on the shell (not R):
- SQLite: https://www.sqlitetutorial.net/sqlite-import-csv/
```
$ sqlite3 US_TradeMatrix.sqlite3
SQLite version 3.40.1 2022-12-28 14:03:47
Enter ".help" for usage hints.
sqlite> .mode csv
sqlite> .import US_TradeMatrix_Part5_ST202306191712_v1.csv TradeMatrix
```
- DuckDB: https://duckdb.org/docs/data/csv/overview.html
  
  (Note that the leading D is the prompt string, not typed in by the user.)
```
$ duckdb US_TradeMatrix.duckdb
v0.8.1 6536a77232
Enter ".help" for usage hints.
D CREATE TABLE TradeMatrix (Year INT, YearLabel INT, Economy INT, EconomyLabel VARCHAR, Partner VARCHAR, ParnerLabel VARCHAR, Flow INT, FLowLabel VARCHAR, SitcRev3Product VARCHAR, SitcRev3ProductLabel VARCHAR, USdollars FLOAT, Footnote VARCHAR)
D COPY TradeMatrix FROM 'US_TradeMatrix_Part5_ST202306191712_v1.csv' (AUTO_DETECT TRUE);
```
  I made guesses on the table schema, please verify.
From here, one would use DBI and either RSQLite or duckdb to fashion SQL statements that reduce the data (and optionally do other things) before attempting to bring into R.

Lazy operations using duckdb on the CSV file itself, still requiring SQL. (Note that this might not be as fast as importing into a duckdb file as in #1 above. I haven't benchmarked it, but I cannot imagine random-access on CSV files would compare well with indexed database efficiencies.)

duck <- DBI::dbConnect(duckdb::duckdb())
DBI::dbGetQuery(duck, "select * from 'US_TradeMatrix_Part5_ST202306191712_v1.csv' limit 10")
#    Year Year Label Economy Economy Label Partner Partner Label Flow Flow Label SitcRev3Product                                    SitcRev3Product Label US dollars at current prices in thousands US dollars at current prices in thousands Footnote
# 1  2016       2016    0000         World    0000         World   01    Imports               0                                    Food and live animals                                1068138072                                               <NA>
# 2  2016       2016    0000         World    0000         World   01    Imports              00           Live animals other than animals of division 03                                  21120296                                               <NA>
# 3  2016       2016    0000         World    0000         World   01    Imports             001           Live animals other than animals of division 03                                  21120296                                               <NA>
# 4  2016       2016    0000         World    0000         World   01    Imports              01                               Meat and meat preparations                                 128581301                                               <NA>
# 5  2016       2016    0000         World    0000         World   01    Imports             011         Meat of bovine animals, fresh, chilled or frozen                                  39233046                                               <NA>
# 6  2016       2016    0000         World    0000         World   01    Imports             012                         Other meat and edible meat offal                                  66138000                                               <NA>
# 7  2016       2016    0000         World    0000         World   01    Imports             016    Meat, edible meat offal, salted, dried; flours, meals                                   4135049                                               <NA>
# 8  2016       2016    0000         World    0000         World   01    Imports             017     Meat, edible meat offal, prepared, preserved, n.e.s.                                  19075207                                               <NA>
# 9  2016       2016    0000         World    0000         World   01    Imports              02                           Dairy products and birds' eggs                                  76369447                                               <NA>
# 10 2016       2016    0000         World    0000         World   01    Imports             022 Milk, cream and milk products (excluding butter, cheese)                                  37031756                                               <NA>

Lazy operations with arrow, using dplyr as the language:

tradematrix <- arrow::open_csv_dataset("US_TradeMatrix_Part5_ST202306191712_v1.csv")
tradematrix %>%
  head(n=10) %>%
  collect()
# # A tibble: 10 × 12
#     Year `Year Label` Economy `Economy Label` Partner `Partner Label`  Flow `Flow Label` SitcRev3Product `SitcRev3Product Label`                                  `US dollars at current prices in thousands` US dollars at current prices…¹
#    <int>        <int>   <int> <chr>           <chr>   <chr>           <int> <chr>        <chr>           <chr>                                                                                          <dbl>                          <???>
#  1  2016         2016       0 World           0000    World               1 Imports      0               Food and live animals                                                                    1068138072.                              .
#  2  2016         2016       0 World           0000    World               1 Imports      00              Live animals other than animals of division 03                                             21120296.                              .
#  3  2016         2016       0 World           0000    World               1 Imports      001             Live animals other than animals of division 03                                             21120296.                              .
#  4  2016         2016       0 World           0000    World               1 Imports      01              Meat and meat preparations                                                                128581301.                              .
#  5  2016         2016       0 World           0000    World               1 Imports      011             Meat of bovine animals, fresh, chilled or frozen                                           39233046.                              .
#  6  2016         2016       0 World           0000    World               1 Imports      012             Other meat and edible meat offal                                                           66138000.                              .
#  7  2016         2016       0 World           0000    World               1 Imports      016             Meat, edible meat offal, salted, dried; flours, meals                                       4135049.                              .
#  8  2016         2016       0 World           0000    World               1 Imports      017             Meat, edible meat offal, prepared, preserved, n.e.s.                                       19075207.                              .
#  9  2016         2016       0 World           0000    World               1 Imports      02              Dairy products and birds' eggs                                                             76369447.                              .
# 10  2016         2016       0 World           0000    World               1 Imports      022             Milk, cream and milk products (excluding butter, cheese)                                   37031756.                              .
# # ℹ abbreviated name: ¹`US dollars at current prices in thousands Footnote`

This method supports most of dplyrs verbs, and many base R functions within mutate and summarize.

I contacted UNCTAD and they requested to download the file again. Problem is solved. However I would like to thank you specially r2evans for directing me to arrow package. — Mohamed, Jul 23 '23 at 17:26

Can not read file in data.table?

1 Answers1