5

I often work with big CSV files (>50GB) with >10,000 columns. I'm thinking of reading the information into RSQLite so that I can do easy querying and subsetting.

ISSUE

SQLite has a limit of 2000 fields and 999 host parameters in a single SQL (see SQLite website). According to this, there is a way to change these parameters for SQLite (i.e. by re-compiling SQLite from source).

QUESTION

How do I make the above changes if I'm using RSQLite package? The funny thing is that according to the package's changelog, the default values have already been increased to 30,000 and 40,000 respectively as early as ver 0.11.0 back in 2011-12-01. But when I test with dataframes with 10,000 columns, RSQLite v2.1.1 still threw an error.

What am I missing here? If I need to re-compile SQLite to change the values, how do I get RSQLite to use that new version?

Thanks in advance!

NoviceProg
  • 815
  • 1
  • 10
  • 22
  • 1
    I believe that `RSQLite` ships with SQLite. How about changing the default value in the `C` file and rebuilding the package? https://raw.githubusercontent.com/r-dbi/RSQLite/master/src/vendor/sqlite3/sqlite3.c – Maylo Dec 30 '19 at 10:44
  • 1
    Ok, it's a step forward though it's certainly above my 'pay-grade' since I've never built an R package from source before. Can I use `devtools` as mentioned in https://kbroman.org/pkg_primer/pages/build.html (middle of the page)? – NoviceProg Dec 30 '19 at 10:55
  • 1
    @r2evans Thanks for reminding me, I've done so already – NoviceProg Apr 21 '20 at 04:49

1 Answers1

5

I have reduced the limits on the number of columns to the defaults in RSQLite 1.1:

Compilation limits SQLITE_MAX_VARIABLE_NUMBER and SQLITE_MAX_COLUMN have been reset to the defaults. The documentation suggests setting to such high values is a bad idea.

SQLite stores the data on disk in rows. Querying a single column will require reading the entire file, or a substantial subset thereof.

Have you considered DuckDB? It's a new

embedded database designed to execute analytical SQL queries fast while embedded in another process. It is designed to be easy to install and easy to use. ... DuckDB has bindings for C/C++, Python and R.

(Emphasis mine.)

I'm told that it organizes data internally in combined row-column order: querying entire columns will only require reading a tiny fraction of the file, but returning entire rows (after filtering) is still fast. You can access a DuckDB database through DBI and dbplyr.

CAVEAT: I'm not aware if there is a similar limitation on the number of columns, but I would be greatly surprised if the limit is much smaller than 2³¹. Let me know if it works for you.

krlmlr
  • 25,056
  • 14
  • 120
  • 217
  • 1
    Thanks for clarifying! I haven't heard of DuckDB, will take a look. – NoviceProg Dec 30 '19 at 13:43
  • Hey @krlmlr. I too am interested to look at DuckDB. Unless I'm missing something, though, it supports `timestamp` but only `no time zone`, do you know of any attempts to address that? (For me, I cannot even begin to trust a system that reliably ignores TZ.) (Sorry for the slight hijack.) – r2evans Apr 15 '20 at 16:51
  • @r2evans: Timestamps must be UTC. Strings must be UTF-8. Lines must end with `"\n"`. I think DuckDB has it right. – krlmlr Apr 16 '20 at 03:20
  • I don't think it's wrong to use time zones to make parsing times easier to understand for humans. It's either an attribute of the column, or if you really need different time zones for each row, it can be stored in a second column. – krlmlr Apr 16 '20 at 03:21
  • Databases that "support" times with time zones do this in a terribly inconsistent and sometimes broken way, in my experience. For a problem like this it's better to take a step back I think. – krlmlr Apr 16 '20 at 03:23
  • Interesting point. Inconsistent, yes. My experience has shown me that I cannot trust humans to do tz/dst math, and too many programmers trust simple addition or subtraction for converting to UTC (forgetting DST). I don't suggest that including TZ makes it easier to parse strings, it's because having the offset can be informative (and keep humans from using `+`/`-`). – r2evans Apr 16 '20 at 13:41
  • @r2evans: Timestamps are normally stored as number of seconds/days since some epoch. No programmer is supposed to use + or - for time conversion, there are libraries that do that. I agree that for presentation/UI it's great to display local times if this is appropriate, still times must be stored as UTC internally (and I will do everything to defend this hill). – krlmlr Apr 17 '20 at 03:58
  • So this becomes a question about how to specify that a column wants its time to be presented in a particular time zone *that is not the local time zone* (or a time zone set in a user setting in the UI). How frequent are these use cases? – krlmlr Apr 17 '20 at 04:01
  • 1
    With one of my clients ... *all the time*. The fact that (despite this) they still hard-code time zone (or ignore it) is a mystery and source of frustration. Several times I was able to help pinpoint which machine was the culprit (or at least which office) based on this. I agree that there are libraries that do this, but they only work when they are trusted and used. Either way, I understand your point (and cannot disagree, in theory at least). – r2evans Apr 17 '20 at 04:29