
I have a SQLite database on an external hard drive, and I would like to analyze a couple thousand subsets of it in R (or Python). The data are about 110 million observations of 13 variables, stored (~10.5 GB) on an external drive because I do not have sufficient storage on my computer. I opted to try a SQL database because, while I do not have enough RAM to work with the entire dataset in R, I only need access to a small percentage of it at any given time.

I built the database with RSQLite -- essentially by just appending a bunch of data frames using dbWriteTable() -- and can successfully query it. The problem is that even a very simple query of the database takes minutes to complete. For example, this takes about 3 minutes to run (in either R or with sqlite3 in Python):

# example in R
library(RSQLite)

# connect to existing database
conn <- dbConnect(RSQLite::SQLite(), "EEGdata.db")

# example of simple query
testQuery <- dbSendQuery(conn, "SELECT * FROM Data WHERE Time = 42")

# return data (takes "forever" but does work)
testdf <- dbFetch(testQuery)

# release the result set once the rows are fetched
dbClearResult(testQuery)

I've tried a number of different (but I think equivalent) ways to execute the query, for instance using dbplyr, but they all similarly take minutes to run.
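
For reference, the dbplyr route looks roughly like this -- a sketch of the idea rather than my exact code; it generates essentially the same WHERE Time = 42 query:

library(dplyr)
library(dbplyr)

# lazy reference to the Data table; filter() is translated to SQL
testTbl <- tbl(conn, "Data") %>%
  filter(Time == 42)

# collect() runs the query and pulls the rows into an R data frame
testdf <- collect(testTbl)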

I've tried using transactions (but I gather this mostly just helps for inserting or modifying information?).
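
Concretely, the transaction attempt was along these lines -- a sketch of wrapping all the inserts in a single transaction, not my exact code:

# sketch: one explicit transaction around all of the inserts
dbBegin(conn)
for (file in filenames){
  temp <- read.delim(file = file, sep = '\t', header = FALSE, col.names = columns)
  dbWriteTable(conn, "Data", temp, append = TRUE)
}
dbCommit(conn)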

I've tried using an index, which does speed up the queries, but they still take more than a minute to complete.
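
For reference, the indexes were added with plain SQL via DBI, something like this (a sketch; the actual schema and index definitions are shown further down):

# sketch: adding indexes after the data are loaded
dbExecute(conn, "CREATE INDEX time_idx ON Data (Time)")
dbExecute(conn, "CREATE INDEX idx ON Data (Hstatus, Electrode, Time)")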

I've tried using the latest R/RSQLite versions, as recommended in: SQLilte query much slower through RSqlite than sqlite3 command-line interface

I've tested that the general read/write speed of my external drive (WD My Passport) is not horribly slow (it transfers data at about 100 MB/s).

I've tried using sqlite3 through Python also, and get roughly the same very slow queries.

I imagine I could have made the database incorrectly somehow, but it's puzzling why it would still function, just very poorly. Just in case, this is an example of how I created the database:

library(RSQLite)

# open connection to new database
conn <- dbConnect(RSQLite::SQLite(), "EEGdata.db")

# loop over data files
for (file in filenames){

  # load file (column names defined elsewhere)
  temp <- read.delim(file = file, sep = '\t', header = FALSE, col.names = columns)

  # do some cleaning, merging, etc...

  # add to database
  dbWriteTable(conn, "Data", temp, append = TRUE)
}

Since the problem looks similar in both R and Python, I assume there's some issue with how I've created the SQLite database, or with the fact that it is stored on an external drive.

Any ideas on what might be going wrong, or what I could try?

EDITS:

Thanks for all the thoughts!

Regarding indexing, as mentioned by @Waldi and @r2evans, it does help - I did not mean to imply it doesn't, or that I wouldn't use an index. For the most complex query I would make (which still only filters by 3 variables), full indexing cuts the time from ~6 minutes to ~1.5 minutes. So the indices are clearly worth using, but indexing alone doesn't seem to account for the slowness.

Regarding a smaller version of the database (@DavidP), good idea - I just tried a few things and found that in this case the actual size of the database doesn't matter much for speed; what matters is the size of the data being returned by the query (particularly the number of rows). But even in smaller cases it feels slow - for instance, returning just 900 rows, with everything fully indexed, takes 1.3 seconds, which from what I can tell is still quite slow for a simple SQL query.

Regarding the speed of the HD (@Waldi and @dnoeth) - I suppose I didn't want to believe this was just a hardware issue, but that's definitely possible. I'll see if I can find an SSD somewhere to test this on and report back.

More information for @Parfait (and others):

sqlite> .schema Data
CREATE TABLE `Data` (
  `Item` INTEGER,
  `Amplitude` REAL,
  `Electrode` INTEGER,
  `Time` INTEGER,
  `Subject` TEXT,
  `Hstatus` TEXT,
  `Word` TEXT,
  `Zipf_freq` REAL,
  `OLD` REAL,
  `Concreteness` REAL,
  `Visual_complexity` REAL,
  `Letter_number` REAL,
  `Morpheme_number` REAL
);
CREATE INDEX time_idx ON Data (Time);
CREATE INDEX idx ON Data (Hstatus, Electrode, Time);

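For anyone who wants to double-check index usage, EXPLAIN QUERY PLAN can be run through the same connection; the filter values below are just placeholders, so this is a sketch rather than one of my real queries:

# sketch: check whether a query uses the multi-column index (placeholder values)
dbGetQuery(conn, "
  EXPLAIN QUERY PLAN
  SELECT * FROM Data
  WHERE Hstatus = 'A' AND Electrode = 1 AND Time = 42
")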

Regarding not using SELECT * - I tried selecting only a single variable instead of all of them, and it did not really improve speed (and in this application I've already cut out the information I don't need, so apart from the index variables I need to select everything).

kwinsler
  • An index on `Time` seems sensible: you didn't see any improvement? The bottleneck is probably the read speed on the external HD. – Waldi Sep 28 '20 at 18:45
  • Yes, adding an index does not give much of an improvement (at least not on the scale of minutes desired). I agree that seems to indicate a bottleneck on the HD, but from what I can tell (from testing with an application - Black Magic), the read speed from my HD is about 100 MB/s, which shouldn't be such an extreme bottleneck. – kwinsler Sep 28 '20 at 19:01
  • Have you tried making a smaller database file (some subset of the source files) and comparing the performance? If a database half the size takes the same time, it would seem to me that it's more likely the drive. I'm not as familiar with sqlite, does it need to read the entire database file or a large set of it to "startup" the database? That could tank your single query performance. – DavidP Sep 28 '20 at 20:23
  • *the read speed from my HD is about 100 MB/s* Seems to be an HDD. Well, this is probably sequential read of larger blocks, but the access pattern of a DBMS is usually not sequential, but random (even if you're the only user, data blocks are probably not in sequential order). Switching to an SSD greatly improves IOs per second. – dnoeth Sep 28 '20 at 20:27
  • In case you missed it when you read your [linked RSQLite recommendation](https://stackoverflow.com/questions/5145054/sqlilte-query-much-slower-through-rsqlite-than-sqlite3-command-line-interface), that advice was from over 9 years ago. There have been several changes to SQLite, R, DBI, and RSQLite since then that render its advice a bit faint. Without the ability to test, I'm fairly confident (nearing certain) that indexing will have a significant effect on speed. To really advise on that would require a bit more information about each column and your anticipated queries (if more than `select *`). – r2evans Sep 28 '20 at 20:31
  • Alternatively, while SQLite is a behemoth and stable system, a newcomer [DuckDB](https://duckdb.org/) has shown some speed advantages over SQLite. Conveniently, it can work natively with `DBI` (though it can also work independently), so your code should not require changes (just transfer of data from file to file). – r2evans Sep 28 '20 at 20:33
  • Please show the schema of table. With CLI: `.schema Data`. As an initial tip, avoid [`SELECT * FROM`](https://stackoverflow.com/q/3639861/1422451) and explicitly select needed columns. You may even be able to localize the problem column. – Parfait Sep 28 '20 at 20:41
  • Thanks everyone for your thoughts! I've added more information and responses as an edit in the question. – kwinsler Sep 28 '20 at 22:36
  • You created a multi-column index on `(Hstatus, Electrode, Time)`, have you added those to your query? (An index only works if the query includes the columns in some filtering capacity.) – r2evans Sep 29 '20 at 05:18
  • Yes, in practice my queries would be filtering on those 3 columns. With the multi-column index, the query still takes about 1.5 minutes. – kwinsler Sep 29 '20 at 15:29
  • You need to use transactions. Commit only after adding many rows; it will increase write speed significantly. – mvp Sep 29 '20 at 19:36

1 Answer


(Aside from the external HD, which is definitely a bottleneck.) I had a similar issue and it came down to the SQLite database itself. You can switch to DuckDB and use the same query code. For me it was more than 10x faster on a slightly larger file with 100 million rows. https://duckdb.org/docs/api/r
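
A minimal sketch of what the switch looks like through DBI (the file name and the toy table contents here are placeholders, not your EEG data):

# sketch only: the same DBI-style code pointed at a DuckDB file instead of SQLite
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), dbdir = "EEGdata.duckdb")

# writes use the same DBI verb as with RSQLite (placeholder data frame)
dbWriteTable(con, "Data", data.frame(Time = c(41, 42), Amplitude = c(0.1, 0.2)))

# and so do queries
testdf <- dbGetQuery(con, "SELECT * FROM Data WHERE Time = 42")

dbDisconnect(con, shutdown = TRUE)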

HCAI