
The schema

The SQLite database described below is currently 52 GB. It was generated as part of an Entity Framework Core 6.0 project.

The Trades table

CREATE TABLE IF NOT EXISTS "Trades" (
    "Id"        INTEGER NOT NULL CONSTRAINT "PK_Trades" PRIMARY KEY AUTOINCREMENT,
    "SymbolId"  INTEGER NOT NULL,
    "Price"     TEXT    NOT NULL,
    "Quantity"  TEXT    NOT NULL,
    "TimeStamp" TEXT    NOT NULL,
    "Side"      INTEGER NOT NULL,
    "Type"      INTEGER NOT NULL,
    CONSTRAINT "FK_Trades_Symbols_SymbolId" FOREIGN KEY ("SymbolId") REFERENCES "Symbols" ("Id") ON DELETE CASCADE
);

This holds trades from the Kraken cryptocurrency exchange.

The data was populated from CSVs downloaded here:

https://support.kraken.com/hc/en-us/articles/360047543791-Downloadable-historical-market-data-time-and-sales

The most recent trades were retrieved via API.

Data excerpt:

sqlite> SELECT * FROM Trades LIMIT 5;
Id        SymbolId  Price  Quantity   TimeStamp            Side  Type
--------  --------  -----  ---------  -------------------  ----  ----
17272735  29        2.364  29.18557   2021-08-10 15:31:26  0     0
17272736  29        2.364  12.50281   2021-08-10 15:31:34  0     0
17272737  29        2.363  416.76043  2021-08-10 15:31:36  0     0
17272738  29        2.365  38.94156   2021-08-10 15:32:00  0     0
17272739  29        2.363  0.41676    2021-08-10 15:32:05  0     0

The current size of the table:

sqlite> SELECT COUNT(*) FROM Trades;
COUNT(*)
---------
535560814
Run Time: real 7.746 user 0.984375 sys 6.750000

The Symbols table

CREATE TABLE IF NOT EXISTS "Symbols" (
    "Id" INTEGER NOT NULL CONSTRAINT "PK_Symbols" PRIMARY KEY AUTOINCREMENT,
    "Name" TEXT NOT NULL
);

Each trading pair has a symbol, e.g. XBTUSD for Bitcoin traded against the US dollar.

Data excerpt:

sqlite> SELECT * FROM Symbols LIMIT 5;
Id  Name
--  --------
29  1INCHEUR
30  1INCHUSD
31  AAVEAUD
32  AAVEETH
33  AAVEEUR

Some key indices

Indices on some critical columns:

CREATE UNIQUE INDEX "IX_Symbols_Name"     ON "Symbols" ("Name");
CREATE        INDEX "IX_Trades_SymbolId"  ON "Trades"  ("SymbolId");
CREATE        INDEX "IX_Trades_TimeStamp" ON "Trades"  ("TimeStamp");

Query performance

Most of the basic queries I've run so far seem to be quite slow. Here are some examples.

10 most recent XBTUSD trades (31 seconds)

SELECT * 
FROM Trades 
WHERE SymbolId = (SELECT Id FROM Symbols WHERE Name = 'XBTUSD') 
ORDER BY TimeStamp DESC 
LIMIT 10;

Run Time: real 31.641 user 28.984375 sys 2.640625

Id         SymbolId  Price    Quantity    TimeStamp                Side  Type
---------  --------  -------  ----------  -----------------------  ----  ----
552833548  377       48683.1  0.157       2021-12-06 06:16:17.804  1     0
552833547  377       48681.5  0.00049872  2021-12-06 06:16:04.855  0     0
552833546  377       48681.4  0.32        2021-12-06 06:16:02.729  1     0
552833545  377       48680.4  0.1         2021-12-06 06:15:53.826  0     0
552833544  377       48697.4  0.05        2021-12-06 06:15:39.56   1     1
552833543  377       48694.5  0.04923492  2021-12-06 06:15:22.787  0     0
552833542  377       48690.2  0.00210422  2021-12-06 06:15:22.786  0     0
552833541  377       48689.6  0.02201001  2021-12-06 06:15:22.487  0     0
552833540  377       48682.5  0.00052279  2021-12-06 06:15:22.486  0     0
552833539  377       48652.3  0.00011573  2021-12-06 06:15:22.485  0     0
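
One way to see what SQLite is doing here is EXPLAIN QUERY PLAN. The sketch below runs the same query against an empty in-memory copy of a trimmed version of the schema; the composite index IX_Trades_SymbolId_TimeStamp is my own hypothetical addition, not part of the schema above, and the plan SQLite picks on the real 52 GB file may of course differ:

```python
import sqlite3

# Empty in-memory copy of a trimmed version of the schema above.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Symbols (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE Trades (
        Id        INTEGER PRIMARY KEY,
        SymbolId  INTEGER NOT NULL,
        Price     TEXT    NOT NULL,
        TimeStamp TEXT    NOT NULL
    );
    CREATE INDEX IX_Trades_SymbolId  ON Trades (SymbolId);
    CREATE INDEX IX_Trades_TimeStamp ON Trades (TimeStamp);
""")

query = """
    EXPLAIN QUERY PLAN
    SELECT * FROM Trades
    WHERE SymbolId = (SELECT Id FROM Symbols WHERE Name = 'XBTUSD')
    ORDER BY TimeStamp DESC
    LIMIT 10
"""
for row in con.execute(query):
    print(row)

# A composite index matches both the equality filter and the ORDER BY,
# so SQLite can walk it backwards instead of sorting.
# (Hypothetical index, not in the schema above.)
con.execute(
    "CREATE INDEX IX_Trades_SymbolId_TimeStamp ON Trades (SymbolId, TimeStamp)")
for row in con.execute(query):
    print(row)
```

With the composite index in place, the plan reported for the empty copy switches to a search on IX_Trades_SymbolId_TimeStamp rather than a separate sort step.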

Show high and low by month for XBTUSD (64 seconds)

SELECT 
    strftime('%Y-%m', TimeStamp)  AS date, 
    COUNT(*)                      AS trade_count, 
    min( CAST(Price AS DECIMAL) ) AS low, 
    max( CAST(Price AS DECIMAL) ) AS high 
FROM Trades 
WHERE 
    SymbolId = (SELECT Id FROM Symbols WHERE Name = 'XBTUSD') 
GROUP BY strftime('%Y-%m', TimeStamp)
ORDER BY date;

Run Time: real 64.743 user 55.421875 sys 8.625000

date     trade_count  low        high
-------  -----------  ---------  ---------
2013-10  422          122        207.30246
2013-11  7147         201.04     1198.8009
2013-12  14964        380        1130
2014-01  7747         737.15115  1019
2014-02  8990         450        835
2014-03  4314         436.07375  720
2014-04  3063         353.75804  583.75327
2014-05  2046         422.19     629.0355
2014-06  2533         525        677.90898
2014-07  838          566.75121  658.87046
2014-08  1756         455        602.36146
2014-09  930          295        527.99987
2014-10  1188         273.32564  418.98998
...
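
An aside on the GROUP BY key: since TimeStamp is stored as ISO-8601 text, the year-month bucket can also be read straight off the string with substr, skipping date parsing entirely. A small sketch of the equivalence (whether this is measurably faster over 535 million rows is something I'd have to benchmark):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# For ISO-8601 text timestamps, the first seven characters are
# already the '%Y-%m' bucket that strftime computes per row.
row = con.execute("""
    SELECT strftime('%Y-%m', '2021-08-10 15:31:26'),
           substr('2021-08-10 15:31:26', 1, 7)
""").fetchone()
print(row)  # ('2021-08', '2021-08')
```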

Number of trades by symbol (53 seconds)

SELECT Symbols.Name, SymbolId, COUNT(*) AS trade_count
FROM Trades JOIN Symbols ON Trades.SymbolId = Symbols.Id
GROUP BY SymbolId
ORDER BY trade_count;

Run Time: real 53.176 user 46.031250 sys 7.031250

Name       SymbolId  trade_count
---------  --------  -----------
WBTCXBT    369       561
RAYUSD     291       674
WBTCUSD    368       1003
YFIAUD     407       1460
GHSTGBP    167       1488
WBTCEUR    367       1860
FILAUD     154       2036
RAYEUR     290       2154
BNTGBP     88        2437
GRTAUD     174       2514
ZRXGBP     417       2732
BNTXBT     90        3196
SUSHIGBP   333       3387
...
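
For what it's worth, the existing IX_Trades_SymbolId index should already "cover" the aggregate part of this query: everything the GROUP BY and COUNT need is in the index itself, so SQLite can scan the index rather than the 52 GB table. A sketch against an empty copy of the schema, using a simplified query without the join (exact plan text varies between SQLite versions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Trades (
        Id       INTEGER PRIMARY KEY,
        SymbolId INTEGER NOT NULL,
        Price    TEXT    NOT NULL
    );
    CREATE INDEX IX_Trades_SymbolId ON Trades (SymbolId);
""")

# The aggregate only touches SymbolId, so the index alone can answer it;
# the plan should report a scan of the covering index, not the table.
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT SymbolId, COUNT(*) FROM Trades GROUP BY SymbolId
""").fetchall()
for row in plan:
    print(row)
```

Even so, scanning a 535-million-entry index is a lot of I/O, which presumably accounts for much of the 53 seconds.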

The question

Do you have any suggestions for how to improve the performance of the queries shown above?

For example, would you recommend additional indices? Or are there SQLite-specific parameters I should look at?

Or would you say that, given the size of the database and the fact that it's SQLite, queries like this are simply going to be slow? If that's the case, I can consider switching to SQL Server or PostgreSQL. I like SQLite because it has the lowest setup overhead, and it's free (unlike SQL Server).
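
On the "SQLite-specific parameters" point: the settings I've seen mentioned most often for large, read-mostly databases are the page cache, temp storage, and memory-mapped I/O pragmas. The values below are placeholders to illustrate the syntax, not tuned recommendations:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the real trades database file

# Illustrative values only -- sensible settings depend on available RAM.
con.execute("PRAGMA cache_size = -2000000")   # negative means KiB, so ~2 GiB of page cache
con.execute("PRAGMA temp_store = MEMORY")     # keep sort/temp B-trees in RAM
con.execute("PRAGMA mmap_size = 8589934592")  # allow up to 8 GiB of memory-mapped I/O

print(con.execute("PRAGMA cache_size").fetchone())  # (-2000000,)
```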

Notes

Although I'm showing SQLite queries above, the database and schema were created via an Entity Framework Core 6.0 project.

The project, which includes the EF Core database code as well as the programs that import the trade data from CSV and the API, is available here:

https://github.com/dharmatech/kraken-trades-database

The entity models and database context:

https://github.com/dharmatech/kraken-trades-database/blob/008-import-api/KrakenTradesDatabase/Model.cs

Code that imports the CSV (if you're curious):

https://github.com/dharmatech/kraken-trades-database/blob/008-import-api/KrakenImportCsv/Program.cs

Code that imports via API (if you're curious):

https://github.com/dharmatech/kraken-trades-database/blob/008-import-api/KrakenImportApi/Program.cs

TEXT columns

Lasse asks a good question in the comments below:

is there a reason why your price and amount was typed to text and not to REAL

He's referring to these columns in the Trades table:

"Price"     TEXT    NOT NULL,
"Quantity"  TEXT    NOT NULL,

That's what EF Core generated from the following C# model class:

[Index(nameof(TimeStamp))]
public class Trade
{
    public int Id { get; set; }

    public int SymbolId { get; set; }

    public Symbol Symbol { get; set; } // navigation property
    public decimal Price { get; set; }
    public decimal Quantity { get; set; }
    public DateTime TimeStamp { get; set; }

    public OrderSide Side { get; set; }
    public OrderTypeMinimal Type { get; set; }
}
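
The practical consequence of the TEXT storage shows up in the aggregate queries above: without the CAST, min/max compare the strings lexicographically, which is numerically wrong once prices have different digit counts. A small demonstration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (Price TEXT NOT NULL)")
con.executemany("INSERT INTO t VALUES (?)", [("9.5",), ("10.2",), ("48683.1",)])

# Lexicographic comparison on TEXT: '10.2' < '48683.1' < '9.5'
print(con.execute("SELECT min(Price), max(Price) FROM t").fetchone())
# -> ('10.2', '9.5')  -- numerically wrong

# CAST applies numeric affinity, so comparisons become numeric
# (at the cost of one conversion per row, as Lasse points out).
print(con.execute(
    "SELECT min(CAST(Price AS DECIMAL)), max(CAST(Price AS DECIMAL)) FROM t"
).fetchone())
# -> (9.5, 48683.1)
```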

Comments

• Wow, that's what I call a well-prepared question! +1 from me. First, try to get rid of subqueries and use joins. That might help, because subqueries are often hard for optimizers to build an optimal plan for. – ekochergin Dec 06 '21 at 07:56
• well, SQLite _is_ designed and optimised to portably handle _small_ databases. (opinionated suggestion: we use Postgres for every project, except when the customer forces us to use SQL Server, which has a free "Express" version that is size-limited.) you don't provide any numbers, but I'm guessing your dataset is rather large? also: indices. yes. on any column that you sort or filter or group by. – Franz Gleichmann Dec 06 '21 at 07:56
• `SymbolId = (SELECT Id FROM Symbols WHERE Name = 'XBTUSD')` about how many rows do you think/know this criterion would match? You're running a strftime for each row, as well as two casts per row to grab min/max. Is there a reason why your price and amount were typed as text and not as `REAL`? – Lasse V. Karlsen Dec 06 '21 at 08:01
• It doesn't matter how the database was created - by EF Core or by hand. What matters is the schema and the indexes. The order in `ORDER BY TimeStamp DESC` is the *reverse* of `IX_Trades_TimeStamp`. – Panagiotis Kanavos Dec 06 '21 at 08:01
• SQLite is an embedded database - this means it runs inside your application, using the application's RAM. It doesn't have a complex query optimizer, nor can it take advantage of parallelism, advanced buffering, etc. A server-based database would produce the same execution plan for both `SymbolId = (SELECT Id FROM Symbols WHERE Name = 'XBTUSD')` and `INNER JOIN Symbols s on s.Id=SymbolId`, but SQLite? Similarly, a full database could use an index even for `ORDER BY TimeStamp desc`, but are you sure SQLite can do this? – Panagiotis Kanavos Dec 06 '21 at 08:05
• A server has far more power and RAM than SQLite and can aggressively cache indexes and data. That's why `ORDER BY Timestamp desc` may not matter that much to a server - the data page may already be buffered. SQLite, on the other hand, uses your application's RAM and can't load too much. It's built to run on constrained devices, after all, so it can't buffer aggressively. – Panagiotis Kanavos Dec 06 '21 at 08:07
• @LasseV.Karlsen That's a good question. I've updated the question to answer this. See the section at the end titled *TEXT columns*. – dharmatech Dec 06 '21 at 08:07
• @FranzGleichmann The database is 52 GB. The `Trades` table has 535,560,814 rows. Yes, I have my eye on PostgreSQL. ;-) I just wanted to see if SQLite was feasible before making a larger database a requirement. – dharmatech Dec 06 '21 at 08:11
• `That's what EF Core generated from the following C# model class:` you don't have to use this if you don't like it, although it makes some sense. SQLite doesn't even have types. `REAL` and `TEXT` aren't types, they're storage classes. `NUMERIC` is a facet that can be applied to any storage class, including `TEXT`. `REAL` has all the precision problems of floating-point numbers, while `decimal` is used to avoid such problems in the first place. So while the correct "facet" for `decimal` would be `NUMERIC`, `REAL` would probably *not* be a good idea. – Panagiotis Kanavos Dec 06 '21 at 08:12
• @dharmatech well, anything that's measured in gigabytes at all is something I would _never_ store in SQLite. definitely switch to a full-sized RDBMS. – Franz Gleichmann Dec 06 '21 at 08:13
• @dharmatech `The database is 52 GB` and handled by a single core in your application. Whether it's feasible or not depends on what you want to do with it, and on taking care to use *precisely* the right schema and indexes. You can't expect it to handle both ingest of trades *and* analysis. These require very different indexes. – Panagiotis Kanavos Dec 06 '21 at 08:14
• @PanagiotisKanavos Excellent point where you say *You can't expect it to both handle ingest of trades and analysis.* I agree. This database is mostly for analysis, not realtime trade ingestion. – dharmatech Dec 06 '21 at 08:16
• @dharmatech for ingest, the current timestamp index is OK, as new records will just modify the tail of the index. If the timestamp index were reversed, you'd be inserting at the head and would have to move other index pages to make space. That would be bad. But it also means that unless SQLite's optimizer is smart enough, you can't just ask for the last 10 trades per symbol. – Panagiotis Kanavos Dec 06 '21 at 08:16
• `So this database is mostly for analysis` - in that case SQLite is **completely inappropriate**. Many of the operations you'll perform require moving back and forth, scanning sequences of trades, calculating values in windows. You *do* need to load as much as possible into memory, you *do* need advanced indexes, you do need parallelization, not just multi-user execution, etc. If the database were smaller, SQL Server Express would be an option, as it offers columnstore indexes that can speed up processing a lot. At 52 GB, though, you need lots of RAM, CPU *and* IO. – Panagiotis Kanavos Dec 06 '21 at 08:23
• @PanagiotisKanavos Yeah, I was looking at the size limits of SQL Server. It looks like LocalDB is limited to 10 GB. I'll likely take a look at PostgreSQL. Thank you for your suggestions! – dharmatech Dec 06 '21 at 08:25
• Some of the measures you calculate, though, can easily be calculated while importing the data. Calculating a median for a symbol, for example, can't be done without actually reading all trades. Why not calculate it while importing the data, then? Of course, that means you can't just use `COPY` to import CSVs; you'll have to write a job that calculates such measures while loading the data. – Panagiotis Kanavos Dec 06 '21 at 08:28
• @PanagiotisKanavos Well, importing the data was a whole separate adventure. :-) That was discussed and solved in [this question](https://stackoverflow.com/questions/70207939/importing-a-1-3gb-csv-file-into-sqlite-via-ef-core). SQLite has held up quite well in that respect! – dharmatech Dec 06 '21 at 08:29
• @dharmatech LocalDB is another embedded DB and a feature of SQL Server Express. If you have Express, you don't need LocalDB unless you explicitly need an embedded DB. You *could* use Express if you could split your DB into multiple ones, but whether that makes sense depends on your specific requirements. There are ways to use columnstores in PostgreSQL as well. And not all trading calculations can benefit from a columnstore either. Sums, counts, averages, min, max will be 1000x faster. Medians could be *slower*, because you'd have to "unpack" the columns into individual rows again. – Panagiotis Kanavos Dec 06 '21 at 08:32
• @PanagiotisKanavos I've switched over to using PostgreSQL. Yes, the queries are much faster. :-) Interestingly, the data ingestion is maybe 20% slower than SQLite. This version is in the same GitHub project linked above. – dharmatech Dec 16 '21 at 21:25
• @FranzGleichmann I've switched over to using PostgreSQL. Yes, the queries are much faster. :-) Interestingly, the data ingestion is maybe 20% slower than SQLite. This version is in the same GitHub project linked above. – dharmatech Dec 16 '21 at 21:26

0 Answers