39

I'm creating a database schema for storing historical stock data. I currently have a schema as shown below.

My requirements are to store "bar data" (date, open, high, low, close, volume) for multiple stock symbols. Each symbol might also have multiple timeframes (e.g. Google Weekly bars and Google Daily bars).

My current schema puts the bulk of the data in the OHLCV table. I'm far from a database expert and am curious if this is too naive. Constructive input is very welcome.

CREATE TABLE Exchange (exchange TEXT UNIQUE NOT NULL);

CREATE TABLE Symbol (symbol TEXT UNIQUE NOT NULL, exchangeID INTEGER NOT NULL);

CREATE TABLE Timeframe (timeframe TEXT NOT NULL, symbolID INTEGER NOT NULL);

CREATE TABLE OHLCV (date TEXT NOT NULL CHECK (date LIKE '____-__-__ __:__:__'),
    open REAL NOT NULL,
    high REAL NOT NULL,
    low REAL NOT NULL,
    close REAL NOT NULL,
    volume INTEGER NOT NULL,
    timeframeID INTEGER NOT NULL);

This means my queries currently go something like: Find the timeframeID for a given symbol/timeframe, then do a select on the OHLCV table where the timeframeID matches.
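
For example, fetching daily Google bars currently goes in two steps, roughly like this (a minimal sketch, assuming SQLite-style implicit rowids stand in for the ID columns; the symbol and timeframe values are just placeholders):

    -- 1. Look up the timeframeID for a symbol/timeframe pair
    SELECT t.rowid AS timeframeID
      FROM Timeframe t
      JOIN Symbol s ON s.rowid = t.symbolID
     WHERE s.symbol = 'GOOG'
       AND t.timeframe = 'Daily';

    -- 2. Pull the bars for that timeframeID
    SELECT date, open, high, low, close, volume
      FROM OHLCV
     WHERE timeframeID = 42        -- the ID returned by the first query
     ORDER BY date;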

nall
  • Not really sure what the question is here? Code review? – randomx Oct 06 '09 at 04:49
  • 4
    The question is: "Is this a reasonable design when you consider large data sets or should it be rethought?" – nall Oct 06 '09 at 04:54
  • Can you please provide the schema diagram for this? What database are you using? I am having a similar situation. – Arun Raja Nov 12 '14 at 03:44
  • Can you share the final schema that you came up with? – Sisir Jan 17 '15 at 07:28
  • There is a very rich discussion about it on quant.stackexchange: https://quant.stackexchange.com/questions/29572/building-financial-data-time-series-database-from-scratch I really recommend it. – Valdek Santana Apr 10 '21 at 13:53

3 Answers

46

We spent a long time trying to find a proper database structure for storing large amounts of data. The solution below is the result of more than six years of experience, and it now works flawlessly for our quantitative analysis.

We have been able to store hundreds of gigabytes of intraday and daily data in SQL Server using this schema:

 Symbol - char(6)
 Date - date
 Time - time
 Open - decimal(18, 4)
 High - decimal(18, 4)
 Low - decimal(18, 4)
 Close - decimal(18, 4)
 Volume - int

All trading instruments are stored in a single table. We also have a clustered index on the Symbol, Date, and Time columns.

For daily data, we have a separate table that does not use the Time column, and its Volume datatype is bigint instead of int.
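
In T-SQL the intraday table works out to roughly the following (a sketch; the table names are just illustrative, the column types are as listed above):

    CREATE TABLE IntradayBars (
        Symbol   char(6)        NOT NULL,
        [Date]   date           NOT NULL,
        [Time]   time           NOT NULL,
        [Open]   decimal(18, 4) NOT NULL,
        High     decimal(18, 4) NOT NULL,
        Low      decimal(18, 4) NOT NULL,
        [Close]  decimal(18, 4) NOT NULL,
        Volume   int            NOT NULL
    );

    -- Clustered index so reads for one symbol over a date/time range are sequential
    CREATE CLUSTERED INDEX CIX_IntradayBars
        ON IntradayBars (Symbol, [Date], [Time]);

    -- The daily table drops [Time] and widens Volume to bigint
    CREATE TABLE DailyBars (
        Symbol   char(6)        NOT NULL,
        [Date]   date           NOT NULL,
        [Open]   decimal(18, 4) NOT NULL,
        High     decimal(18, 4) NOT NULL,
        Low      decimal(18, 4) NOT NULL,
        [Close]  decimal(18, 4) NOT NULL,
        Volume   bigint         NOT NULL
    );

    CREATE CLUSTERED INDEX CIX_DailyBars
        ON DailyBars (Symbol, [Date]);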

The performance? We can get data out of the server in a matter of milliseconds. Remember, the database size is almost 1 terabyte.

We purchased all of our historical market data from the Kibot web site: http://www.kibot.com/

boe100
    So how do you cater for stock splits? – Lydon Ch Jun 26 '10 at 04:19
  • 5
    We track splits and dividends on a daily basis and delete and then bulk insert data for every symbol that needs to be changed. – boe100 Jul 30 '10 at 14:39
  • You answer was very helpful. Thanks. – Brad Lucas Feb 06 '12 at 23:03
  • I would avoid Kibot data, if you search around online, you will find lots of forum posts about problems. Here's an example: https://www.quantnet.com/threads/free-historical-intraday-stock-data.3275/#post-60134 – user788171 Sep 21 '12 at 21:54
  • 2
    @boe100 I'm also considering this "flat" db schema. One doubt I have - assuming 10k tickers, to store daily data for 5 years, you will have 10k * 365 * 5 = 18,250,000 rows in the single table. Does your database handle this large table? May I know what database solution you are using, and if you have done table partitioning? Thanks! – KFL Nov 14 '13 at 01:59
  • 2
    @boe100 can you send the full structure of the database? We have to consider fundamental data, market data, etc. How does it handle scalability if we need to add many tables in the future and link them? If possible, could you share your mail id so that I can get some guidance from you. – Arun Raja Nov 12 '14 at 04:08
  • Can you explain why you split date and time? why not just use datetime? – Sisir Jan 17 '15 at 09:09
  • Can you provide the full schema for an idea? – Arun Raja Aug 14 '15 at 03:00
  • Sir, I tried to implement your approach. I found that "select * from stock_table where symbol=xxx" is fast, but "select * from stock_table where date=xxx" is quite slow. Is that true? – Xu Hui Oct 29 '21 at 12:19
  • @XuHui make sure you have the `index` set – matt Jul 18 '22 at 19:34
32

Well, on the positive side, you have the good sense to ask for input first. That puts you ahead of 90% of people unfamiliar with database design.

  • There are no clear foreign key relationships. I take it timeframeID relates to symbolID? (The rough sketch after this list shows one way to make those relationships explicit.)
  • It's unclear how you'd be able to find anything this way. Reading up on the above-mentioned foreign keys should improve your understanding tremendously with little effort.
  • You're storing timeframe data as TEXT. From a performance as well as a usability perspective, that's a no-no.
  • Your current scheme can't accommodate stock splits, which will happen eventually. It's better to add one further layer of indirection between the price data table and the Symbol table.
  • Open, high, low, and close prices are better stored as decimal or currency types or, preferably, as an INTEGER field with a separate INTEGER field storing the divisor, since the smallest price fraction (cents, eighths of a dollar, etc.) allowed varies per exchange.
  • Since you support multiple exchanges, you should support multiple currencies.
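
To make the foreign key and integer-plus-divisor points a little more concrete, here is a very rough sketch of what they might look like in DDL (generic SQL, all names illustrative, and the split-handling indirection is left out; it's nowhere near a full design):

    CREATE TABLE Exchange (
        exchangeID   INTEGER PRIMARY KEY,
        exchange     TEXT    NOT NULL UNIQUE,
        currency     TEXT    NOT NULL               -- e.g. 'USD', 'JPY'
    );

    CREATE TABLE Symbol (
        symbolID     INTEGER PRIMARY KEY,
        symbol       TEXT    NOT NULL,
        exchangeID   INTEGER NOT NULL REFERENCES Exchange(exchangeID),
        priceDivisor INTEGER NOT NULL DEFAULT 100,  -- 100 = cents, 8 = eighths, ...
        UNIQUE (symbol, exchangeID)
    );

    -- Prices stored as INTEGER multiples of 1/priceDivisor instead of REAL
    CREATE TABLE Bar (
        symbolID   INTEGER NOT NULL REFERENCES Symbol(symbolID),
        timeframe  INTEGER NOT NULL,                -- bar length, e.g. in minutes
        barDate    TEXT    NOT NULL,
        open       INTEGER NOT NULL,
        high       INTEGER NOT NULL,
        low        INTEGER NOT NULL,
        close      INTEGER NOT NULL,
        volume     INTEGER NOT NULL,
        PRIMARY KEY (symbolID, timeframe, barDate)
    );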

I apologise if all of this doesn't seem too 'constructive', especially since I'm too sleepy right now to suggest a more usable alternative. I hope the above is enough to set you on your way.

Michiel Buddingh
  • This is very useful, thanks. The timeframe data is easy enough to turn into an INTEGER, though I'm not sure what you mean by it being a no-no from a usability perspective (internally I have an object representation). – nall Oct 06 '09 at 05:26
  • 2
    Nall - consider the volumes when looking at INTEGER in the database: is the size of the datatype going to be enough? – Oct 06 '09 at 08:35
  • 1
    Many databases support fixed-precision decimal types now, such as Postgres's NUMERIC type. This would be better than storing two INTEGERs for the price, as it keeps the price data in one column while avoiding floating-point rounding errors. – Paul Legato Mar 24 '12 at 21:13
4

I'm not sure what value is added by Timeframe - it seems like an unnecessary complication, but that could be something I'm failing to understand ;-) Can a Timeframe have more than one OHLCV? If not, then I'd suggest they be merged.

I would note also that stock tickers change from time to time for any number of reasons. It's not a frequent event, but it happens. If you're thinking about working with your data as time series, you should be aware of the issue so that you can handle it when it comes, if not before. If you're not tracking stocks (you may be working on a futures app, say) then this advice may be taken with the appropriate amount of salt.

Again mostly relevant to stocks, splits have been mentioned elsewhere and you may want to consider dividends - a stock's price will typically fall by the dividend amount (or more accurately the present value thereof) on the ex-dividend date, which may be misinterpreted if you don't know a confirmed future cash flow was the reason. Rights issues can be fun, too.

If you're planning on looking at series of data for a particular symbol, I'd suggest looking into what sort of performance you're going to get. At the very least, make sure you have an appropriate index in place.
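
For instance, with the schema as posted, something along these lines would be a sensible minimum (illustrative only; adjust it to whatever key columns you end up with):

    -- Lets date-range queries for one symbol/timeframe avoid a full table scan
    CREATE INDEX idx_ohlcv_timeframe_date ON OHLCV (timeframeID, date);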

Mike Woodhouse
  • There are multiple entries in the OHLCV table for each symbol+timeframe combination. Thanks for the symbol change observation. – nall Oct 06 '09 at 14:49