I'm having fun learning about Hadoop and the various projects around it, and I currently have two different strategies in mind for building a system to store a large collection of market tick data. I'm just getting started with both Hadoop/HDFS and HBase, but I'm hoping someone can help me plant a system seed with these technologies that I won't have to junk later. Below is an outline of my system and requirements, with some query and data usage use cases, and lastly my current thinking about the best approach from the little documentation I have read. It is an open-ended question; I'll gladly upvote any answer that is insightful and accept the best one. Feel free to comment on any or all of the points below. - Duncan Krebs
System Requirements - Be able to leverage the data store for historical backtesting of systems, historical data charting, and future data mining. Once stored, data will always be read-only. Fast data access is desired but not a must-have when backtesting.
Static Schema - Very simple. I want to capture 3 types of messages from the feed:
- Timestamp, including date, day, time
- Quote, including symbol, timestamp, ask, askSize, bid, bidSize, volume.... (about 40 columns of data)
- Trade, including symbol, timestamp, price, size, exchange.... (about 20 columns of data)
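To make the schema concrete, here is a minimal sketch of the three message types as Python dataclasses. It lists only the columns I named above; the class layout, the snake_case field names, and the idea of deriving the day name from a single datetime are my own assumptions, not anything dictated by the feed.

```python
from dataclasses import dataclass
from datetime import datetime

# Fixed English day names, indexed by datetime.weekday() (Monday is 0).
DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


@dataclass(frozen=True)
class Timestamp:
    # Assumption: one datetime carries the date, day and time of the message.
    moment: datetime

    @property
    def day(self) -> str:
        return DAY_NAMES[self.moment.weekday()]


@dataclass(frozen=True)
class Quote:
    symbol: str
    timestamp: datetime
    ask: float
    ask_size: int
    bid: float
    bid_size: int
    volume: int
    # The remaining quote columns (about 40 in total) are left out here.


@dataclass(frozen=True)
class Trade:
    symbol: str
    timestamp: datetime
    price: float
    size: int
    exchange: str
    # The remaining trade columns (about 20 in total) are left out here.
```

The classes are frozen because the data is read-only once stored.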
Data Insert Use Cases - Data arrives either from a live market data stream or from lookups via a broker API.
Data Query Use Cases - The examples below show how I would like to logically query my data.
- Get me all Quotes, Trades, Timestamps for GOOG on 9/22/2014
- Get me all Trades for GOOG, FB BEFORE 9/1/2014 AND AFTER 5/1/2014
- Get me the number of trades for these 50 symbols for each day over the last 90 days.
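To pin down what I mean by these three queries, here is a rough sketch of the logic in plain Python. It runs against an in-memory list rather than HBase or HDFS, so it says nothing about how the data should be laid out. The function names, the cut-down `Trade` record, and the sample values are all made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, datetime, timedelta


@dataclass(frozen=True)
class Trade:
    symbol: str
    timestamp: datetime
    price: float
    size: int


def on_day(records, symbol, day):
    """Query 1: every record for one symbol on one calendar day."""
    return [r for r in records
            if r.symbol == symbol and r.timestamp.date() == day]


def between(records, symbols, after, before):
    """Query 2: records for several symbols strictly between two dates."""
    wanted = set(symbols)
    return [r for r in records
            if r.symbol in wanted and after < r.timestamp.date() < before]


def daily_counts(records, symbols, last_day, days=90):
    """Query 3: record count per (symbol, day) over the last `days` days."""
    wanted = set(symbols)
    first_day = last_day - timedelta(days=days - 1)
    return Counter(
        (r.symbol, r.timestamp.date())
        for r in records
        if r.symbol in wanted and first_day <= r.timestamp.date() <= last_day
    )


# Made-up sample trades, only to exercise the functions.
trades = [
    Trade("GOOG", datetime(2014, 9, 22, 9, 30, 0), 100.0, 100),
    Trade("GOOG", datetime(2014, 9, 22, 9, 30, 5), 100.5, 200),
    Trade("GOOG", datetime(2014, 6, 2, 10, 0, 0), 99.0, 50),
    Trade("FB", datetime(2014, 7, 15, 11, 0, 0), 50.0, 300),
    Trade("FB", datetime(2014, 9, 22, 9, 31, 0), 51.0, 100),
]

goog_day = on_day(trades, "GOOG", date(2014, 9, 22))
summer = between(trades, ["GOOG", "FB"], date(2014, 5, 1), date(2014, 9, 1))
counts = daily_counts(trades, ["GOOG", "FB"], date(2014, 9, 22))
```

All three filter on symbol first and then on a time range, which is the access pattern I would like the storage design to make cheap.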
The Holy Grail - Can MapReduce be used for use cases like the ones below?
- Generate metadata from the raw market data through distributed agents. For example, write a job that computes the average trading volume on a 1-minute interval for all stocks and all sessions stored in the database. The job would have an agent for each stock/session, and I would tell each agent which stock and session to compute this value for. (Is this what MapReduce can do?)
- Can I add my own util code to the classpath of the agents so that, for example, the use case above could publish its value to a central repo or messaging server? Can I deploy an agent as an OSGi bundle?
- Can I create different types of agents for different types of metrics and scores, executed every morning before pre-market trading?
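Here is how I picture the average-volume job, sketched in plain Python with the map, shuffle, and reduce steps written out by hand. This is only my mental model, not Hadoop code. The function names are hypothetical, and I am assuming "average volume on a 1-minute interval" means the mean of the per-minute totals over the minutes that had at least one trade.

```python
from collections import defaultdict
from datetime import datetime


def map_trade(symbol, timestamp, size):
    """Map step: key each trade by (symbol, session date, minute)."""
    minute = timestamp.replace(second=0, microsecond=0).time()
    return (symbol, timestamp.date(), minute), size


def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_sum(key, sizes):
    """Reduce step: total volume traded in one minute bucket."""
    return key, sum(sizes)


def average_minute_volume(trades):
    """Two chained passes: per-minute totals, then their mean per session."""
    mapped = (map_trade(symbol, ts, size) for symbol, ts, size in trades)
    minute_totals = [reduce_sum(k, v) for k, v in shuffle(mapped).items()]
    # Second pass: drop the minute from the key and average the totals.
    per_session = shuffle(
        ((symbol, session), total)
        for (symbol, session, _minute), total in minute_totals
    )
    return {key: sum(totals) / len(totals)
            for key, totals in per_session.items()}


# Made-up sample trades as (symbol, timestamp, size) tuples.
sample = [
    ("GOOG", datetime(2014, 9, 22, 9, 30, 1), 100),
    ("GOOG", datetime(2014, 9, 22, 9, 30, 40), 200),
    ("GOOG", datetime(2014, 9, 22, 9, 31, 5), 100),
    ("FB", datetime(2014, 9, 22, 9, 30, 0), 50),
]
averages = average_minute_volume(sample)
```

In this picture, the "agent per stock/session" is simply whatever handles one key in the second pass. My question is whether that matches how MapReduce actually distributes the work.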
High Frequency Trading
I'm also interested in whether anyone can share some experience using Hadoop in the context of high-frequency trading systems. I'm just getting into this technology, and my initial sense is that Hadoop can be great for storing and processing large volumes of historic tick data. If anyone is using it for real-time trading, I'd be interested in learning more! - Duncan Krebs