
I am using QuickFIX with Python bindings, along with pandas for data management.

I have been dealing with this issue for a while and have not found any clear questions/answers relating to it on SO or Google. It relates to code efficiency and architecture in a low-latency environment.

I am recording financial data. It is extremely high frequency. During fast periods, (small-sized) messages arrive every 15 milliseconds or so. QuickFIX passes each message to a message cracker that I wrote, which does the following:

  1. parses the message with re
  2. converts the datatype of each element of the message (about 8 of them in this case)
  3. updates the values of a pandas data frame with the 8 elements
  4. opens a .csv file on the local computer, appends the line of 8 elements, and closes the file (a rough sketch of this flow is below)
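For reference, a minimal sketch of that per-message flow (the regex, column names, and file layout here are hypothetical stand-ins, not the actual code):

```python
import re
import pandas as pd

# Hypothetical regex, column names and file layout -- the question doesn't show the real code.
FIELD_RE = re.compile(r"(\w+)=([^|]+)")
COLUMNS = ["bid", "ask", "bid_size", "ask_size", "last", "volume", "time", "symbol"]

latest = pd.DataFrame(index=["SYM"], columns=COLUMNS)

def on_message(raw, symbol="SYM"):
    # 1. parse the message with re
    fields = dict(FIELD_RE.findall(raw))
    # 2. convert the datatype of each of the ~8 elements
    row = [fields.get(c) if c in ("time", "symbol") else float(fields.get(c, "nan"))
           for c in COLUMNS]
    # 3. update the pandas DataFrame with the 8 elements
    latest.loc[symbol] = row
    # 4. open the .csv, append one line, close it -- this open/close happens on every message
    with open(symbol + ".csv", "a") as f:
        f.write(",".join(str(v) for v in row) + "\n")
```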

A pretty simple process, but multiply this by several hundred markets and my computer can't keep up. Anywhere between 2 and 100 times per day, the computer chokes, falls offline, and I lose about 20 seconds of data (c. 13,000 samples!).

I am presently looking at PyTables, to see if I can speed up my process. But I do not know enough about computer science to really get to the heart of the speed issue, and would appreciate some wisdom.

Is the problem the .csv file? Can I use PyTables and HDF5 to speed things up? What would be the 'right' way of doing something like this?

Wapiti
  • It could well be that Python is simply not the best tool for this job. – NPE Feb 19 '15 at 15:20
  • 1
    Is the .csv file absolutely necessary? Is it critical to update that file every time a message is cracked and parsed? Out of the four steps listed, the cost of opening the file, appending data, and closing the file seems most likely to be costly. – hunch_hunch Feb 19 '15 at 15:21
  • 1
    Updating a df with 8 element values should be quick but if you are appending repeatedly this will be slow, it sounds like you are just using pandas to store the values rather than performing any operations on the values – EdChum Feb 19 '15 at 15:23
  • 1
    @hunch_hunch The .csv file is not absolutely necessary, but I need to record the data somehow. Thus the switch to hdf5. It is not necessary to update the file every time the message is cracked and parsed, no. So long as I don't lose data I can update relatively rarely. That might be the best solution. – Wapiti Feb 19 '15 at 20:15
  • @EdChum Yes, at present just storing the *latest* value in a df and not doing much with it. Eventually I will need to perform lots of operations on these values in real time as well. This is part of my attempt to speed everything up as much as possible now. – Wapiti Feb 19 '15 at 20:19

1 Answer


Try writing the data to an in-memory database or queue, or a time-series database, and persist to disk later. There are long lists of in-memory databases and time-series databases to pick from.

Or simply try writing the time series data to MSMQ or another MQ.
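As a rough illustration of the queue idea, here is a sketch in which a plain in-process queue.Queue stands in for MSMQ or any other broker (the function name and row format are assumptions):

```python
import queue

# The cracker only parses, converts and enqueues; it never touches the disk,
# so it hands control back to QuickFIX almost immediately.
write_queue = queue.Queue()

def on_cracked_message(row):
    # `row` is the list of 8 converted elements produced by the message cracker
    write_queue.put(row)
```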

I now read that pandas is a kind of in-memory database for Python; I had not heard of it. There are a lot of options in those lists above! Further, to answer some of your questions, the right way to do this is to think about each operation at each "layer" of the price persistence. First, your network I/O operations are going to be more expensive than your CPU. Second, your network will be prone to data storms where the bandwidth is overwhelmed, so you need a big data pipe for the bursts.

So, each price message arrives at your network interface and has to get through the TCP stack (presumably) before it hits your CPU. UDP is quicker than TCP, but if you're using QuickFIX you'll be using TCP.

Then, as the datagram is unwrapped through the OSI layers, it gets to your CPU, which starts to cycle through the instructions, assuming the CPU isn't busy elsewhere. Here, basically, if you have any I/O (to disk or anywhere else) your CPU is going to spin waiting for the I/O to complete. The OS may context-switch to another process if your process priority is too low. Alternatively, your threading model may be causing too many context switches; it's better to use a single thread for this kind of thing so there aren't any.

The "right" way to do this is you want to get the data from the NIC, through the CPU and into an MQ or memory space. Then, you can bulk write to a database later... (every minute or whatnot YOU DECIDE)

rupweb
  • Great stuff here, thanks. Some questions: Is there a FIX engine that doesn't use TCP? (Pandas is pretty useful for a lot of things, I recommend it.) Any suggestions for a python MQ? I found this: https://pypi.python.org/pypi/kombu/ . My feeling was that I should do it the way you outlined in your last paragraph. It seems I should be able to have a rolling window of most recent in-memory data which, whenever it becomes old, is saved by the data management package, not my own code. Does this exist? I don't know much about databases, as you can tell. – Wapiti Feb 24 '15 at 01:56
  • I gather [openfast](http://www.openfast.org) can use UDP, but I don't know of a UDP FIX implementation, probably because orders and execution reports are risk critical, so you need guaranteed delivery. For the rolling window of in-memory data, use an array or collection of x quotes, as many as you need, and add to the top (remove from the end and persist a quote once it's no longer useful, but watch the cost of the DB call; you'll probably need another thread for persistence). – rupweb Feb 25 '15 at 17:13
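A minimal sketch of that rolling window (the window size, names, and persistence hand-off are all assumptions):

```python
from collections import deque

WINDOW = 1000          # keep the most recent 1000 quotes in memory (arbitrary size)
window = deque()
stale = []             # quotes that fell out of the window, waiting to be persisted

def add_quote(quote):
    window.appendleft(quote)          # add to the top
    if len(window) > WINDOW:
        stale.append(window.pop())    # remove from the end, queue for the DB write

# A separate thread would periodically bulk-write `stale` to the database,
# so the DB call never blocks the quote-handling path.
```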