
I have a large number of CSV data files, and each file contains several days' worth of tick data for one ticker in the following form:

 ticker  DD/MM/YYYY    time         bid      ask
  XXX,   19122014,  08:00:08.325,  9929.00,9933.00
  XXX,   19122014,  08:00:08.523,  9924.00,9931.00
  XXX,   19122014,  08:00:08.722,  9925.00,9930.50
  XXX,   19122014,  08:00:08.921,  9924.00,9928.00
  XXX,   19122014,  08:00:09.125,  9924.00,9928.00
  …
  XXX,   30122014,  21:56:25.181,  9795.50,9796.50
  XXX,   30122014,  21:56:26.398,  9795.50,9796.50
  XXX,   30122014,  21:56:26.598,  9795.50,9796.50
  XXX,   30122014,  21:56:26.798,  9795.50,9796.50
  XXX,   30122014,  21:56:28.896,  9795.50,9796.00
  XXX,   30122014,  21:56:29.096,  9795.50,9796.50
  XXX,   30122014,  21:56:29.296,  9795.50,9796.00
  …

I need to extract any lines of data whose time is within a certain range, say 09:00:00 to 09:15:00. My current solution is simply reading each data file into a data frame, sorting it by time and then using `searchsorted` to find the 09:00:00 to 09:15:00 window. It works fine if performance isn't an issue and I don't have 1000 files waiting to be processed. Any suggestions on how to boost the speed? Thanks for the help in advance!
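
For reference, a minimal sketch of that current approach; the file name, column names and the exact pandas calls are my assumptions rather than the actual code:

import numpy as np
import pandas as pd

# Read one file, sort by the time column, then binary-search the window.
df = pd.read_csv('ticker.csv', header=0, skipinitialspace=True,
                 names=['ticker', 'date', 'time', 'bid', 'ask'])
df = df.sort_values('time')
times = df['time'].to_numpy()
lo = np.searchsorted(times, '09:00:00.000')               # first row at/after 09:00
hi = np.searchsorted(times, '09:15:00.000', side='right')  # first row after 09:15
window = df.iloc[lo:hi]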

user6396
  • Given your description of your solution, you could probably gain some speed just by processing the lines as you read them instead of reading the entire file in, then doing a sort and then processing them. If you just checked whether the time was in the valid range for each line as you read it, you'd save the whole sorting step, which is probably pretty slow. – Eric Renouf Apr 18 '15 at 12:22
  • How big are we talking here? You could read all the csv's into a list of df's, then `concat` them all, set the index to be the time and then filter using `df.loc[(df.index.time > '09:00:00') & (df.index.time < '09:15:00')]`; whilst loading them you could sort on index and chuck away the ones that don't have that range anyway (a sketch of this idea follows these comments). – EdChum Apr 18 '15 at 16:13
  • Yes, it would certainly work, but isn't it just as slow as what I did? Currently I am dealing with about 4 GB of data in total, and I might be working with 80 GB. – user6396 Apr 18 '15 at 17:50
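
A rough sketch of the idea in that comment, assuming pandas and the column layout shown above; it swaps the `.loc` expression for `between_time`, which needs a `DatetimeIndex`, and the file pattern and date format are assumptions:

import glob
import pandas as pd

frames = []
for path in glob.glob('*.csv'):  # hypothetical file pattern
    df = pd.read_csv(path, header=0, skipinitialspace=True,
                     names=['ticker', 'date', 'time', 'bid', 'ask'],
                     dtype={'date': str})
    # Build a DatetimeIndex from the date (DDMMYYYY) and time columns
    df.index = pd.to_datetime(df['date'] + ' ' + df['time'],
                              format='%d%m%Y %H:%M:%S.%f')
    frames.append(df)

combined = pd.concat(frames).sort_index()
window = combined.between_time('09:00:00', '09:15:00')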

2 Answers


Short answer: put your data in an SQL database, and give the "time" column an index. You can't beat that with CSV files - using Pandas or not.
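
As an illustration of that route using only the standard library, the files could be loaded once into SQLite and the range query then becomes an indexed lookup; the database, table and column names below are just placeholders:

import csv
import glob
import sqlite3

conn = sqlite3.connect('ticks.db')
conn.execute("""CREATE TABLE IF NOT EXISTS ticks
                (ticker TEXT, date TEXT, time TEXT, bid REAL, ask REAL)""")

# One-off load of all CSV files into the database
for path in glob.glob('*.csv'):  # hypothetical file pattern
    with open(path, newline='') as f:
        reader = csv.reader(f, skipinitialspace=True)
        next(reader)  # skip the header line
        conn.executemany("INSERT INTO ticks VALUES (?, ?, ?, ?, ?)", reader)

conn.execute("CREATE INDEX IF NOT EXISTS idx_time ON ticks (time)")
conn.commit()

# The actual query: an indexed range scan on the time column
rows = conn.execute("SELECT * FROM ticks WHERE time BETWEEN ? AND ?",
                    ('09:00:00.000', '09:15:00.000')).fetchall()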

Without changing your CSV files, one thing that would be a little faster, though not by much, would be to filter the rows as you read them - keeping in memory just the ones that interest you.

So instead of just getting the whole CSV into memory, a function like this could do the job:

import csv

def filter_time(filename, mintime, maxtime):
    # Column layout: ticker, date, time, bid, ask -> the time is column 2
    timecol = 2
    with open(filename, newline='') as f:
        reader = csv.reader(f, skipinitialspace=True)
        next(reader)  # skip the header line
        # zero-padded HH:MM:SS.mmm strings compare correctly as plain strings
        return [line for line in reader
                if mintime <= line[timecol] <= maxtime]

This task can be easily parallelized - you could get several instances of this running concurrently before maxing out the I/O on your device, I'd guess. One painless way to do that would be the lelo Python package - it provides a decorator that makes the decorated function run in another process when called and returns a lazy proxy for the results.
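
If you prefer to stay in the standard library, a process pool gives a similar effect - a rough sketch that reuses the `filter_time` function above and a hypothetical file pattern:

from concurrent.futures import ProcessPoolExecutor
from functools import partial
from glob import glob

def main():
    files = glob('*.csv')  # hypothetical file pattern
    job = partial(filter_time, mintime='09:00:00.000', maxtime='09:15:00.000')
    with ProcessPoolExecutor() as pool:
        # each file is filtered in a separate worker process
        for filename, rows in zip(files, pool.map(job, files)):
            print(filename, len(rows))

if __name__ == '__main__':
    main()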

But that will still have to read everything in - I think the SQL solution should be at least one order of magnitude faster.

jsbueno

My solution would be to read the file line by line, saving only the lines that pass your filter:

a, b = '09:00:00.000', '09:15:00.000'  # the time range to keep

with open('filename.csv') as fin, open('fileout.csv', 'w') as fout:
    next(fin)  # skip the header line
    for line in fin:
        # the time is the third comma-separated field; strip surrounding spaces
        time_x = line.split(',')[2].strip()
        if a <= time_x <= b:
            fout.write(line)
Gioelelm
  • You should use the stdlib csv module whenever reading csv files. Manually parsing as you do here works in most cases, but it is prone to errors (for example, on quoted CSV values that might contain "," themselves) and these will bite you when you least expect it. Also, since the problem here is performance, the stdlib implementation will by default be faster than manually processing the strings. (One would have to properly profile the two methods to be sure, but take that as a rule of thumb.) – jsbueno Apr 18 '15 at 13:42