
I have found a few similar questions here on Stack Overflow, but I believe I could benefit from advice specific to my case.

I must store around 80 thousand lists of real-valued numbers in a file and read them back later.

First, I tried cPickle, but the reading time wasn't appealing:

>>> stmt = """
with open('pickled-data.dat') as f:
    data = cPickle.load(f)
"""
>>> timeit.timeit(stmt, 'import cPickle', number=1)
3.8195440769195557

Then I found out that storing the numbers as plain text allows faster reading (makes sense, since cPickle must worry about a lot of things):

>>> stmt = """
data = []
with open('text-data.dat') as f:
    for line in f:
        data.append([float(x) for x in line.split()])
"""
>>> timeit.timeit(stmt, number=1)
1.712096929550171

This is a good improvement, but I think I could still optimize it somehow, since programs written in other languages can read similar data from files considerably faster.

Any ideas?

erickrf
  • If you are storing so many lists, wouldn't a sqlite database be a better data structure? – BrtH Aug 02 '12 at 14:28
  • Did you try the `csv` module's reader yet? It would avoid the manual `split` you call. – jmetz Aug 02 '12 at 14:28
  • @BrtH a database seems like overkill, I only need to load all these lists. – erickrf Aug 02 '12 at 15:46
  • @mutzmatron yes, it does the splitting automatically, but is a little slower than my second version. – erickrf Aug 02 '12 at 15:47
  • I would recommend mgilson's second answer - using `numpy`'s `fromfile` if your data allows it - that's likely to be one of the fastest options. – jmetz Aug 02 '12 at 15:49
  • Do you control the format of the .dat file? If so it might be worth trying to mmap the file and use ctypes.from_buffer. In theory you save a lot of copying etc... – Tim Hoffman Aug 02 '12 at 15:58
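
For reference, the csv-based approach suggested in the comments above might look roughly like this (a sketch, assuming the file is space-delimited; the delimiter and skipinitialspace settings are my guesses, not something from the question):

import csv

data = []
with open('text-data.dat') as f:
    # csv does the splitting; the float conversion still happens per value.
    reader = csv.reader(f, delimiter=' ', skipinitialspace=True)
    for row in reader:
        data.append([float(x) for x in row])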

1 Answer


If numpy arrays are workable, numpy.fromfile will likely be the fastest option to read the files (here's a somewhat related question I asked just a couple of days ago).
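
For example, a minimal sketch of what that could look like, assuming every list has the same length (the write_array/read_array names and the row_len parameter are just for illustration):

import numpy as np

def write_array(filename, data, dtype=np.float32):
    # Pack the equal-length lists into one contiguous binary block.
    np.asarray(data, dtype=dtype).tofile(filename)

def read_array(filename, row_len, dtype=np.float32):
    # Read the flat block back and restore one row per original list.
    return np.fromfile(filename, dtype=dtype).reshape(-1, row_len)

If the lists have different lengths, you would need to store the lengths separately (much as the struct version below does) or pad the rows.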

Alternatively, it seems like you could do a little better with struct, though I haven't tested it:

import struct

def write_data(f, data):
    # Write the number of records, then each record as (length, values...).
    f.write(struct.pack('i', len(data)))
    for lst in data:
        f.write(struct.pack('i%df' % len(lst), len(lst), *lst))

def read_data(f):
    def read_record(f):
        nelem = struct.unpack('i', f.read(4))[0]
        return list(struct.unpack('%df' % nelem, f.read(nelem * 4)))  # if tuples are OK, remove the `list`

    nrec = struct.unpack('i', f.read(4))[0]
    return [read_record(f) for i in range(nrec)]

This assumes that storing the data as 4-byte floats is good enough. If you want real double-precision numbers, change the format codes from f to d and change nelem*4 to nelem*8. There might be some minor portability issues here (endianness and the sizes of the data types, for example).
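
For example, a double-precision read_record might look like this, with explicit little-endian formats to sidestep the portability caveat (the '<' prefix and the read_record_double name are my additions, not part of the answer above):

def read_record_double(f):
    # '<i' and '<d' use fixed sizes and little-endian byte order on every platform.
    nelem = struct.unpack('<i', f.read(4))[0]
    return list(struct.unpack('<%dd' % nelem, f.read(nelem * 8)))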

mgilson
  • I suspect the numpy fromfile is relatively optimized and likely to be a better solution than the struct, circumstances permitting. – jmetz Aug 02 '12 at 14:38
  • @mutzmatron -- Yeah, probably -- Although really there shouldn't be too much difference. I'd bet they do almost the same thing -- numpy has the advantage that it can put the objects in sequential memory (essentially only needing to allocate 1 block and 1 pointer doing pointer arithmetic). struct on the other hand probably will still allocate only one block, but it also needs a pointer for each float which is read, so that's a little extra work there. Either way, I would expect it to be faster than reading ascii text, converting it to floats (all done within the un-typed python framework). – mgilson Aug 02 '12 at 14:41
  • Thank you, `numpy.fromfile` took around 0.02 seconds. – erickrf Aug 02 '12 at 16:04
  • @mgilson - I would suggest you edit your answer to reflect that `numpy.fromfile` was the chosen answer - perhaps suggest `numpy.fromfile` first and add the `struct` idea as an alternative – jmetz Aug 02 '12 at 16:05