14

I know there have been some questions regarding file reading, binary data handling and integer conversion using struct before, so I come here to ask about a piece of code I have that I think is taking too much time to run. The file being read is a multichannel datasample recording (short integers), with intercalated intervals of data (hence the nested for statements). The code is as follows:

# channel_content is a dictionary, channel_content[channel]['nsamples'] is a string
for rec in xrange(number_of_intervals):
    for channel in channel_names:
        channel_content[channel]['recording'].extend(
            [struct.unpack( "h", f.read(2))[0]
            for iteration in xrange(int(channel_content[channel]['nsamples']))])

With this code, I get 2.2 seconds per megabyte read on a dual-core machine with 2 Mb RAM, and my files are typically 20+ Mb, which causes a very annoying delay (especially considering that another benchmark shareware program I am trying to mirror loads the file WAY faster).

What I would like to know:

  1. If there is some violation of "good practice": bad-arranged loops, repetitive operations that take longer than necessary, use of inefficient container types (dictionaries?), etc.
  2. If this reading speed is normal, or normal for Python, and if reading speed could be improved.
  3. If creating a C++ compiled extension would be likely to improve performance, and if it would be a recommended approach.
  4. (of course) If anyone suggests some modification to this code, preferably based on previous experience with similar operations.

Thanks for reading

(I have already posted a few questions about this job of mine, I hope they are all conceptually unrelated, and I also hope not being too repetitive.)

Edit: channel_names is a list, so I made the correction suggested by @eumiro (remove typoed brackets)

Edit: I am currently going with Sebastian's suggestion of using array with the fromfile() method, and will soon put the final code here. Besides, every contribution has been very useful to me, and I very gladly thank everyone who kindly answered.

Final form, after reading the whole file with array.fromfile() once, and then extending one array per channel by slicing the big array:

fullsamples = array('h')
fullsamples.fromfile(f, os.path.getsize(f.filename)/fullsamples.itemsize - f.tell())
position = 0
for rec in xrange(int(self.header['nrecs'])):
    for channel in self.channel_labels:
        samples = int(self.channel_content[channel]['nsamples'])
        self.channel_content[channel]['recording'].extend(
                                                fullsamples[position:position+samples])
        position += samples

The speed improvement was very impressive over reading the file a bit at a time, or using struct in any form.
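The read-once-then-slice pattern above can be sketched as a self-contained example (Python 3 syntax; the two-channel layout and sample counts here are made up for illustration):

```python
import io
from array import array

# Hypothetical layout: 2 channels, 3 intervals, 4 samples per channel per interval.
channel_names = ['ch1', 'ch2']
nrecs, nsamples = 3, 4

# Fake interleaved recording: one block per channel, repeated for every interval.
raw = array('h', range(nrecs * len(channel_names) * nsamples))
f = io.BytesIO(raw.tobytes())

# Read everything in one shot, then carve it up by slicing.
fullsamples = array('h')
fullsamples.frombytes(f.read())

recording = {ch: array('h') for ch in channel_names}
position = 0
for rec in range(nrecs):
    for ch in channel_names:
        recording[ch].extend(fullsamples[position:position + nsamples])
        position += nsamples

print(list(recording['ch1']))  # [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19]
```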

martineau
heltonbiker
  • Are you sure about `for channel in [channel_names]:`? This runs one loop with `channel = channel_names`. Also, where do you get all the values within `channel_content[channel]`? – eumiro Apr 27 '11 at 12:29
  • Did you try to profile your code? – 9000 Apr 27 '11 at 12:43
  • @eumiro: you are right, there should not be the brackets, since channel_names is actually a list of strings (the names themselves). – heltonbiker Apr 27 '11 at 13:00
  • @9000: I did some "amateur" profiling with a lot of `print "starting this"` and `print "ending this"`, but I confess I do not know for sure how to properly profile, much less what to do to increase performance given the profiling results. – heltonbiker Apr 27 '11 at 13:04
  • Did you mean `(os.path.getsize(f.filename) - f.tell()) / fullsamples.itemsize`? – jfs Apr 27 '11 at 20:49

5 Answers

16

You could use array to read your data:

import array
import os

fn = 'data.bin'
a = array.array('h')
a.fromfile(open(fn, 'rb'), os.path.getsize(fn) // a.itemsize)

It is about 40x faster than the struct.unpack approach in @samplebias's answer.

jfs
  • nothing like experts! I will obviously test your suggestion, very soon we'll know the results. The only "problem" is that not all the file should be read, only from byte 'N' to the end, 'N' being some number described in the file header. Many thanks! – heltonbiker Apr 27 '11 at 14:42
  • @heltonbiker: You could use `.fromfile()` to read from any place in a file as long as you give it a file object (you can read from it before/after `.fromfile()` call) and number of items to read. – jfs Apr 27 '11 at 14:52
  • I voted in your answer as my chosen one, because `array` is described as "efficient" in the docs, and it has the `fromfile()` method. But I wonder (since there are four channels with short intervals alternately recorded in the file) if reading all the file at once and then slicing the big array to the proper channels would be better than initializing four arrays and reading the proper sections directly into them. Now I'll be out for lunch, but soon I post the result. Thanks again, and thanks also to anyone who contributed. – heltonbiker Apr 27 '11 at 15:06
  • @heltonbiker: `array.array()` supports slicing and you can call `.fromfile()` multiple times. You could try either way and see what is faster in your case. – jfs Apr 27 '11 at 15:18
  • It seems calling `.fromfile()` multiple times became slower than the `struct` approach. I will try creating a single array and slicing from it to extend each channel's array. – heltonbiker Apr 27 '11 at 15:24
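The comment above about reading from an arbitrary position can be sketched as follows (Python 3; the in-memory file and 10-byte header size are made up): seek past the header, then let .fromfile() pull exactly the remaining items.

```python
import io
from array import array

# Fake file: a hypothetical 10-byte header followed by five short integers.
header = b'HDR' + b'\x00' * 7
payload = array('h', [10, 20, 30, 40, 50])
f = io.BytesIO(header + payload.tobytes())

# Position the file object past the header, then read exactly the remaining items.
f.seek(10)
a = array('h')
remaining = (len(f.getvalue()) - f.tell()) // a.itemsize
a.fromfile(f, remaining)
print(list(a))  # [10, 20, 30, 40, 50]
```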
7

If the files are only 20-30M, why not read the entire file, decode the numbers in a single call to unpack, and then distribute them among your channels by iterating over the array?

data = open('data.bin', 'rb').read()
values = struct.unpack('%dh' % (len(data) // 2), data)
del data
# iterate over channels, and assign from values using indices/slices

A quick test showed this resulted in a 10x speedup over struct.unpack('h', f.read(2)) on a 20M file.
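As a runnable illustration of the single-call unpack and the slice-based distribution step (Python 3; the sample values and the two-channel split are made up):

```python
import struct
from array import array

# Hypothetical data: eight shorts for two channels, four samples each.
data = array('h', [1, -2, 3, -4, 5, -6, 7, -8]).tobytes()

# Decode every value with one unpack call instead of one call per sample.
values = struct.unpack('%dh' % (len(data) // 2), data)

# Distribute among channels using slices.
channels = {'a': values[:4], 'b': values[4:]}
print(channels['a'])  # (1, -2, 3, -4)
```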

samplebias
  • This looks like a very fine suggestion, but I couldn't quite capture the syntax of struct() arguments. I'll take a look at the docs. – heltonbiker Apr 27 '11 at 13:07
  • From the [struct docs](http://docs.python.org/library/struct.html#format-characters): "A format character may be preceded by an integral repeat count. For example, the format string '4h' means exactly the same as 'hhhh'." – samplebias Apr 27 '11 at 13:33
  • `array.array` allows you to read a 20M file 40x time faster http://stackoverflow.com/questions/5804052/improve-speed-of-reading-and-converting-from-binary-file-with-python/5805696#5805696 – jfs Apr 27 '11 at 14:34
  • @JF definitely, `array.fromfile` is even faster. – samplebias Apr 27 '11 at 14:45
2

A single array fromfile call is definitely the fastest, but it won't work if the data series is interleaved with other value types.

In such cases, another big speed increase that can be combined with the previous struct answers is to precompile a struct.Struct object with the format for each chunk, instead of calling the unpack function many times. From the docs:

Creating a Struct object once and calling its methods is more efficient than calling the struct functions with the same format since the format string only needs to be compiled once.

So for instance, if you wanted to unpack 1000 interleaved shorts and floats at a time, you could write:

chunksize = 1000
structobj = struct.Struct("hf" * chunksize)
while True:
    chunkdata = structobj.unpack(fileobj.read(structobj.size))

(Note that the example is only partial and needs to account for changing the chunksize at the end of the file and breaking the while loop.)
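A fuller sketch of that loop, handling the shorter final chunk and the loop exit (Python 3; the interleaved test data is generated in memory for illustration, and '<' fixes a standard unaligned layout so sizes are deterministic):

```python
import io
import struct

# Fake stream of interleaved (short, float) pairs; 2500 pairs, so the
# last chunk is shorter than the others.
npairs = 2500
payload = b''.join(struct.pack('<hf', i % 100, float(i)) for i in range(npairs))
fileobj = io.BytesIO(payload)

chunksize = 1000
full = struct.Struct('<' + 'hf' * chunksize)   # compiled once, reused per chunk
pairsize = struct.calcsize('<hf')

values = []
while True:
    chunkdata = fileobj.read(full.size)
    if not chunkdata:
        break
    if len(chunkdata) == full.size:
        values.extend(full.unpack(chunkdata))
    else:
        # Final, shorter chunk: build a smaller Struct for the leftover pairs.
        tail = struct.Struct('<' + 'hf' * (len(chunkdata) // pairsize))
        values.extend(tail.unpack(chunkdata))

print(len(values))  # 5000: one short and one float per pair
```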

Karim Bahgat
1

Not sure if it would be faster, but I would try to decode chunks of words instead of one word at a time. For example, you could read 100 bytes of data at a time:

s = f.read(100)
struct.unpack(str(len(s)/2)+"h", s)
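A runnable version of this chunked-read idea, looping until end of file, might look like this (Python 3; the in-memory test data is made up):

```python
import io
import struct

# Fake binary input: 130 shorts, so the final read returns fewer than 100 bytes.
f = io.BytesIO(struct.pack('130h', *range(130)))

values = []
while True:
    s = f.read(100)                 # up to 50 shorts per read
    if not s:
        break
    values.extend(struct.unpack('%dh' % (len(s) // 2), s))

print(len(values))  # 130
```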

Charles Brunet
1

extend() accepts iterables; that is, instead of .extend([...]) you can write .extend(...). This is likely to speed up the program because extend() will then consume a generator instead of a fully built list.

There is an inconsistency in your code: you first define channel_content = {}, and then you perform channel_content[channel]['recording'].extend(...), which requires the prior existence of a key channel and a subkey 'recording' with a list as its value in order to have something to extend.

What is the nature of self.channel_content[channel]['nsamples'], such that it can be passed to the int() function?

Where does number_of_intervals come from? What is the nature of the intervals?

In the rec in xrange(number_of_intervals): loop, I don't see rec used anywhere. So it seems to me that you are repeating the same for channel in channel_names: loop as many times as the number expressed by number_of_intervals. Are there number_of_intervals * int(self.channel_content[channel]['nsamples']) * 2 values to read in f?

I read in the doc:

class struct.Struct(format)

Return a new Struct object which writes and reads binary data according to the format string format. Creating a Struct object once and calling its methods is more efficient than calling the struct functions with the same format since the format string only needs to be compiled once.

This expresses the same idea as samplebias.

If your aim is to create a dictionary, there is also the possibility of using dict() with a generator as its argument.

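As an illustration of the dict()-with-a-generator idea (the channel names here are hypothetical):

```python
from array import array

channel_names = ['EEG1', 'EEG2', 'EMG']  # hypothetical channel labels

# Build all per-channel containers in one expression instead of nested assignments.
channel_content = dict((name, {'recording': array('h')}) for name in channel_names)

channel_content['EEG1']['recording'].extend([1, 2, 3])
print(sorted(channel_content))  # ['EEG1', 'EEG2', 'EMG']
```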

EDIT

I propose

for rec in xrange(number_of_intervals):
    for channel in channel_names:
        N = int(self.channel_content[channel]['nsamples'])
        self.channel_content[channel]['recording'].extend(
            struct.unpack(str(N) + "h", f.read(2 * N)))

I don't know how to take account of J.F. Sebastian's suggestion to use array.

eyquem
  • your attention is much welcome, thanks for it. `self.channel_content[channel]['nsamples']` is a string read from ascii field in the file header, so as `number_of_intervals`. You were right about `channel_content`, that line should not be there. The brackets were put inside `extend()` because it was a list comprehension, but now I am creating `values` separately, according to @samplebias suggestion. I will study the Struct() object as you suggested. Thank you very very much. – heltonbiker Apr 27 '11 at 14:10
  • as of `rec in xrange(number_of_intervals)`, you guessed right: `rec` is just a dummy variable, the idea is to loop `number_of_intervals` times, because it gets (has to get, according to fileformat spec) the correct file size. It is working fine, specially after all these modifications suggested here. – heltonbiker Apr 27 '11 at 14:29
  • surely I will make his suggestion to work, since the very description of array as an "efficient" alternative to list, plus the ability to read directly from file, makes it a very very strong candidate, as Sebastian had put it himself. – heltonbiker Apr 27 '11 at 15:01