
I have thousands of binary files that I have to read and store in memory to work on the data. I already have a function that lets me read those data, but I would like to improve it, because it is kind of slow.

The data are organized this way:

  • 1000 cubes.
  • each cube is written in 10 binary files.

For the moment I have a reading function that reads and returns ONE cube as a numpy array (read_1_cube). Then I loop over all the files to extract all the cubes and I concatenate them.

import numpy as np

def read_1_cube( dataNum ):
    ### read the 10 subfiles and concatenate the arrays along the last axis
    ### N is the cube edge length, defined globally
    N_subfiles = 10
    fnames_subfiles = ( '%d_%d'%(dataNum,k) for k in range(N_subfiles) )
    return np.concatenate( [np.fromfile( open(fn,'rb'), dtype=float, count=N*N*N ).reshape((N,N,N)) for fn in fnames_subfiles], axis=2 )

TotDataNum = 1000
my_full_data = np.concatenate( [read_1_cube( d ) for d in range( TotDataNum )], axis=0 )
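
(For anyone who wants to try the snippets, here is a minimal sketch that writes small dummy subfiles matching the '%d_%d' naming scheme used above; the sizes are placeholders of mine, not the real data.)

import numpy as np

# placeholder sizes, only to exercise read_1_cube on toy data
N = 8
N_subfiles = 10
TotDataNum = 2

for dataNum in range(TotDataNum):
    for k in range(N_subfiles):
        # raw float64 binary, readable by np.fromfile(..., dtype=float)
        np.random.rand(N, N, N).tofile('%d_%d' % (dataNum, k))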

I tried to work with generators to limit the amount of memory used. With these functions it takes ~2.5 s per file, so ~45 min for the 1000 files. In the end I will have 10000 files, so it is not doable (of course, I will not read the 10000 files at once, but still, I cannot work if it takes 1 h for 1000 files).

My questions:

  • do you know a way to optimize read_1_cube and the generation of my_full_data?
  • do you see a better way (without read_1_cube)?
  • another optimization path: do you know if there is a concatenate function that can work on a generator (like sum(), min(), max(), list()...)?

Edit: Following the comment of @liborm about np.concatenate, I found other equivalent functions (stack concatenate question): np.r_, np.stack, np.hstack. The good point is that stack can take a generator as input. So I push as far as possible with generators, to create the actual data array only once at the end.

def read_1_cube( dataNum ):
    ### read the 10 subfiles and return a generator of cubes
    N_subfiles = 10
    fnames_subfiles = ( '%d_%d'%(dataNum,k) for k in range(N_subfiles) )
    return (np.fromfile( open(fn,'rb'), dtype=float, count=N*N*N ).reshape((N,N,N)) for fn in fnames_subfiles)

def read_N_cube( datanum ):
    ### make a generator of 'cube generators', stacked only at the end
    N_subfiles = 10
    C = ( np.stack( read_1_cube( d ), axis=2 ).reshape((N,N,N*N_subfiles)) for d in range(datanum) )
    return np.stack( C ).reshape( (datanum*N, N, N*N_subfiles) )

### The full allocation is done here, just once
my_full_data = read_N_cube( TotDataNum )
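
As a small sanity check (with toy arrays of my own, not the real data), the stack-then-reshape trick used above gives the same result as a plain concatenate along the last axis:

import numpy as np

N = 4
cubes = [np.random.rand(N, N, N) for _ in range(10)]

stacked = np.stack(cubes, axis=2).reshape((N, N, N * 10))
concatenated = np.concatenate(cubes, axis=2)

print(np.array_equal(stacked, concatenated))   # True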

It is quicker than the first version: the first version needed 2.4 s to read 1 file, while the second takes 6.2 s to read 10 files!
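
One way to reproduce such a per-file timing (a sketch of my own; it assumes the functions above are defined and the data files exist):

import time

t0 = time.perf_counter()
data = read_N_cube(10)                     # 10 cubes = 100 subfiles
print('%.1f s for 10 cubes' % (time.perf_counter() - t0))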

I think there is not much room left for optimization, but I am sure that there is still a better algorithm out there!

N.G
  • Usually the most effective way for this kind of problem is to pre-allocate the resulting object and then incrementally write your data into it. All the `np.concatenate`s mean a lot of (unneeded) allocations. – liborm May 26 '17 at 12:44
  • Concatenate takes a list of arrays. – hpaulj May 26 '17 at 13:07
  • Doing concatenate on sets of 10 arrays, loaded with fromfile, looks reasonable. And then doing another on 1000 of those. Given the size of the job I don't see room for speedup or memory savings. – hpaulj May 26 '17 at 13:25
  • `sum()` can work with a generator, but `np.sum` requires an array or list that it can turn into an array. – hpaulj May 26 '17 at 13:27
  • I'm confused by this talk of files and subfiles. And what kind of 'file' takes 2.5 s to read? It looks like the `fromfile` step is the same. – hpaulj May 26 '17 at 18:57

2 Answers


To get good performance (generally) you want to allocate as little as possible - this should allocate only the big array beforehand, and then each of the small ones during the read. Using stack or concatenate will probably (re)allocate memory and copy data around...

I don't have the data to test it, so consider this rather as 'pseudocode':

def read_one(d, i):
    ### read a single subfile into an (N, N, N) array
    fn = '%d_%d' % (d, i)
    return np.fromfile(open(fn,'rb'), dtype=float, count=N*N*N).reshape((N,N,N))

# pre-allocate the full result once, then write each subfile directly into its slice
res = np.zeros((N * TotDataNum, N, N * N_subfiles))
for dat in range(TotDataNum):
    ax0 = N * dat
    for idx in range(N_subfiles):
        ax2 = N * idx
        res[ax0:ax0+N, :, ax2:ax2+N] = read_one(dat, idx)
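
The sketch assumes N, N_subfiles and TotDataNum already exist; one possible setup is shown below (the values are placeholders of mine, not the real sizes). Using np.empty instead of np.zeros would also skip the initial zero-fill, since every slot gets overwritten in the loop anyway:

import numpy as np

N = 50             # placeholder cube edge length
N_subfiles = 10
TotDataNum = 4     # a handful of cubes for a quick test

# np.empty skips zero-initialisation; every element is written in the loop above
res = np.empty((N * TotDataNum, N, N * N_subfiles))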
liborm
  • My first try looked like this, then I got interested in 'generators' and changed it. I will try it and let you know the result. – N.G May 29 '17 at 07:59
  • Generators are awesome when you need something like a single pass over the data; in your case imagine e.g. finding the sum of every single file - with generators you can write this concisely and efficiently at the same time. – liborm May 29 '17 at 08:12
  • In the end I am using this method; it is the one I found to be the fastest. – N.G Jun 06 '17 at 10:30

I haven't fully worked out why your change has sped things up, but I doubt if it's the use of stack.

To test this:

In [30]: N=50
In [31]: arr = np.arange(N*N).reshape(N,N)
In [32]: np.stack((arr for _ in range(N*10))).shape
Out[32]: (500, 50, 50)
In [33]: np.concatenate([arr for _ in range(N*10)]).shape
Out[33]: (25000, 50)

times:

In [34]: timeit np.stack((arr for _ in range(N*10))).shape
100 loops, best of 3: 2.45 ms per loop
In [35]: timeit np.stack([arr for _ in range(N*10)]).shape
100 loops, best of 3: 2.43 ms per loop
In [36]: timeit np.concatenate([arr for _ in range(N*10)]).shape
1000 loops, best of 3: 1.56 ms per loop

The use of a generator expression in stack has no advantage. concatenate doesn't take a generator, but is faster.

np.stack is just a convenience wrapper around concatenate. Its code is:

arrays = [asanyarray(arr) for arr in arrays]
...
expanded_arrays = [arr[sl] for arr in arrays]
return _nx.concatenate(expanded_arrays, axis=axis)

It does 2 list comprehensions on the input argument: once to make sure the inputs are arrays, and again to add a dimension. That explains why it accepts a generator, and why it is slower.

concatenate is compiled numpy code, and expects a 'sequence':

TypeError: The first input argument needs to be a sequence
hpaulj
  • I do not think that the improvement comes directly from np.stack (as you demonstrate), but from the fact that the array is only built once at the end. I am also sorry because I mixed up 'generators' and 'list comprehensions' (it is not really clear to me). – N.G May 29 '17 at 08:02