
I am trying to implement the solution given in this answer to read my ~3.3GB ASCII file into an ndarray.

However, I am getting a MemoryError when using this function on my file:

def iter_loadtxt(filename, delimiter=None, skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        iter_loadtxt.rowlength = len(line)

    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

data = iter_loadtxt(fname,skiprows=1)

I am now trying to pass different dtypes in the call to np.fromiter, hoping that, since most of my columns are integers rather than floats, the smaller item size will be enough to avoid the memory issue, but I have had no success so far.

My file is "many rows" × 7 cols, and I'd like to specify the following formats: float for the first three columns and uint for the remaining four. My OS is Windows 10 64-bit with 8GB of RAM, and I am using Python 2.7 32-bit.

My try was (following this answer):

data = np.fromiter(iter_func(), dtype=[('',np.float),('',np.float),('',np.float),('',np.int),('',np.int),('',np.int),('',np.int)])

but I receive `TypeError: expected a readable buffer object`

EDIT1

Thanks to hpaulj who provided the solution. Below is the working code.

def iter_loadtxt(filename, delimiter=None, skiprows=0, dtype=float):
    def iter_func():
        dtypes = [float, float, float, int, int, int, int]
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                values = [t(v) for t, v in zip(dtypes, line)]
                yield tuple(values)
        iter_loadtxt.rowlength = len(line)

    data = np.fromiter(iter_func(), dtype=[('',np.float),('',np.float),('',np.float),('',np.int),('',np.int),('',np.int),('',np.int)])

    return data

data = iter_loadtxt(fname,skiprows=1)
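As a side note (a sketch with made-up sample rows, not from the original post): because the fields in the dtype were declared with empty names, NumPy auto-assigns them the names `f0` through `f6`, so individual columns of the resulting structured array can be pulled out by field name:

```python
import numpy as np

# Hypothetical two-row sample in the same 3-float / 4-int layout as the question
rows = [(1.0, 2.0, 3.0, 4, 5, 6, 7),
        (8.0, 9.0, 10.0, 11, 12, 13, 14)]
dt = [('', np.float64)] * 3 + [('', np.int32)] * 4
data = np.fromiter(rows, dtype=dt)

# Empty field names are auto-assigned f0, f1, ..., f6
print(data.dtype.names)
print(data['f0'])  # first float column
print(data['f3'])  # first int column
```

This is why no reshape is needed anymore: each element of the 1d structured array already holds one whole row.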
umbe1987
    Your very first step should be to stop using 32 bit Python and use 64 bit Python instead. This will unlock the rest of the memory on your machine. – John Zwinck Dec 01 '16 at 12:09
  • Did you test this on a small file? The `iter_func` produces a stream of floats, without any grouping by line. I doubt if `from_iter` can handle a compound dtype. – hpaulj Dec 01 '16 at 15:37
  • @JohnZwinck Indeed. The 64-bit version of Python let me process the whole file. Thanks. – umbe1987 Dec 06 '16 at 16:00

1 Answer


With a big enough input file, any code, however streamlined, can hit a memory error.

With all floats, each row of your 7-column array occupies 56 bytes; with the mixed dtype, 40. Not exactly a big change. If it previously hit the memory error 1/3 of the way through the file, it will now (in theory) hit it about 1/2 of the way through.
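Those per-row sizes can be checked directly with `np.dtype.itemsize` (a quick sketch; `np.int32` is spelled out here because the width of `np.int` depends on the platform, and 32 bits matches the asker's Windows build):

```python
import numpy as np

# All-float layout: 7 x float64 = 56 bytes per row
all_floats = np.dtype([('', np.float64)] * 7)

# Mixed layout: 3 x float64 + 4 x int32 = 24 + 16 = 40 bytes per row
mixed = np.dtype([('', np.float64)] * 3 + [('', np.int32)] * 4)

print(all_floats.itemsize)  # 56
print(mixed.itemsize)       # 40
```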

iter_func reads the file and feeds out a steady stream of floats (its own dtype). It does not return the floats grouped by line. It records the row length, which is used at the end to reshape the 1d array.

fromiter can handle a compound dtype, but only if you feed it appropriate sized tuples.

In [342]: np.fromiter([(1,2),(3,4),(5,6)],dtype=np.dtype('i,i'))
Out[342]: 
array([(1, 2), (3, 4), (5, 6)], 
      dtype=[('f0', '<i4'), ('f1', '<i4')])

In [343]: np.fromiter([1,2,3,4],dtype=np.dtype('i,i'))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-343-d0fc5f822886> in <module>()
----> 1 np.fromiter([1,2,3,4],dtype=np.dtype('i,i'))

TypeError: a bytes-like object is required, not 'int'

Changing iter_func to something like this might work (not tested):

def iter_func():
    dtypes=[float,float,float,int,int,int,int]
    with open(filename, 'r') as infile:
        for _ in range(skiprows):
            next(infile)
        for line in infile:
            line = line.rstrip().split(delimiter)
            values = [t(v) for t,v in zip(dtypes, line)]
            yield tuple(values)
arr = np.fromiter(iter_func(), dtype=[('',np.float),('',np.float),('',np.float),('',np.int),('',np.int),('',np.int),('',np.int)])
hpaulj
  • Thanks, I will try this out. However, I installed a 64-bit version of Python and made it through to the end of the process (although it took a huge amount of time!). I think this would be a good way to lower the computing time consistently, as in your previous answer (which I linked in my own question). When I've verified that this is the case and that your new code works, I will be happy to accept your answer. – umbe1987 Dec 01 '16 at 17:14
  • what am I supposed to use as the `dtype` argument in `arr = np.fromiter(iter_func, dtype=...)`? I am not quite sure I understand it... Sorry for asking, but from the [function page](https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromiter.html) it's not that clear. – umbe1987 Dec 06 '16 at 12:11
  • I defined the `dtypes` list in `iter_func` to match the `dtype` parameter you tried in your question - the one with a mix of floats and ints. – hpaulj Dec 06 '16 at 12:29
  • Sorry to bother you again. What is not clear to me is what I am supposed to provide as the `dtype` parameter in the `fromiter` function (instead of the `...`)? From your comments and answer, it seems I should provide `[float,float,float,int,int,int,int]`, which is however already defined within `iter_func`, but I am probably wrong. I will add the improved code to the body of my question so that you can easily see what I am doing. – umbe1987 Dec 06 '16 at 14:20
  • See my edit - I copied the dtype from your original post. – hpaulj Dec 06 '16 at 14:31
  • Now I receive an error from `data.reshape` (`ValueError: total size of new array must be unchanged`). I am calling it as in the **EDIT1** part of my question. Would removing `data = data.reshape((-1, iter_loadtxt.rowlength))` make sense? – umbe1987 Dec 06 '16 at 15:34
  • Nevermind, I removed the reshape now, as (I think) it's not needed anymore. I updated my question with the working code and accepted your answer. Thanks for the help! – umbe1987 Dec 06 '16 at 15:47
  • Just a little update. I tested the proposed code and it's much faster than the original one. Thanks again for the efforts! – umbe1987 Dec 23 '16 at 11:17