
My starting point was a problem with NumPy's loadtxt function:

X = np.loadtxt(filename, delimiter=",")

which gave a MemoryError in np.loadtxt(..). I googled the error and came across this question on Stack Overflow, which offered the following solution:

import numpy as np

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            # Skip any header rows, then yield the values one at a time
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        # Remember the length of the last line so the caller can reshape
        iter_loadtxt.rowlength = len(line)

    # Build a flat 1-D array from the generator, then fold it into rows
    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

data = iter_loadtxt('your_file.ext')

So I tried that, but then encountered the following error message:

> data = data.reshape((-1, iter_loadtxt.rowlength))
> ValueError: total size of new array must be unchanged
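
If I understand this error correctly, it just means that the total number of values np.fromiter collected is not a multiple of rowlength, so NumPy cannot fold the flat array into equally sized rows. A minimal sketch with made-up numbers that reproduces the same error:

import numpy as np

# 10 values cannot be folded into rows of 3: 10 % 3 != 0
data = np.arange(10)
data = data.reshape((-1, 3))  # ValueError: total size of new array must be unchanged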

Then I tried to count the number of rows and the maximum number of columns and feed those into the code, using the fragments below, which I partly took from another question and partly wrote myself:

# Count the rows and the widest row up front
num_rows = 0
max_cols = 0
with open(filename, 'r') as infile:
    for line in infile:
        num_rows += 1
        tmp = line.split(",")
        if len(tmp) > max_cols:
            max_cols = len(tmp)

def iter_func():
    # unchanged from above

data = np.fromiter(iter_func(), dtype=dtype, count=num_rows)
data = data.reshape((num_rows, max_cols))

But this still gave the same error message, even though I thought that should have solved it. On the other hand, I'm not sure whether I'm calling data.reshape(..) correctly.

I commented out the line where data.reshape(..) is called to see what would happen. That gave this error message:

> ValueError: need more than 1 value to unpack

It happened at the first point where something is done with X, the variable this whole problem is about.
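
My guess about what happens there is this: without the reshape, data stays one-dimensional, so X.shape is a 1-tuple and any code that unpacks it into two values fails. A minimal sketch with made-up values (I don't know the exact line in the open source code that does the unpacking):

import numpy as np

X = np.fromiter((float(v) for v in ["0.194", "-0.007"]), dtype=float)
print X.shape   # (2,) -- a 1-tuple, because X is 1-D
n, m = X.shape  # ValueError: need more than 1 value to unpack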

I know this code can work on the input files I have, because I've seen it used with them. But I can't figure out why it doesn't work for me. My best guess is that because I'm using a 32-bit Python version (on a 64-bit Windows machine), something goes wrong with memory that doesn't happen on other computers. But I'm not sure. For reference: I have 8 GB of RAM for a 1.2 GB file, yet my RAM is not full according to Task Manager.

What I want is to use this open source code, which needs to read and parse the given file just like np.loadtxt(filename, delimiter=","), but within the memory I actually have available. I know the code originally worked on Mac OS X and Linux, to be more precise: "Mac OS X 10.9.2 and Linux (version 2.6.18-194.26.1.el5 (brewbuilder@norob.fnal.gov) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) 1 SMP Tue Nov 9 12:46:16 EST 2010)."

I don't care that much about speed. Each file contains roughly 200,000 lines with either 100 or 1,000 items per line (depending on the input file: one kind always has 100, the other always 1,000). An item is a floating-point number with 3 decimals, possibly negative, and items are separated by a comma and a space, e.g. [..] 0.194, -0.007, 0.004, 0.243, [..], so 100 or 1,000 of those items per line (of which you see 4 here), for roughly 200,000 lines.
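
To illustrate, here is a made-up line in that format and how it parses; note that splitting on ',' alone leaves a leading space on every item after the first, but float() accepts that:

line = "0.194, -0.007, 0.004, 0.243\n"
items = line.rstrip().split(",")
print [float(item) for item in items]  # [0.194, -0.007, 0.004, 0.243]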

I'm using Python 2.7 because the open source code needs that.

Does anyone have a solution for this? Thanks in advance.

  • You're using `reshape` in the correct manner. However the `count=num_rows` is a bug and causes an error with the second code. It should be the total number of values, so `count=num_rows*num_cols` (see the sketch below these comments). –  Oct 28 '14 at 09:37
  • Thanks, that worked out. But now I get a MemoryError in `X = np.asfortranarray(X, [..])`. The dtype is the same as I'm using in `iter_loadtxt`. It stops at about 700 MB of memory, which is not all the RAM my machine can give to the process... – Renzeee Oct 29 '14 at 10:29
  • OK. I think the next problem is that you're running out of *contiguous* memory addresses, but to be honest I'm not quite sure how that works with a 32-bit process on a 64-bit OS. At any rate, the easiest fix would be to get a 64-bit Python. Maybe ['WinPython'](http://winpython.sourceforge.net/) is nice, because it's 64-bit and portable. –  Oct 29 '14 at 10:54
  • The problem is that NumPy only works on a 32-bit Python installation on Windows, so I really can't use a 64-bit Python version. Otherwise I would've installed that right away. I assume there is no other solution for running out of contiguous memory addresses? If not, I'll look into a different solution. – Renzeee Oct 29 '14 at 12:17
  • As far as I know 64-bit NumPy is working fine on Windows, but correct me if I'm wrong `:)` –  Oct 29 '14 at 14:24
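
Putting the fix from the first comment together with the counting code from the question, the corrected calls would presumably look like this (a sketch that reuses iter_func, num_rows and max_cols as defined in the question):

# count must be the TOTAL number of values, not the number of rows
data = np.fromiter(iter_func(), dtype=dtype, count=num_rows * max_cols)
data = data.reshape((num_rows, max_cols))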

1 Answer


On Windows, a 32-bit process is only given a maximum of 2 GB (or GiB?) of memory, and numpy.loadtxt is notorious for being heavy on memory, so that explains why the first approach doesn't work.
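
As a quick sanity check (my suggestion, not something from the question), you can confirm whether your Python build itself is 32-bit or 64-bit from the pointer size:

import struct

# Prints 32 on a 32-bit Python build, 64 on a 64-bit build
print struct.calcsize("P") * 8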

The second problem you appear to be facing is that the particular file you are testing with has missing data, i.e. not all lines have the same number of values. This is easy to check, for example:

import numpy as np

filename = 'your_file.ext'  # the data file in question
delimiter = ','

# Count how many values each line holds
numbers_per_line = []
with open(filename) as infile:
    for line in infile:
        numbers_per_line.append(line.count(delimiter) + 1)

# Check where there might be problems
numbers_per_line = np.array(numbers_per_line)
expected_number = 100
print np.where(numbers_per_line != expected_number)
  • I didn't know that first point, thanks. I used your code and it says that every line has the expected_number, so that can't be the problem, unfortunately. – Renzeee Oct 27 '14 at 18:58