
I've got a 250 MB CSV file that I need to read, with ~7000 rows and ~9000 columns. Each row represents an image, and each column is a pixel (greyscale value 0-255).

I started with a simple np.loadtxt("data/training_nohead.csv", delimiter=",") but this gave me a MemoryError. I thought this was strange, since I'm running 64-bit Python with 8 gigs of memory installed and it died after using only around 512 MB.

I've since tried SEVERAL other tactics, including:

  1. import fileinput and read one line at a time, appending them to an array
  2. np.fromstring after reading in the entire file
  3. np.genfromtxt
  4. Manual parsing of the file (since all data is integers, this was fairly easy to code)

Every method gave me the same result: MemoryError around 512 MB. Wondering if there was something special about 512 MB, I created a simple test program which filled up memory until Python crashed:

str = " " * 511000000 # Start at 511 MB
while 1:
    str = str + " " * 1000 # Add 1 KB at a time

Doing this didn't crash until around 1 gig. I also, just for fun, tried: str = " " * 2048000000 (fill 2 gigs) - this ran without a hitch. Filled the RAM and never complained. So the issue isn't the total amount of RAM I can allocate, but seems to be how many TIMES I can allocate memory...

I googled around fruitlessly until I found this post: Python out of memory on large CSV file (numpy)

I copied the code from the answer exactly:

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        iter_loadtxt.rowlength = len(line)

    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

Calling iter_loadtxt("data/training_nohead.csv") gave a slightly different error this time:

MemoryError: cannot allocate array memory

Googling this error, I found only one (not so helpful) post: Memory error (MemoryError) when creating a boolean NumPy array (Python)

As I'm running Python 2.7, this was not my issue. Any help would be appreciated.

stevendesu
  • have you tried to do it in two passes? 1st pass: calculate array dimensions `nxm` and dtypes. 2nd pass: put data in a *preallocated* array (specifying `dtype`, `count` for `np.fromiter()` might be enough) – jfs Dec 06 '13 at 13:37
  • I actually already know the array dimensions (7049 x 9146), so I'll try this. EDIT - 9246, not 9146. Immaterial, though – stevendesu Dec 06 '13 at 13:40
  • It worked! Please post as an answer so I can accept it. Bonus points: it ran in like, 8 seconds! I was extremely surprised. – stevendesu Dec 06 '13 at 13:45
  • you can [post your own answer](http://stackoverflow.com/help/self-answer). You've done all the work. Please add a small code example that avoids the MemoryError. – jfs Dec 06 '13 at 13:59
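
A minimal sketch of the two-pass np.fromiter approach suggested in the comments above (the function name iter_loadtxt_2pass is hypothetical, not from the thread, and it assumes every row has the same number of columns):

import numpy as np

def iter_loadtxt_2pass(filename, delimiter=',', dtype=float):
    # First pass: work out the array dimensions up front.
    with open(filename) as f:
        ncols = len(f.readline().rstrip().split(delimiter))
        nrows = 1 + sum(1 for _ in f)

    # Second pass: stream values into np.fromiter with an explicit count,
    # so numpy can allocate the whole array once instead of growing it.
    def values():
        with open(filename) as f:
            for line in f:
                for item in line.rstrip().split(delimiter):
                    yield dtype(item)

    data = np.fromiter(values(), dtype=dtype, count=nrows * ncols)
    return data.reshape((nrows, ncols))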

2 Answers


With some help from @J.F. Sebastian I developed the following answer:

import numpy as np

# Preallocate the full array, then fill it one row at a time.
train = np.empty([7049, 9246])
row = 0
for line in open("data/training_nohead.csv"):
    train[row] = np.fromstring(line, sep=",")
    row += 1

Of course, this answer assumes prior knowledge of the number of rows and columns. Should you not have this information beforehand, the number of rows will always take a while to calculate, as you have to read the entire file and count the \n characters. Something like this will suffice:

num_rows = 0
for line in open("data/training_nohead.csv"):
    num_rows += 1

For the number of columns: if every row has the same number of columns, you can just count the first row; otherwise you need to keep track of the maximum.

num_rows = 0
max_cols = 0
for line in open("data/training_nohead.csv"):
    num_rows += 1
    tmp = line.split(",")
    if len(tmp) > max_cols:
        max_cols = len(tmp)

This solution works best for numerical data, as a string containing a comma could really complicate things.

stevendesu
  • note: `for i, line in enumerate(file)` and `ncols = max(ncols, len(line.split(',')))` are builtins you could use here. In general (not in this case), a csv row may span several physical lines, i.e. the correct way to enumerate csv rows is `for i, row in enumerate(csv.reader(file))`. – jfs Dec 06 '13 at 18:02
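
For completeness, a small sketch of that counting pass using enumerate and csv.reader, as suggested in the comment above (hypothetical code, reusing the same file path as the answer):

import csv

num_rows = 0
max_cols = 0
with open("data/training_nohead.csv") as f:
    for i, row in enumerate(csv.reader(f), start=1):
        # csv.reader handles quoted fields that contain commas.
        num_rows = i
        max_cols = max(max_cols, len(row))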

This is an old discussion, but it might still help people today.

I think I know why str = str + " " * 1000 fails faster than str = " " * 2048000000.

When running the first one, I believe the OS needs to allocate the new str + " " * 1000 object in memory, and only after that does it bind the name str to it. Until the name str is rebound to the new object, the old one cannot be freed. This means the OS has to hold roughly two copies of the str object at the same time, so it can only manage about 1 gig instead of 2 gigs. I believe the following code will get the same maximum memory out of your OS as a single allocation:

str = " " * 511000000
while(1):
    l = len(str)
    str = " "
    str = " " * (len + 1000)

Feel free to correct me if I am wrong.

Shaq