
I use openpyxl and numpy to read large Excel files. The code looks like this:

import numpy as np
from openpyxl import load_workbook

W = load_workbook(fname, read_only=True)
p = W.worksheets[0]
a = []
m = p.max_row
n = p.max_column
for row in p.iter_rows():
    for k in row:
        a.append(k.value)

# convert the flat list a to an m*n matrix (for example 5*6)
aa = np.resize(a, [m, n])

For medium-sized files (a 4 MB Excel file with 16,000 rows and 50 columns) this works fine. However, for a large file (21 MB, with 100,000 rows and 50 columns), numpy fails with a MemoryError, even though there is still free memory on the system.

1- How can I find out how much memory the resize to a matrix took?

2- How can I increase the available memory (something like the heap size in Java)?

The traceback is:

Traceback (most recent call last):
  File "exread.py", line 26, in <module>
    aa= np.resize(a, [m, n])
  File "C:\Users\m.naderan\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\core\fromnumeric.py", line 1121, in resize
    a = ravel(a)
  File "C:\Users\m.naderan\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\core\fromnumeric.py", line 1468, in ravel
    return asanyarray(a).ravel(order=order)
  File "C:\Users\m.naderan\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\core\numeric.py", line 583, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
MemoryError
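One way to cut the peak memory of this pattern (a sketch, not the asker's original code) is to preallocate a single m*n object array and fill it cell by cell, instead of first building a flat Python list of every cell value and then reshaping it. The snippet below builds a tiny sample workbook so it is self-contained; `sample.xlsx` is a placeholder filename.

```python
# Memory-friendlier sketch (assumes openpyxl and numpy are installed):
# preallocate one m x n object array and fill it directly, avoiding the
# intermediate flat list that np.resize would have to copy.
import numpy as np
from openpyxl import Workbook, load_workbook

# Tiny sample file so the sketch is self-contained.
wb = Workbook()
wb.active.append([1, 2, 3])
wb.active.append([4, 5, 6])
wb.save("sample.xlsx")

ws = load_workbook("sample.xlsx", read_only=True).worksheets[0]
m, n = ws.max_row, ws.max_column
aa = np.empty((m, n), dtype=object)   # one allocation, no intermediate list
for i, row in enumerate(ws.iter_rows()):
    for j, cell in enumerate(row):
        aa[i, j] = cell.value
```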
mahmood
  • What's the point to this use of `resize`? It is rarely needed. Use np.array to make an array from a list. – hpaulj May 10 '17 at 08:14
  • Well I thought it will neatly resize the array to m*n. So please let me know how to use `np.array`. – mahmood May 10 '17 at 08:26
  • 1
    You might just use `aa = np.array([[i.value for i in j] for j in p.rows])` instead of everything except the first two lines. – Nyps May 10 '17 at 08:29
  • @Nyps: Sorry, I didn't understand. Can you please explain in an answer what that statement does exactly? – mahmood May 10 '17 at 09:24
  • This reads in all values in your worksheet directly to the numpy array. It loops through all elements in all rows. Just try replacing the code. – Nyps May 10 '17 at 09:26
  • 1
    Like this: `W = load_workbook(fname, read_only = True)`, `p = W.worksheets[0]`, `aa = np.array([[i.value for i in j] for j in p.rows])` as your whole code. – Nyps May 10 '17 at 09:33
  • Still I get memory error :( – mahmood May 10 '17 at 11:23

2 Answers

  1. The most pragmatic way to check the memory usage of an operation is probably to watch `top`/`htop` while it runs, if you're on a Unix system. Someone did post a Python solution to this 5 years ago.

  2. I may be wrong on this, but I think there is no restriction on the memory usage of a Python process by default, i.e. a MemoryError really only happens when there genuinely isn't enough available memory on your entire system (I've run scripts consuming over 50 GB of memory before).
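For question 1 specifically, the standard library's `tracemalloc` module can report the peak Python-level allocation around the failing step. This is a sketch using a stand-in list (sized like the medium file from the question) rather than real workbook data:

```python
# Measure peak memory of the resize step with the stdlib tracemalloc module.
import tracemalloc
import numpy as np

tracemalloc.start()
a = list(range(16000 * 50))        # stand-in for the flat cell-value list
aa = np.resize(a, (16000, 50))     # the step that raised MemoryError
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak allocation: {peak / 1024 ** 2:.1f} MiB")
```

Note that `tracemalloc` only tracks allocations made through Python's allocator, so it sees the list and the numpy array buffer but not every byte the process maps.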

Ken Wei

The openpyxl documentation contains a clear example of how to convert a worksheet to a DataFrame. This is easier to use and more reliable than your own code, so why not use it?
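A minimal sketch of that documented pattern (assuming pandas is installed): openpyxl exposes `ws.values`, which `pd.DataFrame` can consume directly. The example builds a tiny workbook first so it is self-contained; `example.xlsx` is a placeholder name.

```python
# openpyxl worksheet -> pandas DataFrame, per the documented pattern.
import pandas as pd
from openpyxl import Workbook, load_workbook

# Tiny workbook just so the example runs on its own.
wb = Workbook()
wb.active.append(["a", "b"])
wb.active.append([1, 2])
wb.save("example.xlsx")

ws = load_workbook("example.xlsx", read_only=True).worksheets[0]
df = pd.DataFrame(ws.values)  # each sheet row becomes a DataFrame row
```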

Charlie Clark
  • 18,477
  • 4
  • 49
  • 55
  • The examples are not clear. Why do we need a DataFrame, why should we use it, and what happens if we don't? I just want to read the cells row by row. – mahmood May 10 '17 at 09:28