
I use openpyxl and numpy to read large Excel files. The code looks like this:

import numpy as np
from openpyxl import load_workbook

W = load_workbook(fname, read_only=True)
p = W.worksheets[0]
a = []
m = p.max_row
n = p.max_column
for row in p.iter_rows():
    for k in row:
        a.append(k.value)

# convert the flat list a to an m*n matrix (for example 5*6)
aa = np.resize(a, [m, n])

For medium-sized files (a 4 MB Excel file with 16,000 rows and 50 columns) this works fine. However, for a large file (21 MB, with 100,000 rows and 50 columns), numpy fails with a MemoryError, even though there is still free memory on the system.

1- How can I find out how much memory the resize to a matrix took?

2- How can I increase the available memory (something like the heap size in Java)?

The traceback is:

Traceback (most recent call last):
  File "exread.py", line 26, in <module>
    aa= np.resize(a, [m, n])
  File "C:\Users\m.naderan\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\core\fromnumeric.py", line 1121, in resize
    a = ravel(a)
  File "C:\Users\m.naderan\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\core\fromnumeric.py", line 1468, in ravel
    return asanyarray(a).ravel(order=order)
  File "C:\Users\m.naderan\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\core\numeric.py", line 583, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
MemoryError
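One way to cut the peak memory of this pattern (a sketch, not the asker's original code) is to preallocate a single m*n object array and fill it cell by cell, instead of first building a flat Python list of every cell value and then reshaping it. The snippet below builds a tiny sample workbook so it is self-contained; `sample.xlsx` is a placeholder filename.

```python
# Memory-friendlier sketch (assumes openpyxl and numpy are installed):
# preallocate one m x n object array and fill it directly, avoiding the
# intermediate flat list that np.resize would have to copy.
import numpy as np
from openpyxl import Workbook, load_workbook

# Tiny sample file so the sketch is self-contained.
wb = Workbook()
wb.active.append([1, 2, 3])
wb.active.append([4, 5, 6])
wb.save("sample.xlsx")

ws = load_workbook("sample.xlsx", read_only=True).worksheets[0]
m, n = ws.max_row, ws.max_column
aa = np.empty((m, n), dtype=object)   # one allocation, no intermediate list
for i, row in enumerate(ws.iter_rows()):
    for j, cell in enumerate(row):
        aa[i, j] = cell.value
```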
mahmood
  • What's the point to this use of `resize`? It is rarely needed. Use np.array to make an array from a list. – hpaulj May 10 '17 at 08:14
  • Well I thought it will neatly resize the array to m*n. So please let me know how to use `np.array`. – mahmood May 10 '17 at 08:26
  • 1
    You might just use `aa = np.array([[i.value for i in j] for j in p.rows])` instead of everything except the first two lines. – Nyps May 10 '17 at 08:29
  • @Nyps: Sorry, I didn't understand. Can you please explain in an answer what that statement does exactly? – mahmood May 10 '17 at 09:24
  • This reads in all values in your worksheet directly to the numpy array. It loops through all elements in all rows. Just try replacing the code. – Nyps May 10 '17 at 09:26
  • 1
    Like this: `W = load_workbook(fname, read_only = True)`, `p = W.worksheets[0]`, `aa = np.array([[i.value for i in j] for j in p.rows])` as your whole code. – Nyps May 10 '17 at 09:33
  • Still I get memory error :( – mahmood May 10 '17 at 11:23

2 Answers

  1. The most pragmatic way to check the memory usage of an operation is probably to watch `top`/`htop` while it runs, if you're on a Unix system. Someone did post a Python solution to this 5 years ago.

  2. I may be wrong on this, but I think there is no restriction on the memory usage of a Python process by default, i.e. a MemoryError really only happens when there genuinely isn't enough available memory on your entire system (I've run scripts consuming over 50 GB of memory before).
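For question 1 specifically, the standard library's `tracemalloc` module can report the peak Python-level allocation around the failing step. This is a sketch using a stand-in list (sized like the medium file from the question) rather than real workbook data:

```python
# Measure peak memory of the resize step with the stdlib tracemalloc module.
import tracemalloc
import numpy as np

tracemalloc.start()
a = list(range(16000 * 50))        # stand-in for the flat cell-value list
aa = np.resize(a, (16000, 50))     # the step that raised MemoryError
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak allocation: {peak / 1024 ** 2:.1f} MiB")
```

Note that `tracemalloc` only tracks allocations made through Python's allocator, so it sees the list and the numpy array buffer but not every byte the process maps.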

Ken Wei

The openpyxl documentation contains a clear example of how to convert a worksheet to a DataFrame. This is easier to use and more reliable than your own code, so why not use it?
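A minimal sketch of that documented pattern (assuming pandas is installed): openpyxl exposes `ws.values`, which `pd.DataFrame` can consume directly. The example builds a tiny workbook first so it is self-contained; `example.xlsx` is a placeholder name.

```python
# openpyxl worksheet -> pandas DataFrame, per the documented pattern.
import pandas as pd
from openpyxl import Workbook, load_workbook

# Tiny workbook just so the example runs on its own.
wb = Workbook()
wb.active.append(["a", "b"])
wb.active.append([1, 2])
wb.save("example.xlsx")

ws = load_workbook("example.xlsx", read_only=True).worksheets[0]
df = pd.DataFrame(ws.values)  # each sheet row becomes a DataFrame row
```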

Charlie Clark
  • 18,477
  • 4
  • 49
  • 55
  • The examples are not clear. Why do we need a DataFrame, why should we use it, and what happens if we don't? I just want to read the cells row by row. – mahmood May 10 '17 at 09:28