I've got a 250 MB CSV file I need to read, with ~7000 rows and ~9000 columns. Each row represents an image, and each column is a pixel (greyscale value 0-255).
I started with a simple np.loadtxt("data/training_nohead.csv",delimiter=",")
but this gave me a memory error. I thought this was strange since I'm running 64-bit Python with 8 gigs of memory installed and it died after using only around 512 MB.
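For scale, here is a rough estimate of how large the parsed array would be in memory. It assumes the approximate 7000 x 9000 shape given above and np.loadtxt's default float dtype (float64, 8 bytes per value). The uint8 figure assumes each 0-255 pixel value is stored in a single byte.

```python
import numpy as np

rows, cols = 7000, 9000  # approximate shape from the description above

# np.loadtxt defaults to dtype=float (float64, 8 bytes per value).
float64_bytes = rows * cols * np.dtype(np.float64).itemsize
# Greyscale values 0-255 fit in one unsigned byte.
uint8_bytes = rows * cols * np.dtype(np.uint8).itemsize

print("float64: %.0f MB" % (float64_bytes / 1e6))  # 504 MB
print("uint8:   %.0f MB" % (uint8_bytes / 1e6))    # 63 MB
```

If these estimates hold, the finished float64 array alone is roughly the size at which the errors appear, which may or may not be a coincidence.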
I've since tried SEVERAL other tactics, including:
- import fileinput and read one line at a time, appending them to an array
- np.fromstring after reading in the entire file
- np.genfromtxt
- Manual parsing of the file (since all data is integers, this was fairly easy to code)
Every method gave me the same result: a MemoryError at around 512 MB. Wondering if there was something special about 512 MB, I created a simple test program which filled up memory until Python crashed:
str = " " * 511000000 # Start at 511 MB
while 1:
    str = str + " " * 1000 # Add 1 KB at a time
Doing this didn't crash until around 1 gig. I also, just for fun, tried: str = " " * 2048000000
(fill 2 gigs) - this ran without a hitch. Filled the RAM and never complained. So the issue isn't the total amount of RAM I can allocate, but seems to be how many TIMES I can allocate memory...
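Since the ceiling in these tests is far below 8 GB, it may be worth confirming that the interpreter actually running the script is a 64-bit build. A minimal check using only the standard library could look like this (the variable names are illustrative):

```python
import platform
import struct
import sys

# Size of a C pointer in bits: 32 on a 32-bit interpreter, 64 on a 64-bit one.
pointer_bits = struct.calcsize("P") * 8
# sys.maxsize is 2**31 - 1 on 32-bit builds and 2**63 - 1 on typical 64-bit builds.
is_64bit = sys.maxsize > 2**32

print("pointer size: %d bits" % pointer_bits)
print("64-bit interpreter: %s" % is_64bit)
print("platform.architecture(): %s" % (platform.architecture(),))
```

A 64-bit operating system can still run a 32-bit Python, so this reports on the interpreter itself rather than on the machine.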
I Googled around fruitlessly until I found this post: Python out of memory on large CSV file (numpy)
I copied the code from the answer exactly:
def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        iter_loadtxt.rowlength = len(line)

    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data
Calling iter_loadtxt("data/training_nohead.csv")
gave a slightly different error this time:
MemoryError: cannot allocate array memory
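One variation that might be worth trying is to allocate the result once, as a uint8 array, and fill it row by row instead of growing it or building a float array. The sketch below is not the code from the linked answer. `load_pixels` is a hypothetical helper, and it assumes the file has no header, is not empty, and contains only integers in the range 0-255. Whether it avoids the MemoryError on this machine is untested.

```python
import numpy as np


def load_pixels(filename, delimiter=",", dtype=np.uint8):
    """Hypothetical helper: read an all-integer CSV into a preallocated array."""
    # First pass: count rows and columns without keeping the file in memory.
    with open(filename, "r") as infile:
        first = infile.readline()
        n_cols = len(first.rstrip().split(delimiter))
        n_rows = 1 + sum(1 for line in infile if line.strip())

    # Allocate the result once; uint8 uses 1 byte per value.
    data = np.empty((n_rows, n_cols), dtype=dtype)

    # Second pass: parse one line at a time directly into its row.
    with open(filename, "r") as infile:
        row = 0
        for line in infile:
            if not line.strip():
                continue  # skip blank lines
            data[row] = [int(v) for v in line.rstrip().split(delimiter)]
            row += 1
    return data
```

It would be called the same way as the function above, for example load_pixels("data/training_nohead.csv"). Reading the file twice trades some speed for a single allocation of known size.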
Googling this error I only found one, not so helpful, post: Memory error (MemoryError) when creating a boolean NumPy array (Python)
As I'm running Python 2.7, this was not my issue. Any help would be appreciated.