
Here is the download address; the file name is 'kosarak':

http://fimi.uantwerpen.be/data/

My parsing code is:

parsedDat = [line.split() for line in open('kosarak.dat').readlines()]

I need this data as a whole to run some method on it, so reading it line by line and doing the operation on each line separately does not fit my use case.

The file is only 30 MB, and my computer has at least 10 GB of memory free and 30+ GB of hard drive space, so I guess there shouldn't be any resource problem.

FYI: My Python version is 2.7 and I am running Python inside Spyder. My OS is Windows 10.

PS: You don't need to use my parsing code/method to do the job; as long as you can get the data from the file into my Python environment, that would be perfect.

cloudscomputes
  • Every process is allocated some memory for it to carry out its operations, and if it uses more than that it gives a memory error. The overall RAM has nothing to do with it. Is your system Windows or Linux? Since you're running it with Spyder, the memory used by Spyder will also come into the picture. – Apurva Singh Sep 25 '19 at 04:22
  • Have a look at https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python for some methods for handling large data files one chunk at a time - this is probably a good workaround for your problem – PeptideWitch Sep 25 '19 at 04:24
  • Possible duplicate of [Lazy Method for Reading Big File in Python?](https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python) – PeptideWitch Sep 25 '19 at 04:25
  • Python's memory consumption is not very efficient. On my machine, one Python integer takes at least 28 bytes of memory; an empty string takes 49 bytes. Thus, your 30MB file will balloon up in memory, depending on how many fields you have per line. Depending on what kind of data you have in your file, `numpy` (or `pandas`) can help with more efficient memory storage (a sketch follows these comments). – Amadan Sep 25 '19 at 04:25
  • @Apurva Singh, I am on Win10; updated it in the question – cloudscomputes Sep 25 '19 at 04:27
  • 1
    @PeptideWitch Ok, I will take a look at that question see if some answer works for me – cloudscomputes Sep 25 '19 at 04:28
  • @Amadan, Thanks for your info; I am going to use pandas and numpy to deal with it. – cloudscomputes Sep 25 '19 at 04:29
  • 1
    Are you sure you installed a 64 bit version of Python? 10 GB of memory won't do much for you if you're running a 32 bit version of Python (with user virtual address space limited to 2 GB, and Spyder might be fragmenting it badly). Also, you would (roughly) halve your peak memory usage by removing the `.readlines()`; the file object itself is (lazily) iterable, calling `.readlines()` forces it to eagerly slurp the whole file into memory, only to iterate it, then throw it away. May as well iterate lazily and only store the split lines, not the split lines and the original lines at once. – ShadowRanger Sep 25 '19 at 04:29
  • 1
    Note that reading in a 30 MB file, by itself, should not be an issue, even if you slurp the whole thing and store two fragmented copies of the data, but if Spyder is running your script in its own process, and is itself a 32 bit process with meaningful memory usage, your script might just be the straw that broke the camel's back. – ShadowRanger Sep 25 '19 at 04:31
  • @ShadowRanger FYI: the python is 64 bit – cloudscomputes Sep 25 '19 at 04:37
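
To check which build is actually running (per the 32-bit/64-bit discussion above), `import struct; print(struct.calcsize('P') * 8)` prints 64 on a 64-bit Python. Below is a minimal sketch of the `numpy` suggestion from the comments, which also drops `.readlines()` and iterates the file lazily; it assumes every token in 'kosarak.dat' is an integer item ID, and the variable name mirrors the question's `parsedDat`.

import numpy as np

# Sketch: iterate the file lazily (no .readlines()) and store each line's
# items as a compact int32 array rather than a list of Python strings.
# Assumes every token on a line is an integer item ID.
with open('kosarak.dat') as f:
    parsedDat = [np.array(line.split(), dtype=np.int32) for line in f]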

1 Answer


Perhaps this may help.

with open('kosarak.dat', 'r') as f:  # Or 'rb' for binary data.
    parsed_data = [line.split() for line in f]

The difference is that your approach reads all of the lines in the file at once and then processes each one (effectively requiring 2x the memory: once for the raw file data and once again for the parsed data, all of which must be held in memory at the same time), whereas this approach reads the file line by line and only needs the memory for the resulting `parsed_data`.

In addition, your method did not explicitly close the file (although you may just not have shown that portion of your code). This method uses a context manager (`with expression [as variable]:`), which will close the file object automatically once the `with` block terminates, even if an error occurs. See PEP 343.
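
For reference, the practical effect of the context manager here is roughly equivalent to an explicit try/finally (a sketch, not the exact expansion PEP 343 specifies):

f = open('kosarak.dat', 'r')
try:
    parsed_data = [line.split() for line in f]
finally:
    f.close()  # runs even if an exception is raised inside the try block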

Alexander
  • 2
    I think you forgot to actually loop? `for line in f: parsed_data.append(line.split())` would be more performant and Pythonic than manual calls to `.readline()` in any event, while still sticking to lazy iteration. Or just keep it a list comprehension, `parsed_data = [line.split() for line in f]` – ShadowRanger Sep 25 '19 at 04:50
  • Thanks! But this only gives me one line of the data – cloudscomputes Sep 25 '19 at 04:51
  • I don't know why, but `with open('kosarak.dat', 'r') as f: for line in f: parsed_data.append(line.split())` works for my case – cloudscomputes Sep 25 '19 at 04:54
  • @cloudscomputes As discussed above, it is much more efficient with respect to memory to read the file line by line, processing the data as you go (although it may be slower than `.readlines()`). You could even output the processed data to a secondary file which would then take virtually no memory to run. – Alexander Sep 25 '19 at 04:58
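
A minimal sketch of the "process as you go and write to a secondary file" idea from the last comment; the output file name `kosarak_out.dat` and the per-line operation are placeholders:

with open('kosarak.dat', 'r') as src, open('kosarak_out.dat', 'w') as dst:
    for line in src:
        items = line.split()
        # ... apply the per-line operation to the items here ...
        dst.write(' '.join(items) + '\n')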