
A similar question was asked before, but quite a while ago. I am trying to open a very large file (20GB) so that I can process its contents.

I am using:

read_path = '../text/'
time = 3600
data = open(read_path+'genomes'+str(time)).read().replace(',','\n').replace('\n','')

This works fine when I choose a smaller file in the same directory (genomes1000), but when I change time to the value matching the larger file, I get an error.

The exact error message is:

Tempo:analytics scottjg$ python genomeplot.py 
Traceback (most recent call last):
  File "genomeplot.py", line 27, in <module>
    data = open(read_path+'genomes'+str(time)).read().replace(',','\n').replace('\n','')
OSError: [Errno 22] Invalid argument
Thoughts?
cancerconnector
  • So what's the error? – Remi Guan Jan 15 '16 at 16:15
  • Do you mean Errno 22 like in [this question](http://stackoverflow.com/questions/15598160/ioerror-errno-22-invalid-mode-r-or-filename-c-python27-test-txt)? It seems to indicate that the file doesn't exist, check that the path is actually accurate. – SuperBiasedMan Jan 15 '16 at 16:15
  • Yes, just like in that question, but the path is accurate! I know because if I just change 'time' in the above code to one associated with a smaller file, it works fine. – cancerconnector Jan 15 '16 at 16:17
  • I get: Tempo:analytics scottjg$ python genomeplot.py Traceback (most recent call last): File "genomeplot.py", line 27, in <module> data = open(read_path+'genomes'+str(time)).read().replace(',','\n').replace('\n','') OSError: [Errno 22] Invalid argument – cancerconnector Jan 15 '16 at 16:18
  • Edit your question to add the exact error you're getting. – Vincent Savard Jan 15 '16 at 16:18
  • Are you really trying to read a 20GB file into memory all at once? – Robert Jacobs Jan 15 '16 at 16:20
  • Yes, I am... is this the issue? – cancerconnector Jan 15 '16 at 16:23
  • Can you print `read_path+'genomes'+str(time)` and make sure that the file exists? – SuperBiasedMan Jan 15 '16 at 16:29
  • Yes, it prints fine, and I get the same error. Like I said, I can access other, smaller files in the same directory (genomes1000, for example). Output: ../text/genomes3600 Traceback (most recent call last): File "genomeplot.py", line 28, in <module> data = open(read_path+'genomes'+str(time)).read().replace(',','\n').replace('\n','') OSError: [Errno 22] Invalid argument – cancerconnector Jan 15 '16 at 16:31
  • I suggest you rewrite your question to "how can I process a very large file in Python?", you will likely get more views and good answers. – sleblanc Jan 15 '16 at 16:34
  • To read 20GB into memory you need at least that much RAM. You also do two replaces, and each one makes a copy of the string, so you could end up needing something like 60GB of RAM, not to mention the difficulty of handling such a monstrous string... – Copperfield Jan 15 '16 at 16:34
  • You might want to try memory mapped IO. Be careful since a change in memory could change the file as well. http://pythoncentral.io/memory-mapped-mmap-file-support-in-python/ – Robert Jacobs Jan 15 '16 at 17:14
  • Getting back to the OSError which is likely happening in the `open` call, how about printing out `repr(read_path+'genomes'+str(time))` right before the `open` to make sure there isn't something odd going on. Names that are too long or that have control chars like `\n` in them raise errors (different on different systems). – tdelaney Jan 15 '16 at 17:14

1 Answer


Your code reads the entire contents of the file into memory:

open(read_path+'genomes'+str(time)).read()

I suspect that you do not have enough memory available to accommodate this, and that is probably the reason for the failure. Wouldn't it be better to process the file line by line, calling readline in a loop instead?
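As a rough illustration of incremental processing, here is a minimal sketch that reads the file in bounded pieces instead of all at once. It reads fixed-size chunks rather than lines, on the assumption that a comma-separated file may contain very long lines. The function names and the default chunk_size are hypothetical, not taken from the question. Keeping each read small may also sidestep limits that some platforms place on the size of a single read call, which is one possible, unconfirmed explanation for seeing Errno 22 rather than a memory error.

```python
from collections import Counter


def iter_cleaned_chunks(path, chunk_size=64 * 1024 * 1024):
    """Yield the file's text piece by piece, with commas and newlines removed.

    chunk_size is a hypothetical default (about 64 million characters per read);
    tune it to the memory that is actually available.
    """
    with open(path) as handle:
        while True:
            chunk = handle.read(chunk_size)  # reads at most chunk_size characters
            if not chunk:  # an empty string signals end of file
                break
            # Same net effect as .replace(',', '\n').replace('\n', '') in the
            # question: both commas and newlines are dropped.
            yield chunk.replace(',', '').replace('\n', '')


def count_symbols(path, chunk_size=64 * 1024 * 1024):
    """Example consumer: tally characters without holding the whole file."""
    counts = Counter()
    for piece in iter_cleaned_chunks(path, chunk_size):
        counts.update(piece)  # Counter.update on a string counts its characters
    return counts
```

With the path from the question, the loop would look like `for piece in iter_cleaned_chunks('../text/genomes3600'): ...`. Only one chunk is held in memory at a time, so whatever is done with each piece should also avoid accumulating the full 20GB.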

Joppe
  • I thought this might be the issue. I've been lazy up to now because my files have been smaller. – cancerconnector Jan 15 '16 at 16:44
  • If the path matches the file, I would say that this is the problem, even though I would not expect to see Errno 22 in connection with it. – Joppe Jan 15 '16 at 20:49