
I have quite a few text files of roughly 4 GB each, and when I read one of them into memory at once in Python I get a MemoryError (although, looking at PC performance, memory usage doesn't even come close to the maximum). When I iterate through the file line by line instead, the script becomes much, much slower. Does anyone have a solution for reading such large files quickly, or for increasing the memory limit in Python?

Thanks.

Coryza
  • Take a look at mmap: http://docs.python.org/2/library/mmap.html – Jayanth Koushik Feb 27 '14 at 07:41
  • What problem are you trying to solve? Are you reading in the files in order to aggregate some data, or to do text comparisons? A little more information on why you are reading in such large files would be useful. – Christian Witts Feb 27 '14 at 08:37
  • *"the max memory"* - are you talking about RAM or virtual memory? Are you using 32-bit or 64-bit Python? Maximum size of a virtual address on 32-bit is 0xffffffff - 4GB, and on Windows only half of that is available to the user's code in a process address space. So on 32-bit you only have a max. of 2GB, *regardless of how much RAM you have*. Do you really need all that data in memory at the same time? – cdarke Feb 27 '14 at 08:51
  • I do indeed use 32-bit Python. I'm trying to do data analysis over text files consisting of mapping data of RNA-Seq against the human reference genome. This means that each file consists of roughly 4,000,000 sequence lines with tab-separated information in each line. I just want to do some basic analysis (count some things, etc.). Iterating over each line takes 3 minutes per file, and that's just step 1 of my analysis :(. – Coryza Feb 27 '14 at 09:19
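
Following up on the mmap suggestion in the first comment, here is a minimal sketch of iterating line by line over a memory-mapped file. The file name is a placeholder, and note that on 32-bit Python a 4 GB mapping may not fit in the process address space:

import mmap

# Hypothetical file name; mmap requires the file to be opened in binary mode
with open("mapping_data.txt", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # read-only mapping
    try:
        for line in iter(mm.readline, b""):  # mm.readline() returns b"" at end of file
            pass  # process one line at a time here
    finally:
        mm.close()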

1 Answer


If you are reading a large file and then storing its lines in a list, you are effectively doubling the required memory.

One common source is lines = input.readlines(), which reads the whole file into a list at once. If that is the source of the problem, you can replace it with this:

for item in input:      # the file object yields one line at a time
    function(item)      # process each line as it is read

This iterates over the file one line at a time instead of loading everything into memory first.

Also consider using the csv module if your text file is a CSV (it also handles tab-separated data).
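
For example, a minimal sketch that counts records in a tab-separated file without holding the whole file in memory (the file name is a placeholder; your delimiter may differ):

import csv

count = 0
with open("mapping_data.txt") as f:          # hypothetical file name
    reader = csv.reader(f, delimiter="\t")   # tab-separated columns, read lazily
    for row in reader:                       # one row at a time
        count += 1
print(count)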

(source)

philshem
  • I think you will need to add some code and input to totally diagnose the issue. Also, consider posting some timing results for what you have tried. – philshem Feb 27 '14 at 08:31
  • Each file takes roughly 3 minutes to process (when iterating over lines), with roughly 4 million lines per text file. – Coryza Feb 27 '14 at 09:20
  • How do you read lines without iterating over lines? If you are using Linux, you might find it faster to parse with grep or awk first, into smaller files, and then read those files as needed. – philshem Feb 27 '14 at 09:33