Python load 2GB of text file to memory

Question

In Python 2.7, when I load all data from a text file of 2.5GB into memory for quicker processing like this:

>>> f = open('dump.xml','r')
>>> dump = f.read()

I got the following error:

Python(62813) malloc: *** mmap(size=140521659486208) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError

Why did Python try to allocate 140521659486208 bytes of memory for 2563749237 bytes of data? How do I fix the code so that it loads all the bytes?

I have around 3GB of RAM free. The file is a Wiktionary XML dump.

Why don't you parse the XML linearly without loading the source into memory first? — Alfe, Jun 22 '12 at 15:17
I tried it and it took me very long. And since I have lots of RAM, I want to load everything into RAM to make it faster. — pckben, Jun 22 '12 at 15:21
I'm very new to Python. Now I'm reading up on that. What puzzled me the most was why Python would want to allocate something like 50K bytes of memory per byte of data. What kind of data structure is behind the f.read() line? — pckben, Jun 22 '12 at 15:33
I don't think that reading the XML source into memory to parse it afterwards will speed up anything. Parse it while reading it. That's faster. — Alfe, Jun 22 '12 at 15:58
BTW, use a SAX parser instead of a DOM parser. Otherwise you will again need huge amounts of RAM. — Alfe, Jun 22 '12 at 15:59
@Alfe I tried XML parsing by accumulating lines until I had a complete article entry, and it took forever. So I switched to this solution, which uses regex finditer() to match the pattern linearly and extract the data I'm interested in. Now it takes only a little more than a minute. — pckben, Jun 22 '12 at 16:27
I suggest using xml ElementTree over using SAX. No one but you will like your regex solution. — Asclepius, Jun 22 '12 at 16:34
[Others on SO](http://stackoverflow.com/questions/3364279/has-anyone-parsed-wiktionary) have looked at this very XML file to parse. You may want to look at that. — , Jun 22 '12 at 17:53
Thanks, I'm aware of those links. I was just trying to separate the page title and content from the dump for doing experiments later. ElementTree was the one that took a long time before I switched over to the regex solution. It is working fine now. — pckben, Jun 23 '12 at 02:09

score 13 · Accepted Answer · edited Jun 22 '12 at 16:08

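If you use mmap, you can map the entire file into your process's address space and work with it as if it were in memory. The OS pages the data in as it is accessed, so no single huge allocation is needed up front.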

import mmap

with open('dump.xml', 'rb') as f:
  # A length of 0 maps the ENTIRE file.
  # prot=mmap.PROT_READ (Unix only) asks for a read-only mapping,
  # matching the read-only mode the file was opened in.
  m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)

  # Proceed with your code here. The file is already mapped, so there is
  # no need to read() it into one big string before calling "readline".
  data = m.readline()
  while data:
    # Do stuff
    data = m.readline()
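
As a rough sketch of how the mapping can be combined with the regex `finditer()` approach mentioned in the question's comments: the `re` module can search an `mmap` object directly, so the dump never has to be copied into one giant string. The `<title>` pattern and the `extract_titles` name are assumptions made for illustration and are not taken from the real dump or from the asker's code. The sketch targets Python 3 and uses `access=mmap.ACCESS_READ`, the portable way to ask for a read-only mapping.

```python
import mmap
import re
from contextlib import closing

# Hypothetical pattern: adjust it to the markup actually present in the dump.
TITLE_RE = re.compile(br"<title>(.*?)</title>", re.DOTALL)


def extract_titles(path):
    """Return the body of every <title> element in the file, as bytes."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)) as m:
            # finditer() scans the mapping linearly; only the matched
            # slices are copied into new bytes objects.
            return [match.group(1) for match in TITLE_RE.finditer(m)]
```

Calling `extract_titles('dump.xml')` would return a list of byte strings, which can be decoded with `.decode('utf-8')` if the dump is UTF-8 encoded.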

edited Jun 22 '12 at 16:08

Thomas Orozco

answered Jun 22 '12 at 15:34

joslinm

I got `mmap.error: [Errno 13] Permission denied` for the line with `m = mmap.mmap(..)`. How do I fix it? – pckben Jun 22 '12 at 15:42
@pckben That's because the file is open in read-only mode and mmap will try to map read-write: add `prot=mmap.PROT_READ` in your `mmap.mmap` call, and you'll be fine. – Thomas Orozco Jun 22 '12 at 15:49
Nice answer if you really have to read the contents of a file completely. In this case I don't think that this is the best solution for pckben's situation. – Alfe Jun 22 '12 at 16:01
mmap is memory mapping of a file. Accessing the memory at the allocated place will access the file instead. Whether the OS buffers the whole file beforehand or only on access is part of the configuration ;-) – Alfe Jun 22 '12 at 16:03
@pckben Using `open('myfile', 'rb')` opens the file in read-only mode, but `mmap` will try to map it read-write, which causes the error. – Thomas Orozco Jun 22 '12 at 16:10

score -1 · Answer 2 · answered Jun 22 '12 at 15:24

Based on some quick googling, I came across this forum post that seems to address the issue you appear to be having. Assuming from the error message that you are running Mac or Linux, you could try enabling or forcing garbage collection with gc.enable() or gc.collect(), as suggested in the forum post.

answered Jun 22 '12 at 15:24

acrognale

My code is only the 2 lines given above for loading data into memory; there are no other live objects for garbage collection to free. – pckben Jun 22 '12 at 15:30
