
I need to iterate over the words in a file. The file could be very big (over 1 TB) and the lines could be very long (there may be just one line), so I don't want to load the whole file, or even a whole line, into memory. The words are English, so they are of reasonable size.

I have some code that works, but may explode if a line is too long (longer than ~3 GB on my machine).

import re

def words(file):
    for line in file:
        words = re.split(r"\W+", line)
        for w in words:
            word = w.lower()
            if word != '':
                yield word

Can you tell me how I can, simply, rewrite this iterator function so that it does not hold more than needed in memory?

ctrl-alt-delor
  • See [How to read a single character at a time from a file in Python](http://stackoverflow.com/questions/2988211/how-to-read-a-single-character-at-a-time-from-a-file-in-python). If the word is the entire line you may still have a problem, but maybe in that case you can drop it in advance. – Tom Ron Mar 04 '14 at 11:03
  • Related: [How to read records terminated by custom separator from file in python?](http://stackoverflow.com/q/19600475/222914) – Janne Karila Mar 04 '14 at 12:59

1 Answer


Don't read line by line, read in buffered chunks instead:

import re

def words(file, buffersize=2048):
    buffer = ''
    # Read fixed-size chunks until file.read() returns '' (end of file).
    for chunk in iter(lambda: file.read(buffersize), ''):
        words = re.split(r"\W+", buffer + chunk)
        buffer = words.pop()  # partial word at end of chunk, or empty string
        for word in (w.lower() for w in words if w):
            yield word

    if buffer:
        yield buffer.lower()

I'm using the callable-and-sentinel version of the `iter()` function to handle reading from the file until `file.read()` returns an empty string; I prefer this form over a `while` loop.
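For comparison, here is a minimal sketch of the equivalent `while` loop, priming it with one read and repeating the read at the end of each iteration (the name `words_while` is just for illustration):

import re

def words_while(file, buffersize=2048):
    # Hypothetical while-loop variant of the generator above; same behaviour.
    buffer = ''
    chunk = file.read(buffersize)  # prime the loop
    while chunk:
        words = re.split(r"\W+", buffer + chunk)
        buffer = words.pop()  # partial word at end of chunk, or empty string
        for word in (w.lower() for w in words if w):
            yield word
        chunk = file.read(buffersize)  # read the next chunk
    if buffer:
        yield buffer.lower()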

If you are using Python 3.3 or newer, you can use generator delegation here:

def words(file, buffersize=2048):
    buffer = ''
    for chunk in iter(lambda: file.read(buffersize), ''):
        words = re.split(r"\W+", buffer + chunk)
        buffer = words.pop()  # partial word at end of chunk, or empty string
        yield from (w.lower() for w in words if w)

    if buffer:
        yield buffer.lower()

A demo using a small buffer size, to show it all works as expected:

>>> from io import StringIO
>>> demo = StringIO('''\
... Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque in nulla nec mi laoreet tempus non id nisl. Aliquam dictum justo ut volutpat cursus. Proin dictum nunc eu dictum pulvinar. Vestibulum elementum urna sapien, non commodo felis faucibus id. Curabitur
... ''')
>>> for word in words(demo, 32):
...     print(word)
... 
lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
pellentesque
in
nulla
nec
mi
laoreet
tempus
non
id
nisl
aliquam
dictum
justo
ut
volutpat
cursus
proin
dictum
nunc
eu
dictum
pulvinar
vestibulum
elementum
urna
sapien
non
commodo
felis
faucibus
id
curabitur
Martijn Pieters
  • After reviewing the code, I ran it, but changed the chunk size to `3`. No problems found. – ctrl-alt-delor Mar 04 '14 at 11:20
  • The only thing I don't like is that it is not super elegant. Can someone else produce something simpler? – ctrl-alt-delor Mar 04 '14 at 11:21
  • @richard: In Python 3.3 and up, you can use `yield from map(str.lower, filter(None, words))` instead of the `for word in words:` loop (see the sketch after these comments), and you could use a `while chunk:` loop with two `chunk = file.read(buffersize)` calls (one to prime the `while` loop, one in the loop) instead of the `iter()`-with-callable-and-sentinel construct I used, but unless you go for the `mmap` solution, this is pretty much it. – Martijn Pieters Mar 04 '14 at 11:25
  • Note: the mmap answer will not work, as it tries to map a potentially huge file (>1 TB) into a potentially small process address space (~3 GB). (The answer has now been deleted.) – ctrl-alt-delor Mar 04 '14 at 12:38
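For completeness, a minimal sketch of the `map()`/`filter()` variant Martijn Pieters mentions in the comment above (Python 3.3+ only; the behaviour should be identical to the generator-delegation version):

import re

def words(file, buffersize=2048):
    buffer = ''
    for chunk in iter(lambda: file.read(buffersize), ''):
        words = re.split(r"\W+", buffer + chunk)
        buffer = words.pop()  # partial word at end of chunk, or empty string
        # filter(None, ...) drops empty strings; map(str.lower, ...) lowercases lazily.
        yield from map(str.lower, filter(None, words))

    if buffer:
        yield buffer.lower()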