
I have some pretty large text files (>2 GB) that I would like to process word by word. The files are space-delimited text files with no line breaks (all of the words are on a single line). I want to take each word, test whether it is a dictionary word (using enchant), and if so, write it to a new file.

This is my code right now:

import enchant

d = enchant.Dict("en_US")  # enchant dictionary; language tag assumed

with open('big_file_of_words', 'r') as in_file:
    with open('output_file', 'w') as out_file:
        words = in_file.read().split(' ')
        for word in words:
            if d.check(word):
                out_file.write("%s " % word)

I looked at lazy method for reading big file in python, which suggests using yield to read in chunks, but I am concerned that chunks of a predetermined size will split words in the middle. Basically, I want chunks to be as close to the specified size as possible while splitting only on spaces. Any suggestions?
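
For illustration, here is a rough sketch of that idea (the helper name and chunk size are made up for this example, and `d` is the enchant dictionary from the code above): read a fixed-size chunk, then keep reading one character at a time until the next space, so the chunk boundary never falls inside a word.

def read_space_aligned_chunks(f, chunk_size=10240):
    """Yield chunks of roughly chunk_size characters that always end on a space."""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            return
        # Extend the chunk character by character until it ends on a space
        # (or the file runs out), so no word is split down the middle.
        while not chunk.endswith(' '):
            c = f.read(1)
            if not c:
                break
            chunk += c
        yield chunk

with open('big_file_of_words', 'r') as in_file, open('output_file', 'w') as out_file:
    for chunk in read_space_aligned_chunks(in_file):
        for word in chunk.split():
            if d.check(word):
                out_file.write("%s " % word)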

Elissa

3 Answers


Combine the last word of one chunk with the first of the next:

def read_words(filename):
    last = ""
    with open(filename) as inp:
        while True:
            buf = inp.read(10240)
            if not buf:
                break
            words = (last+buf).split()
            last = words.pop()
            for word in words:
                yield word
        yield last

with open('output.txt', 'w') as output:
    for word in read_words('input.txt'):
        if d.check(word):  # d is the enchant dictionary from the question
            output.write("%s " % word)
Daniel
  • what is the **'last'** used for? Can we **go without it**? – pambda Mar 11 '17 at 05:33
  • This has a bug in it if a space is at the end of `buf` after a read and then a split word at the end of the buf on the following read. If testfile.txt: `This is a file with some text in it and no newlines. Some text has punctuation. Test this file out with multiple buffer sizes.` and the read size is 20: `inp.read(20)` You'll see `withsome` `andno` and `punctuation.Test` concatenated. – MRSharff Jun 06 '19 at 19:42
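
One possible way to avoid the boundary issue described in the comment above (this is a sketch, not part of the original answer; the chunk size is made a parameter for convenience) is to carry a fragment over only when the chunk was actually cut in the middle of a word:

def read_words(filename, chunk_size=10240):
    last = ""
    with open(filename) as inp:
        while True:
            buf = inp.read(chunk_size)
            if not buf:
                break
            words = (last + buf).split()
            # Only treat the final word as a fragment if the chunk was cut
            # mid-word; a chunk ending in whitespace means the last word
            # is already complete.
            if buf[-1].isspace():
                last = ""
            else:
                last = words.pop()
            for word in words:
                yield word
        if last:
            yield last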

You might be able to get away with something similar to an answer on the question you've linked to, combining `re` and `mmap`, e.g.:

import mmap
import re

with open('big_file_of_words', 'rb') as in_file, open('output_file', 'w') as out_file:
    mf = mmap.mmap(in_file.fileno(), 0, access=mmap.ACCESS_READ)
    for match in re.finditer(rb'\w+', mf):
        word = match.group().decode()
        # do something with word
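
Tying that back to the question, the "do something" part could become something like this (a sketch; it assumes `d` is the enchant dictionary from the question and `out_file` is the output file opened above):

    for match in re.finditer(rb'\w+', mf):
        word = match.group().decode()
        if d.check(word):
            out_file.write("%s " % word)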
Jon Clements

Fortunately, Petr Viktorin has already written code for us. The following code reads a chunk from a file, then yields each contained word. If a word spans chunks, that's handled as well.

def words(input_file):
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A word was found
            yield word
        else:
            # A word was not found; read a chunk of data from the file
            next_chunk = input_file.read(1000)
            if next_chunk:
                # Add the chunk to our line
                line = word + next_chunk
            else:
                # No more data; yield the last word and return
                yield word.rstrip('\n')
                return
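
Used with the question's setup, that generator could be driven like this (a sketch, not from the original answer; it assumes `d` is the enchant dictionary from the question):

with open('big_file_of_words', 'r') as in_file, open('output_file', 'w') as out_file:
    for word in words(in_file):
        if d.check(word):
            out_file.write("%s " % word)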

https://stackoverflow.com/a/7745406/143880

johntellsall