
As the title says, how do I read from stdin or from a file word by word, rather than line by line? I'm dealing with very large files, not guaranteed to have any newlines, so I'd rather not load all of a file into memory. So the standard solution of:

for line in sys.stdin:
    for word in line.split():
        foo(word)

won't work, since line may be too large. Even if it's not too large, it's still inefficient since I don't need the entire line at once. I essentially just need to look at a single word at a time, and then forget it and move on to the next one, until EOF.

EDIT: The suggested "duplicate" is not really a duplicate. It mentions reading line by line and THEN splitting it into words, something I explicitly said I wanted to avoid.

Peatherfed
  • Is this file you're looking at really a text file? Despite what you say regarding the memory efficiency of reading line by line, if a file has huge amounts of text with no line break, it'd be hard even for a human to read. I recommend using `read()` with a set size, but be aware that the chunk you choose may straddle word boundaries, so you may have to buffer several chunks of data before you find whitespace to split on. – Ben Y Jun 16 '21 at 18:22
  • You'll also have to buffer any additional bytes read that belong to the next word. The only other alternative is to read one byte at a time, accumulating non-breaking bytes in `word` until you reach a breaking point. – chepner Jun 16 '21 at 18:23
  • You cannot read "word by word", as the input package has no "word" concept. You can read by line, or by character. Details are easy enough to look up in the package documentation (likely `file`). Stack Overflow is not intended to replace existing tutorials and documentation. – Prune Jun 16 '21 at 18:24
  • There is already an answer here https://stackoverflow.com/a/42732391/11255447 – aparpara Jun 16 '21 at 18:26
  • @aparpara that's expressly what the OP said they did not want to do, so I don't believe that's the appropriate answer. – Ben Y Jun 16 '21 at 18:30
  • @BenY that's expressly what the OP said they *did* want to do: read a file word by word, not line by line. – aparpara Jun 16 '21 at 18:32
  • Perhaps the best you can do is read a chunk at a time, like `while True:` / `chunk = sys.stdin.read(65536)`. The only complication is that you have to deal with a word that splits across the chunk boundary. That's not hard to do. – Tim Roberts Jun 16 '21 at 18:33
  • I recommend you look at the answer you suggested, it is exactly the same as what the OP posted they didn't want to do. – Ben Y Jun 16 '21 at 18:33
  • No, sorry, I recommend *you* look at the answer I suggested. There is nothing about lines there. Is it so hard to notice the line `buf = f.read(10240)`? – aparpara Jun 16 '21 at 18:37
  • You can read files in entirety, by line, or by byte(s). If you need to read really large files in python and cannot depend on newlines then you must read by bytes. – Amos Baker Jun 16 '21 at 19:07
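
The chunk-and-buffer approach the comments converge on can be sketched as a generator. This is only a sketch: `words_from` is a name of my choosing, and the in-memory `StringIO` stands in for `sys.stdin` or any open file object.

```python
import io

def words_from(stream, chunk_size=65536):
    """Yield whitespace-separated words from a stream without loading it all.

    Memory use is bounded by one chunk plus one partial word.
    """
    buff = ''
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # EOF: whatever is buffered is the last word
            if buff:
                yield buff
            return
        parts = (buff + chunk).split()
        if not chunk[-1].isspace():
            # The chunk ends mid-word; hold the fragment for the next read
            buff = parts.pop()
        else:
            buff = ''
        yield from parts

# Example with an in-memory stream standing in for sys.stdin:
demo = io.StringIO('one two  three\nfour')
print(list(words_from(demo, chunk_size=4)))  # ['one', 'two', 'three', 'four']
```

The tiny `chunk_size=4` in the example is just to exercise the boundary handling; in practice you would use something like 64 KiB.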

2 Answers


Here's a generator approach. It reads fixed-size chunks in a `while True:` loop, stopping at end of file, and only ever buffers the partial word that straddles a chunk boundary.

def read_by_word(filename, chunk_size=16):
    '''This generator function opens a file and reads it by word'''
    buff = ''  # Preserves a partial word from the previous chunk
    with open(filename) as fd:
        while True:
            chunk = fd.read(chunk_size)
            if not chunk:  # Empty means end of file
                if buff:  # Corner case -- file had no whitespace at end
                    # A big chunk size can leave several words (and
                    # spaces) in the buffer, so split once more here
                    yield from buff.split()
                break
            chunk = buff + chunk  # Prepend any previous partial read
            if chunk != chunk.rstrip():
                # Chunk ends with whitespace, but may still contain
                # several words, so yield each of them
                yield from chunk.split()
                buff = ''
            else:
                comp = chunk.split()
                # The last piece may be an incomplete word, so keep it
                # in the buffer for the next read
                yield from comp[:-1]
                buff = comp[-1]


for word in read_by_word('huge_file_with_few_newlines.txt'):
    print(word)
Ben Y

Here's a straightforward answer, which I'm posting in case anyone else goes looking and doesn't feel like wading through toxic replies:

word = ''
with open('filename', 'r') as f:
    while (c := f.read(1)):
        if c.isspace():
            if word:
                print(word) # Here you can do whatever you want e.g. append to list
            word = ''
        else:
            word += c
    if word:
        print(word) # Don't drop the last word if the file lacks trailing whitespace
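
To read from stdin as well as files, the same one-character loop can be wrapped as a generator over any text stream. This is a sketch of that idea; `words_one_char` is a name of my choosing.

```python
import io

def words_one_char(stream):
    """Yield words from a text stream, reading one character at a time."""
    word = ''
    while (c := stream.read(1)):
        if c.isspace():
            if word:
                yield word
            word = ''
        else:
            word += c
    if word:  # Emit the final word if the stream doesn't end in whitespace
        yield word

# An in-memory stream stands in for sys.stdin or an open file:
print(list(words_one_char(io.StringIO('a bb  ccc'))))  # ['a', 'bb', 'ccc']
```

Passing `sys.stdin` or an open file object instead of the `StringIO` works the same way.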

Edit: I will note that it would be faster to read larger byte-chunks at a time and detect word boundaries after the fact. Ben Y's answer has an (as of this edit) incomplete solution that might be of assistance. If performance (rather than memory, which was my issue) is a problem, that should probably be your approach. The code will be quite a bit longer, however.

Peatherfed
  • This will be rather slow since you're doing 1-byte reads from the file. A more optimal solution would read in chunks of e.g. 1024 bytes and split from there into words. – AKX Jun 16 '21 at 19:12
  • In fact, Ben Y's solution just below does that. – AKX Jun 16 '21 at 19:12
  • Will this code go through an endless loop at eof? – Ben Y Jun 16 '21 at 19:17
  • No, the while loop controls it. – Peatherfed Jun 16 '21 at 19:18
  • Ah, missed the walrus, which I have yet to use in my code. – Ben Y Jun 16 '21 at 19:26
  • Gotta love the walrus! I've edited my answer with a reference to yours. As I mentioned in my original question, memory was the issue for me, not performance. As such, I'll probably accept my answer rather than yours, since it's significantly shorter and easier to understand. However, I included the reference specifically for people who are looking for better performance. I hope you won't take offence. – Peatherfed Jun 16 '21 at 19:28
  • I'm still supporting Python 3.6 in some cases, so unfortunately, I have yet to embrace the walrus. It sure would make a lot of my code shorter. No offense taken, and glad you found your own answer. I guess I automatically start doing O(N) analysis and such, so maybe it's a bad habit. – Ben Y Jun 16 '21 at 19:29