14

I need to iterate through the words of a large file that consists of a single, very long line. I am aware of methods for iterating through a file line by line, but they are not applicable in my case because of its single-line structure.

Any alternatives?

pavlogiannis

8 Answers

8

It really depends on your definition of word. But try this:

f = open("your-filename-here").read()
for word in f.split():
    # do something with word
    print(word)

This will use whitespace characters as word boundaries.

Of course, remember to properly open and close the file; this is just a quick example.
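As a minimal sketch of the "properly open and close" advice, the same whitespace split can be wrapped in a `with` block. The function name `iter_words_whole_file` and its `path` parameter are made up for illustration, and like the quick example it still reads the entire file into memory:

```python
def iter_words_whole_file(path):
    """Yield whitespace-separated words; reads the whole file into memory."""
    # The with-block closes the file even if an error occurs while reading.
    with open(path) as f:
        text = f.read()
    # split() with no arguments splits on any run of whitespace.
    for word in text.split():
        yield word
```

It would be used as `for word in iter_words_whole_file("your-filename-here"): ...`.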

Rosenthal
Andrea Spadaccini
6

Long long line? I assume the line is too big to reasonably fit in memory, so you want some kind of buffering.

First of all, this is a bad format; if you have any kind of control over the file, make it one word per line.

If not, use something like:

def words_from(input_file):
    # input_file is an already-open file object; the function name is illustrative
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A word was found
            yield word
        else:
            # A word was not found; read a chunk of data from file
            next_chunk = input_file.read(1000)
            if next_chunk:
                # Add the chunk to our line
                line = word + next_chunk
            else:
                # No more data; yield the last word and return
                yield word.rstrip('\n')
                return
Petr Viktorin
  • Keep in mind that this works fine when you want to write one word per line to a file, but not when you need each yielded item to be exactly one word. It fails when a chunk contains `dog\ncat`: it yields `dog\ncat`, not `dog` and then `cat`. When `dog\ncat` is printed it looks fine, but that is misleading. – siulkilulki Jun 26 '17 at 19:31
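One possible way to address the newline issue raised in the comment above is to split each chunk on any whitespace and carry a possibly incomplete trailing word over to the next read. This is only a sketch, not the answer author's code; the name `iter_words_any_whitespace` and the `chunk_size` parameter are assumptions made for illustration, and it expects a file object opened in text mode:

```python
def iter_words_any_whitespace(input_file, chunk_size=1000):
    """Yield words from a text-mode file object, reading chunk_size characters at a time."""
    leftover = ''  # possibly incomplete word carried over from the previous chunk
    while True:
        chunk = input_file.read(chunk_size)
        if not chunk:
            break  # end of file
        chunk = leftover + chunk
        words = chunk.split()  # splits on spaces, tabs, newlines, ...
        # If the chunk does not end in whitespace, its last word may
        # continue in the next chunk, so hold it back for now.
        if words and not chunk[-1].isspace():
            leftover = words.pop()
        else:
            leftover = ''
        for word in words:
            yield word
    # Whatever is left after the final read is the last word.
    if leftover:
        yield leftover
```

With this version, input containing `dog\ncat` yields `dog` and then `cat`, and memory use stays bounded by the chunk size plus the longest word.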
4

You really should consider using a generator:

def word_gen(file):
    for line in file:
        for word in line.split():
            yield word

with open('somefile') as f:
    # The generator does nothing until it is iterated over
    for word in word_gen(f):
        print(word)
laike9m
3

There are more efficient ways of doing this, but syntactically, this might be the shortest:

words = open('myfile').read().split()

If memory is a concern, you aren't going to want to do this because it will load the entire thing into memory, instead of iterating over it.
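If memory is the concern, one alternative worth sketching is to memory-map the file and let a regular expression walk over it, so the operating system pages data in on demand instead of Python holding the whole string. This is an illustration under assumptions rather than a drop-in replacement: the name `iter_words_mmap` is made up, words come back as `bytes` (decode them if you need `str`), and `mmap` raises `ValueError` on an empty file:

```python
import mmap
import re

def iter_words_mmap(path):
    """Yield whitespace-separated words (as bytes) from a memory-mapped file."""
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # re accepts bytes-like objects, so it can scan the mapping directly.
            for match in re.finditer(rb'\S+', mm):
                yield match.group()
```

A typical loop would be `for word in iter_words_mmap('myfile'): ...`.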

Donald Miner
1

I've answered a similar question before, but I have refined the method used in that answer and here is the updated version (copied from a recent answer):

Here is my totally functional approach which avoids having to read and split lines. It makes use of the itertools module:

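Note: the code below targets Python 3; on Python 2, replace map with itertools.imap.

import itertools

def readwords(mfile):
    # Read one character at a time until read() returns '' at EOF,
    # then group consecutive characters by whether they are whitespace.
    byte_stream = itertools.groupby(
        itertools.takewhile(bool,
            map(mfile.read,
                itertools.repeat(1))), str.isspace)

    # Keep only the non-whitespace groups, joined back into words.
    return ("".join(group) for pred, group in byte_stream if not pred)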

Sample usage:

>>> import sys
>>> for w in readwords(sys.stdin):
...     print (w)
... 
I really love this new method of reading words in python
I
really
love
this
new
method
of
reading
words
in
python
           
It's soo very Functional!
It's
soo
very
Functional!
>>>

I guess in your case, this would be the way to use the function:

with open('words.txt', 'r') as f:
    for word in readwords(f):
        print(word)
Community
smac89
0

What Donald Miner suggested looks good. Simple and short. I used the code below in something I wrote some time ago:

l = []
# The "rU" mode is no longer accepted by recent Python 3 versions; plain "r" is enough
f = open("filename.txt", "r")
for line in f:
    for word in line.split():
        l.append(word)
f.close()

This is a longer version of what Donald Miner suggested.

Vikas
0

Read in the line as normal, then split it on whitespace to break it down into words?

Something like:

word_list = loaded_string.split()
0

After reading the line you could do:

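# line is the string that was read; pattern is the substring to look for
pattern_len = len(pattern)
i = 0
while True:
    i = line.find(pattern, i)
    if i == -1:
        break
    print(line[i:i + pattern_len])  # or do whatever
    i += pattern_len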

Alex.

Arjor