3

Is there a well-hidden way to read tokens from a file or file-like object without reading entire lines? The application I immediately have (someone else's problem, not mine) is transposing a large matrix with a few very long rows, essentially performing an itertools.izip() over per-row iterators so that each zipped tuple picks out a single column. The idea is to not have the entire file in memory during iteration.

The rows are space-delimited ASCII decimal numbers.

The problem would be simple with Java's Scanner class, but I don't see anything in the Python Standard Library that appears to tokenize without having the whole input in a string.

For the record, I know how to write this on my own. I'm just wondering if there's a standard tool that I missed. Something FOSS/libre that can be EasyInstalled is good, too, but I don't see anything on PyPI either.

The full problem was to take the sample input:

"123 3 234234 -35434 112312 54 -439 99 0 42\n" +
"13 456 -78 910 333 -44 5555 6 8"

...and produce the output (as a generator, without reading all of the very long rows into memory at once):

[123, 13], [3, 456], [234234, -78], ...etc

As I said, it's essentially itertools.izip(iterator1, iterator2), pointing iterator1 at the start of the file, and iterator2 just past the newline to read the second row.
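
For concreteness, the wiring I have in mind looks roughly like this. It's only a sketch: generate_row_tokens stands in for whatever lazy tokenizer ends up being used, and second_row_offset is the byte position just past that newline.

from itertools import izip  # on Python 3, the built-in zip does the same job

def column_pairs(path, second_row_offset, generate_row_tokens):
    # generate_row_tokens(fileobj) is a placeholder: given a file positioned at
    # the start of a row, it should lazily yield that row's space-separated tokens.
    first = open(path)
    second = open(path)
    second.seek(second_row_offset)  # just past the newline that ends row one
    for a, b in izip(generate_row_tokens(first), generate_row_tokens(second)):
        yield [int(a), int(b)]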

Mike Housky

4 Answers

4

To read tokens from a file one by one, you could use the re module to generate tokens from a memory-mapped file:

#!/usr/bin/env python3
import re
import sys
from mmap import ACCESS_READ, mmap

def generate_tokens(filename, pattern):
    # map the file into memory and let re walk it lazily, match by match
    with open(filename) as f, mmap(f.fileno(), 0, access=ACCESS_READ) as mm:
        yield from re.finditer(pattern, mm)

# sum all integers in a file specified at the command line
# (the optional sign matters for inputs like the sample above)
print(sum(int(m.group()) for m in generate_tokens(sys.argv[1], br'-?\d+')))

It works even if the file doesn't fit in memory.
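
If you want per-row iterators for the transposition itself, a compiled pattern can restrict its scan to a region of the mmap via the pos/endpos arguments, without slicing (and therefore copying) it. This is only a sketch of that idea, and it assumes exactly one newline separates the two rows:

import re
from itertools import zip_longest
from mmap import ACCESS_READ, mmap

def column_pairs(filename):
    number = re.compile(br'-?\d+')  # signed decimal integers
    with open(filename) as f, mmap(f.fileno(), 0, access=ACCESS_READ) as mm:
        newline = mm.find(b'\n')                 # end of the first row
        row1 = number.finditer(mm, 0, newline)   # tokens of row one
        row2 = number.finditer(mm, newline + 1)  # tokens of row two
        for a, b in zip_longest(row1, row2):
            yield [int(m.group()) for m in (a, b) if m is not None]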

jfs
  • Looks interesting, and semi-portable. I don't care much for the "from" import hiding the module name, but that's a minor nit. – Mike Housky Nov 17 '13 at 18:49
  • @MikeHousky: It should work as is on Unix and Windows. Let me know if it is not so. – jfs Nov 19 '13 at 06:48
  • Three useful answers, all upvotes. I picked this one to accept because it called my attention to the mmap package, which I hadn't seen before. – Mike Housky Nov 21 '13 at 20:34
2

Here is a generator that processes a file one character at a time and yields tokens when whitespace is encountered.

def generate_tokens(path):
    with open(path, 'r') as fp:
        buf = []
        while True:
            ch = fp.read(1)
            if ch == '':
                break
            elif ch.isspace():
                if buf:
                    yield ''.join(buf)
                    buf = []
            else:
                buf.append(ch)

if __name__ == '__main__':
    for token in generate_tokens('input.txt'):
        print(token)

To be more generic, it looks like you might be able to use the re module as described at the link below. Just feed it input from a generator over your file to avoid reading the whole file at once; a rough sketch of that combination follows the link.

Python equivalent of ruby's StringScanner?
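
Here's what that might look like (my sketch, not from the linked answer): scan each chunk with a regex and hold back any token that touches the end of the chunk, since it might continue in the next read.

import re

def regex_tokens(fp, pattern=r'\S+', chunk_size=8192):
    token = re.compile(pattern)
    carry = ''   # possibly incomplete token left over from the previous chunk
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            if carry:
                yield carry
            return
        data = carry + chunk
        carry = ''
        matches = list(token.finditer(data))
        # A match ending exactly at the chunk boundary may be cut off,
        # so keep it for the next iteration instead of yielding it now.
        if matches and matches[-1].end() == len(data):
            carry = matches[-1].group()
            matches = matches[:-1]
        for m in matches:
            yield m.group()

To stop at the end of a single row you could match newlines as their own tokens (e.g. r'\S+|\n') and break when one shows up.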

FogleBird
  • Thanks for the code and the link. I hadn't thought of looking for a parallel to something in Ruby. A similar search for Java Scanner replacement gave only full-string solutions. – Mike Housky Nov 16 '13 at 17:54
  • accumulating input one byte at a time is [very slow](http://stackoverflow.com/a/3054831/4279) in Python – jfs Nov 16 '13 at 22:43
2

You can read the file in chunks with file.read(size). I would not recommend reading it one byte at a time, though, as that drastically hurts performance. The following snippet (not much tested, use at your own risk) reads the file in chunks and yields numbers. You'll have to read through the file first to determine each row's starting position, though (a sketch of that follows the example below).

def values_chunks(file_object, pos_from=0, chunk_size=32*1024):
    file_object.seek(pos_from)
    eol = False
    tail = ''
    while True:
        chunk = file_object.read(chunk_size)
        eof = not chunk                       # read() returns '' only at end of file
        raw_data = tail + chunk
        raw_data = raw_data.split('\n', 1)    # check for end of row
        if len(raw_data) > 1:
            eol = True
        raw_data = raw_data[0]                # keep only the current row's part
        raw_values = raw_data.split()
        if not (eol or eof) and raw_data and raw_data[-1] != ' ':
            tail = raw_values[-1]             # last token may be cut off mid-number
            raw_values = raw_values[:-1]
        else:
            tail = ''
        for value in raw_values:
            yield int(value)
        if eol or eof:                        # end of row or end of file
            break

>>> with open('test', 'wb') as test:
...     test.write(' '.join(map(str, range(10**5))))
...     test.write('\n')
...     test.write(' '.join(map(str, range(10**4))))
...
>>> values = list(values_chunks(open('test', 'rb')))
>>> len(values)
100000
>>> sum(values)
4999950000L
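
To find the row starting positions mentioned above, here is a minimal sketch in the same chunked style (my addition, assuming rows are separated by single '\n' characters):

def row_offsets(path, chunk_size=32*1024):
    # Scan the file once, chunk by chunk, recording the byte offset
    # immediately after every newline, i.e. where each row starts.
    offsets = [0]
    pos = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            start = 0
            while True:
                i = chunk.find(b'\n', start)
                if i == -1:
                    break
                offsets.append(pos + i + 1)
                start = i + 1
            pos += len(chunk)
    return offsets

Each offset can then be passed as pos_from to values_chunks and the resulting generators zipped together.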
alko
  • This is closer to what I'd write, except maybe to search specifically for whitespace rather than use `split` calls. If the "client" for this favor (not really a client if I'm not really getting paid, right? :^/ ) has a problem with megabytes, then the tokenization is probably worth minimizing too. Thanks for the code, and the different ideas. They could certainly be useful on a less-weird problem! – Mike Housky Nov 16 '13 at 18:08
  • you could look at how [`.readline(limit)`, `.readlines(hint)` are implemented](http://hg.python.org/cpython/file/097389413dd3/Lib/_pyio.py#l451). Just replace `'\n'` (newline) with `' '` (space). – jfs Nov 16 '13 at 22:38
0
# python, read token file
# Put the token on the first line of a token.txt file.

token = open("token.txt", "r").readline()  # I've opted to just save my token to a text file.
token = token.rstrip()
...

print(token)
slfan
  • Please always put your answer in context instead of just pasting code. See [here](https://stackoverflow.com/help/how-to-answer) for more details. – gehbiszumeis Jan 16 '20 at 06:27