Python Stream Extraction

Question

The standard library of many programming languages includes a "scanner API" to extract strings, numbers, or other objects from text input streams. (For example, Java includes the Scanner class, C++ includes istream, and C includes scanf).

What is the equivalent of this in Python?

Python has a stream interface, i.e. classes that inherit from io.IOBase. However, the Python TextIOBase stream interface only provides facilities for line-oriented input. After reading the documentation and searching on Google, I can't find something in the standard Python modules that would let me, for example, extract an integer from a text stream, or extract the next space-delimited word as a string. Are there any standard facilities to do this?

If the input is very dirty/arbitrary, I tend to use the `re` module; if the input is structured I prefer a parser library like simpleparse (EBNF is easier to maintain than regular expressions). — Paulo Scardine, Feb 16 '13 at 01:25
If you do have a use case in mind, perhaps it would be more productive to update your question instead of making a generic inquiry. — Paulo Scardine, Feb 16 '13 at 01:34
See http://stackoverflow.com/questions/2175080/sscanf-in-python for some other suggestions. — Ned Deily, Feb 16 '13 at 01:55
The `re` module can be used to simulate `sscanf` specifically - however it is not really equivalent to the `scanf` family of functions. The `scanf` family of functions are stream-oriented whereas `re` is *string* oriented. For example, `fscanf` will automatically read char-by-char from a file input stream, whereas `re` requires a fully formed string in advance. — Channel72, Feb 16 '13 at 02:00
@Channel72 Not unless you `mmap` the file first (thus treating it like a very large string without loading it into memory), and use `re.finditer` (maybe keeping track of the position, to pass to another finditer if need be, *or whatever*) — Jon Clements, Feb 16 '13 at 02:07
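
The `mmap` plus `re.finditer` idea from the last comment could look roughly like the sketch below. It is only an illustration under some assumptions: the sample file is a temporary file created just for the example, the tokens are taken to be whitespace-delimited ASCII, and because a memory map exposes bytes the pattern has to be a bytes pattern.

```python
import mmap
import os
import re
import tempfile

# Write a small sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b'1234 Hello World -13.48 -678 12.45')
    path = tmp.name

try:
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # A memory map holds bytes, so the pattern must be a bytes pattern.
            tokens = [m.group().decode('ascii')
                      for m in re.finditer(rb'\S+', mapped)]
finally:
    os.remove(path)

print(tokens)  # ['1234', 'Hello', 'World', '-13.48', '-678', '12.45']
```

This only works for real files (something with a file descriptor), so it does not help with pipes or interactive input.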

score 3 · Accepted Answer · answered Feb 16 '13 at 08:58

There is no equivalent of fscanf or Java's Scanner. The simplest solution is to require the user to provide newline-separated input instead of space-separated input; you can then read line by line and convert each line to the correct type.
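
A minimal sketch of that line-by-line approach, assuming one value per line. The helper name `read_typed_lines` and its parameters are made up for this illustration and are not part of the standard library:

```python
import io


def read_typed_lines(stream, types):
    """Read one line per entry in `types` and convert each line.

    `read_typed_lines` is a hypothetical helper used only for illustration.
    """
    values = []
    for convert in types:
        line = stream.readline()
        if not line:  # readline() returns '' at end of stream
            raise ValueError('Missing fields.')
        values.append(convert(line.strip()))
    return values


stream = io.StringIO('1234\nHello\n-13.48\n')
print(read_typed_lines(stream, [int, str, float]))  # [1234, 'Hello', -13.48]
```

The same function should work with `sys.stdin` or an open text file, since it only relies on `readline()`.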

If you want the user to provide more structured input, then you should probably create a parser for it. There are some nice parsing libraries for Python, for example pyparsing. There is also a scanf module, although its last update dates from 2008.

If you don't want external dependencies, you can use regexes to match the input sequences. Regexes do need a string to work on, but you can easily overcome this limitation by reading the stream in chunks. For example, something like this should work well most of the time:

import re


FORMATS_TYPES = {
    'd': int,
    'f': float,
    's': str,
}


FORMATS_REGEXES = {
    'd': re.compile(r'(?:\s|\b)*([+-]?\d+)(?:\s|\b)*'),
    'f': re.compile(r'(?:\s|\b)*([+-]?\d+\.?\d*)(?:\s|\b)*'),
    's': re.compile(r'\b(\w+)\b'),
}


FORMAT_FIELD_REGEX = re.compile(r'%(s|d|f)')


def scan_input(format_string, stream, max_size=float('+inf'), chunk_size=1024):
    """Scan an input stream and retrieve formatted input.

    Note: max_size is accepted but not enforced by this sketch.
    """

    chunk = ''
    format_fields = format_string.split()[::-1]
    while format_fields:
        fields = FORMAT_FIELD_REGEX.findall(format_fields.pop())
        if not chunk:
            chunk = _get_chunk(stream, chunk_size)

        for field in fields:
            field_regex = FORMATS_REGEXES[field]
            match = field_regex.search(chunk)
            # A match that touches the end of the chunk may be truncated,
            # so read more and search again until the match ends earlier
            # or the stream is exhausted.
            while match is None or match.end() >= len(chunk):
                more = _get_chunk(stream, chunk_size)
                if not more:
                    if match is None:
                        raise ValueError('Missing fields.')
                    break
                chunk += more
                match = field_regex.search(chunk)
            text = match.group(1)
            yield FORMATS_TYPES[field](text)
            chunk = chunk[match.end():]



def _get_chunk(stream, chunk_size):
    try:
        return stream.read(chunk_size)
    except EOFError:
        return ''

Example usage:

>>> from io import StringIO
>>> s = StringIO('1234 Hello World -13.48 -678 12.45')
>>> for data in scan_input('%d %s %s %f %d %f', s): print(repr(data))
...
1234
'Hello'
'World'
-13.48
-678
12.45

You'll probably have to extend this and test it properly, but it should give you some ideas.

score 1 · Answer 2 · answered Feb 16 '13 at 01:28

There is no direct equivalent (as far as I know). However, you can do pretty much the same thing with regular expressions (see the re module).

For instance:

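import re

string = 'width 42'  # sample input, for illustration only

# matching first integer (space delimited)
# (re.search scans the whole string; re.match only matches at the start)
re.search(r'\b(\d+)\b', string)

# matching first space delimited word
re.search(r'\b(\w+)\b', string)

# matching a word followed by an integer (space delimited)
re.search(r'\b(\w+)\s+(\d+)\b', string)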

It requires a little more work than the usual C-style scanner interface, but it is also very flexible and powerful. You will have to process stream I/O yourself though.
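
As a rough idea of what handling the stream I/O yourself might involve, here is a small sketch that reads a text stream in chunks and yields whitespace-delimited tokens, holding back a token that may be cut off at a chunk boundary. The name `iter_tokens` and the `chunk_size` parameter are made up for this example, and converting each token is left to the caller:

```python
import io


def iter_tokens(stream, chunk_size=1024):
    """Yield whitespace-delimited tokens from a text stream, chunk by chunk.

    `iter_tokens` is a hypothetical helper, not a standard-library function.
    """
    pending = ''
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # read() returns '' at end of stream
            break
        pending += chunk
        parts = pending.split()
        # If the buffer does not end in whitespace, the last token may
        # continue in the next chunk, so hold it back.
        if parts and not pending[-1].isspace():
            pending = parts.pop()
        else:
            pending = ''
        yield from parts
    if pending:
        yield pending


# A tiny chunk size forces tokens to straddle chunk boundaries.
tokens = iter_tokens(io.StringIO('1234 Hello -13.48'), chunk_size=4)
number = int(next(tokens))    # roughly what Scanner.nextInt() does in Java
word = next(tokens)           # roughly Scanner.next()
value = float(next(tokens))
print(number, word, value)    # 1234 Hello -13.48
```

Because it is a generator, the caller pulls one token at a time with `next()` and decides how to convert it, which is about as close to a scanner-style interface as this gets without a dedicated library.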
