8

Here is a quote from https://stackoverflow.com/users/893/greg-hewgill answer to Explain Python's slice notation.

Python is kind to the programmer if there are fewer items than you ask for. For example, if you ask for a[:-2] and a only contains one element, you get an empty list instead of an error. Sometimes you would prefer the error, so you have to be aware that this may happen.

So when the error is prefered, what is the Pythonic way to proceed ? Is there a more Pythonic way to rewrite this example ?

class ParseError(Exception):
    pass

def safe_slice(data, start, end):
    """0 <= start <= end is assumed"""
    r = data[start:end]
    if len(r) != end - start:
        raise IndexError
    return r

def lazy_parse(data):
    """extract (name, phone) from a data buffer.
    If the buffer could not be parsed, a ParseError is raised.

    """

    try:
        name_length = ord(data[0])
        extracted_name = safe_slice(data, 1, 1 + name_length)
        phone_length = ord(data[1 + name_length])
        extracted_phone = safe_slice(data, 2 + name_length, 2 + name_length + phone_length)
    except IndexError:
        raise ParseError()
    return extracted_name, extracted_phone

if __name__ == '__main__':
    print lazy_parse("\x04Jack\x0A0123456789") # OK
    print lazy_parse("\x04Jack\x0A012345678") # should raise ParseError

edit: the example was simpler to write using byte strings but my real code is using lists.

Community
  • 1
  • 1
kasyc
  • 293
  • 1
  • 9
  • You have asked for error checking at the end, but what about error checking in the beginning/middle? For example if the string was `"\x05Jack\x0a0123456789"` the first element woud be wrong, and the second would be messed up because of the first... or is this not a possibility? – Ethan Furman Nov 10 '11 at 16:42
  • The messed up in your example will raise an index error for phone number because ord('0') > len('123456789'). In my example, this is not the responsability of lazy_parse to validate name and phone grammar. However, an improvement would be to ensure there are no remaining elements after the phone number if we don't want "\x04Jack\x0A123456789Z" to be valid input. – kasyc Nov 11 '11 at 07:51
  • If errors in the beginning/middle are just as likely as errors at the end, only checking for errors at the end is not much of an improvement. – Ethan Furman Nov 11 '11 at 14:13
  • This is not a question of looking where are the errors because we simply can't without knowing what name and phone may look like. This a question of checking if data could be unpacked using the size0-data0-size1-data1 format. If it is not possible, it only mean there is something wrong **somewhere** (at the end or not !). – kasyc Nov 14 '11 at 08:48
  • I see -- you are not validating each piece, but rather the whole thing by using the last piece. – Ethan Furman Nov 14 '11 at 16:43

4 Answers4

5

Here's one way that is arguably more Pythonic. If you want to parse a byte string you can use the struct module that is provided for that exact purpose:

import struct
from collections import namedtuple
Details = namedtuple('Details', 'name phone')

def lazy_parse(data):
    """extract (name, phone) from a data buffer.
    If the buffer could not be parsed, a ParseError is raised.

    """
    try:
        name = struct.unpack_from("%dp" % len(data), data)[0]
        phone = struct.unpack_from("%dp" % (len(data)-len(name)-1), data, len(name)+1)[0]
    except struct.error:
        raise ParseError()
    return Details(name, phone)

What I still find unpythonic about that is throwing away the useful struct.error traceback to replace with a ParseError whatever that is: the original tells you what is wrong with the string, the latter only tells you that something is wrong.

Duncan
  • 92,073
  • 11
  • 122
  • 156
  • You are right, the struct module is a good one when byte strings are involved. It would also be good to keep the traceback details as you said. But I was looking for a more general answer as my real code is not dealing with byte strings but lists. – kasyc Nov 10 '11 at 10:46
3

Using a function like safe_slice would be faster than creating an object just to perform the slice, but if speed is not a bottleneck and you are looking for a nicer interface, you could define a class with a __getitem__ to perform checks before returning the slice.

This allows you to use nice slice notation instead of having to pass both the start and stop arguments to safe_slice.

class SafeSlice(object):
    # slice rules: http://docs.python.org/library/stdtypes.html#sequence-types-str-unicode-list-tuple-bytearray-buffer-xrange
    def __init__(self,seq):
        self.seq=seq
    def __getitem__(self,key):
        seq=self.seq
        if isinstance(key,slice):
            start,stop,step=key.start,key.stop,key.step
            if start:
                seq[start]
            if stop:
                if stop<0: stop=len(seq)+stop
                seq[stop-1]
        return seq[key]

seq=[1]
print(seq[:-2])
# []
print(SafeSlice(seq)[:-1])
# []
print(SafeSlice(seq)[:-2])
# IndexError: list index out of range

If speed is an issue, then I suggest just testing the end points instead of doing arithmetic. Item access for Python lists is O(1). The version of safe_slice below also allows you to pass 2,3 or 4 arguments. With just 2 arguments, the second will be interpreted as the stop value, (similar to range).

def safe_slice(seq, start, stop=None, step=1):
    if stop is None:
        stop=start
        start=0
    else:
        seq[start]
    if stop<0: stop=len(seq)+stop
    seq[stop-1]        
    return seq[start:stop:step]
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • I like your way to add simple checks for first and last items before re-using standard slicing. It is far more general than my restricted original safe_slice function. Testing the start index is easy but testing the stop index is a bit tricky. In your example SafeSlice(seq)[:1] will fail. To fix that, I believe we will have to compute last index using start, stop and stride values. – kasyc Nov 10 '11 at 13:07
  • After your last edition, SafeSlice(seq)[5:5] is raising an IndexError but I think (that's arguable) it shouldn't. That aside, this class suits my needs (but does not handle step != 1 cases). safe_slice(0) have a bug because it returns seq[-1]. – kasyc Nov 14 '11 at 11:28
  • Can you provide concrete examples of the behavior you want? Saying `SafeSlice(seq)[5:5]` raises an IndexError is unclear since you haven't said what `seq` is. Saying `safe_slice(0)` has a bug is true -- the first argument to `safe_slice` should be a sequence, not an integer. – unutbu Nov 14 '11 at 13:01
  • I want `SafeSlice(seq)[5:5]` to return `[]` with `seq = [1]`. Please forget what I said about `safe_slice` function in my last comment. Although you mentioned it, I have been tricked by the `range` like interface. – kasyc Nov 14 '11 at 13:31
  • If that is the behavior you want, then perhaps you need to use the code you posted in your original question. My version of safe_slice tests the end points `seq[start]` and `seq[stop-1]`. The whole basis for my test is wrong if you want `SafeSlice([1])[5:5]` to return `[]`. – unutbu Nov 14 '11 at 13:57
  • I don't know because I still like your way to test the first and the last items. I have written a full featured SafeSlice class (with step handling !) re-using your ideas, I will post it as an answer. – kasyc Nov 15 '11 at 17:12
2

Here is a more pythonic, more general rewrite of your code:

class ParseError(Exception):
    pass

def safe_slice(data, start, end, exc=IndexError):
    """0 <= start <= end is assumed"""
    r = data[start:end]
    if len(r) != end - start:
        raise exc()
    return r

def lazy_parse(data):
    """extract (name, phone) from a data buffer.
    If the buffer could not be parsed, a ParseError is raised."""
    results = []
    ptr = 0
    while ptr < len(data):
        length = ord(data[ptr])
        ptr += 1
        results.append(safe_slice(data, ptr, ptr + length, exc=ParseError))
        ptr += length
    return tuple(results)

if __name__ == '__main__':
    print lazy_parse("\x04Jack\x0A0123456789") # OK
    print lazy_parse("\x04Jack\x0A012345678") # should raise ParseError

Most of the changes are in the body of lazy_parse -- it will now work with multiple values instead of just two, and the correctness of the whole thing still depends on the last element being able to be parsed out exactly.

Also, rather than have safe_slice raise an IndexError which lazy_parse changes into a ParseError, I have lazy_parse give the desired exception to safe_slice to be raised in case of error (lazy_parse defaults to IndexError if nothing is passed to it).

Finally, lazy_parse isn't -- it's processing the entire string at once and returning all the results. 'Lazy' in Python means doing only what is needed to return the next piece. In the case of lazy_parse it would mean returning the name, then on a later call returning the phone. With only a slight modification we can make lazy_parse lazy:

def lazy_parse(data):
    """extract (name, phone) from a data buffer.
    If the buffer could not be parsed, a ParseError is raised."""
    ptr = 0
    while ptr < len(data):
        length = ord(data[ptr])
        ptr += 1
        result = (safe_slice(data, ptr, ptr + length, ParseError))
        ptr += length
        yield result

if __name__ == '__main__':
    print list(lazy_parse("\x04Jack\x0A0123456789")) # OK
    print list(lazy_parse("\x04Jack\x0A012345678")) # should raise IndexError

lazy_parse is now a generator that returns one piece at a time. Notice that we had to put list() around the lazy_parse call in the main section get lazy_parse to give us all the results in order to print them.

Depending on what you're doing this might not be the desired way, however, as it can be more difficult to recover from errors:

for item in lazy_parse(some_data):
    result = do_stuff_with(item)
    make_changes_with(result)
    ...

By the time the ParseError is raised you may have made changes that are difficult or impossible to back out. The solution in a case like this would be to do the same as we did in the print part of main:

for item in list(lazy_parse(some_data)):
    ...

The list call completely consumes lazy_parse and gives us a list of the results, and if an error was raised we'll know about it before we process the first item in the loop.

Ethan Furman
  • 63,992
  • 20
  • 159
  • 237
  • You are right about Python lazy terminology. I should not have called lazy_parse that way because a generator is not what I want for the reasons you said. Except for the generator part, your response is better than my original code but not that different for the safe_slice part. I am working on a full featured SafeSlice class rewrite I will post as an answer. – kasyc Nov 15 '11 at 17:09
2

Here is a complete SafeSlice class re-using https://stackoverflow.com/users/107660/duncan and https://stackoverflow.com/users/190597/unutbu answers. The class is quite big because it have full slice support (start, stop and step). This may be overkill for the simple job done in the example but for a more complete real life problem, it might prove useful.

from __future__ import division
from collections import MutableSequence
from collections import namedtuple
from math import ceil

class ParseError(Exception):
    pass

Details = namedtuple('Details', 'name phone')

def parse_details(data):
    safe_data = SafeSlice(bytearray(data)) # because SafeSlice expects a mutable object
    try:
        name_length = safe_data.pop(0)
        name = safe_data.popslice(slice(name_length))
        phone_length = safe_data.pop(0)
        phone = safe_data.popslice(slice(phone_length))
    except IndexError:
        raise ParseError()
    if safe_data:
        # safe_data should be empty at this point
        raise ParseError()
    return Details(name, phone)

def main():
    print parse_details("\x04Jack\x0A0123456789") # OK
    print parse_details("\x04Jack\x0A012345678") # should raise ParseError

SliceDetails = namedtuple('SliceDetails', 'first last length')

class SafeSlice(MutableSequence):
    """This implementation of a MutableSequence gives IndexError with invalid slices"""
    def __init__(self, mutable_sequence):
        self._data = mutable_sequence

    def __str__(self):
        return str(self._data)

    def __repr__(self):
        return repr(self._data)

    def __len__(self):
        return len(self._data)

    def computeindexes(self, ii):
        """Given a slice or an index, this method computes what would ideally be
        the first index, the last index and the length if the SafeSequence was
        accessed using this parameter.

        None indexes will be returned if the computed length is 0.
        First and last indexes may be negative. This means that they are invalid
        indexes. (ie: range(2)[-4:-3] will return first=-2, last=-1 and length=1)
        """
        if isinstance(ii, slice):
            start, stop, step = ii.start, ii.stop, ii.step
            if start is None:
                start = 0
            elif start < 0:
                start = len(self._data) + start
            if stop is None:
                stop = len(self._data)
            elif stop < 0:
                stop = len(self._data) + stop
            if step is None:
                step = 1
            elif step == 0:
                raise ValueError, "slice step cannot be zero"

            length = ceil((stop - start) / step)
            length = int(max(0, length))
            if length:
                first_index = start
                last_index = start + (length - 1) * step
            else:
                first_index, last_index = None, None
        else:
            length = 1
            if ii < 0:
                first_index = last_index = len(self._data) + ii
            else:
                first_index = last_index = ii
        return SliceDetails(first_index, last_index, length)

    def slicecheck(self, ii):
        """Check if the first and the last item of parameter could be accessed"""
        slice_details = self.computeindexes(ii)
        if slice_details.first is not None:
            if slice_details.first < 0:
                # first is *really* negative
                self._data[slice_details.first - len(self._data)]
            else:
                self._data[slice_details.first]
        if slice_details.last is not None:
            if slice_details.last < 0:
                # last is *really* negative
                self._data[slice_details.last - len(self._data)]
            else:
                self._data[slice_details.last]

    def __delitem__(self, ii):
        self.slicecheck(ii)
        del self._data[ii]

    def __setitem__(self, ii, value):
        self.slicecheck(ii)
        self._data[ii] = value

    def __getitem__(self, ii):
        self.slicecheck(ii)
        r = self._data[ii]
        if isinstance(ii, slice):
            r = SafeSlice(r)
        return r

    def popslice(self, ii):
        """Same as pop but a slice may be used as index."""
        self.slicecheck(ii)
        r = self._data[ii]
        if isinstance(ii, slice):
            r = SafeSlice(r)
        del self._data[ii]
        return r

    def insert(self, i, value):
        length = len(self._data)
        if -length <= i <= length:
            self._data.insert(i, value)
        else:
            self._data[i]

if __name__ == '__main__':
    main()
Community
  • 1
  • 1
kasyc
  • 293
  • 1
  • 9