1

Say I have a set of ~100,000 different numbers. Some are sequential, some are not.

To demonstrate the problem, a small subset of these numbers might be:

(a) {1,2,3,4,5,6,7,8,9,11,13,15,45,46,47,3467}

An efficient way of writing this subset is as follows:

(b) 1:9:1,11:15:2,45:47:1,3467

This is effectively an extended version of python's and matlab's slice notation.

My question is: How can I efficiently obtain a list in the latter notation in Python, from a list of the former type?

I.e., given (a), how can I efficiently obtain (b) in Python?

Kobs
  • 209
  • 3
  • 8

3 Answers3

1

Disclaimer: I misread the question and thought you wanted to go from the slice notation to the set version, this doesn't actually answer your question but I figured it was worth leaving posted. It also seems that numpy._r does the same (or at least very similar) thing.

First off note that if you are using python 3.5+ PEP 3132 gives is an option to use the *unpacking method in set literals:

>>> {*range(1,9), *range(11,15,2), *range(45,47), 3467}
{1, 2, 3, 4, 5, 6, 7, 8, 11, 3467, 13, 45, 46}

Otherwise the notation 11:15:2 is only used when __getitem__ or __setitem__ is used on an object, so you would just need to set up an object that will generate your sets:

def slice_to_range(slice_obj):
    assert isinstance(slice_obj, slice)
    assert slice_obj.stop is not None, "cannot have stop of None"
    start = slice_obj.start or 0
    stop = slice_obj.stop
    step = slice_obj.step or 1
    return range(start,stop,step)

class Slice_Set_Creator:
    def __getitem__(self,item):
        my_set = set()
        for part in item:
            if isinstance(part,slice):
                my_set.update(slice_to_range(part))
            else:
                my_set.add(part)
        return my_set

slice_set_creator = Slice_Set_Creator()

desired_set = slice_set_creator[1:9:1,11:15:2,45:47:1,3467]

>>> desired_set
{1, 2, 3, 4, 5, 6, 7, 8, 11, 3467, 13, 45, 46}
Tadhg McDonald-Jensen
  • 20,699
  • 5
  • 35
  • 59
  • Indeed, I would like to go the other way! Nevertheless, thanks for taking the time for to provide the opposite answer. Might be useful for another user one day. – Kobs Jun 20 '16 at 14:39
  • 1
    Yeah sorry about that, Learned I could make ranges with this notation and got excited I could show it off that I kind of lost track of everything else (blushes) Going from a set to the slices seems like it would have a lot more involved, for one you would probably need the numbers in order which means making the 100 000+ numbers `sorted`. Also how to determine lone numbers vs a slice of two numbers (or would every slice need to be at least 3 numbers?) what about `{2,4,6,7,8,9}` would it be `[2:7:2, 8:10:1]` or `[2:5:2, 7:10:1]` or something else entirely? – Tadhg McDonald-Jensen Jun 20 '16 at 14:46
  • 2:6:2,7:9:1 two to six in strides/increments of 2, seven to nine in strides/increments of 1. As long as any numbers are sequential, even just two, I would use the slice notation. E.g. 1,2,35 would be 1:2:1,35. – Kobs Jun 20 '16 at 14:55
  • 1
    Ok that is what I guessed but I'm glad I asked because that made me realize you are using inclusive endpoints, I just posted an answer that I know works with your test case and a few others but not really sure if it is completely correct. – Tadhg McDonald-Jensen Jun 20 '16 at 15:25
1

I think I got it but the following code was not very thoroughly tested and may contain bugs.

Basically get_partial_slices will try to create partial_slice objects, when the next number in the (sorted) set does not .fit() into the slice it is .end()ed and the next slice is started.

If a slice only has 1 item in it (or 2 items and step!=1) it is represented as separate numbers instead of a slice (hence the need for yield from current.end() since ending the slice may result in two numbers instead of one slice.)

class partial_slice:
    """heavily relied on by get_partial_slices
This attempts to create a slice from repeatedly adding numbers
once a number that doesn't fit the slice is found use .end()
to generate either the slice or the individual numbers"""
    def __init__(self, n):
        self.start = n
        self.stop = None
        self.step = None
    def fit(self,n):
        "returns True if n fits as the next element of the slice (or False if it does not"
        if self.step is None:
            return True #always take the second element into consideration
        elif self.stop == n:
            return True #n fits perfectly with current stop value
        else:
            return False

    def add(self, n):
        """adds a number to the end of the slice, 
    will raise a ValueError if the number doesn't fit"""
        if not self.fit(n):
            raise ValueError("{} does not fit into the slice".format(n))
        if self.step is None:
            self.step = n - self.start
        self.stop = n+self.step

    def to_slice(self):
        "return slice(self.start, self.stop, self.step)"
        return slice(self.start, self.stop, self.step)
    def end(self):
        "generates at most 3 items, may split up small slices"
        if self.step is None:
            yield self.start
            return
        length = (self.stop - self.start)//self.step
        if length>2:
            #always keep slices that contain more then 2 items
            yield self.to_slice()
            return 
        elif self.step==1 and length==2:
            yield self.to_slice()
            return
        else:
            yield self.start
            yield self.stop - self.step


def get_partial_slices(set_):
    data = iter(sorted(set_))
    current = partial_slice(next(data))
    for n in data:
        if current.fit(n):
            current.add(n)
        else:
            yield from current.end()
            current = partial_slice(n)
    yield from current.end()


test_case = {1,2,3,4,5,6,7,8,9,11,13,15,45,46,47,3467}
result = tuple(get_partial_slices(test_case))

#slice_set_creator is from my other answer,
#this will verify that the result was the same as the test case.
assert test_case == slice_set_creator[result] 

def slice_formatter(obj):
    if isinstance(obj,slice):
        # the actual slice objects, like all indexing in python, doesn't include the stop value
        # I added this part to modify it when printing but not when created because the slice 
        # objects can actually be used in code if you want (like with slice_set_creator)
        inclusive_stop = obj.stop - obj.step
        return "{0.start}:{stop}:{0.step}".format(obj, stop=inclusive_stop)
    else:
        return repr(obj)

print(", ".join(map(slice_formatter,result)))
Tadhg McDonald-Jensen
  • 20,699
  • 5
  • 35
  • 59
  • This does precisely what I want. To get it to work in python 2.7 I had to change "yield from current.end()" to "for bar in current.end():yield bar". Currently I always have one step too many, but I should be able to fix this issue fairly easily by myself. Thanks a lot for thinking up this neat solution. – Kobs Jun 20 '16 at 16:13
  • 1
    No problem, and the one step too many is because I am using python slices that do not include the endpoint, you may have noticed that I compensated for that in `slice_formatter` for printing purposes but assuming you leave the actual slice objects as they are you could theoretically use them in actual code. (otherwise yeah you would just change `to_slice`) – Tadhg McDonald-Jensen Jun 20 '16 at 16:17
1

The simplest way is to use numpy's r_[] syntax. So for your example, it would just be:

>>> from numpy import r_
>>>
>>> a = r_[1:10, 11:17:2, 45:48, 3467]

Keep in mind that python slices don't include the last number, and the x:y:1 is implied. And this approach will not be as fast in production code as another, more sophisticated solution, but it is good for interactive use.

You can see that this gives you a numpy array with the numbers you want:

>>> print(a)
[   1    2    3    4    5    6    7    8    9   11   13   15   45   46   47
 3467]
TheBlackCat
  • 9,791
  • 3
  • 24
  • 31