3

I have a string, say "The quick brown fox jumps over the lazy dog", and a list [1, 8, 14, 18, 27] that indicates where to cut the string.

What I expect to get is a list that contains parts of the cut string. For this example, the output should be:

['T', 'he quic', 'k brow', 'n fo', 'x jumps o', 'ver the lazy dog']

My intuitive and naive way is to simply write a for loop, remember the previous index, slice the string and append the slice to output.

_str="The quick brown fox jumps over the lazy dog"
cut=[1, 8, 14, 18, 27]
prev=0
out=[]
for i in cut:
    out.append(_str[prev:i])
    prev=i
out.append(_str[prev:])

Is there any better way?

timgeb
YiFei

3 Answers

12

Here's how I would do it:

>>> s = "The quick brown fox jumps over the lazy dog"
>>> l = [1, 8, 14, 18, 27]
>>> l = [0] + l + [len(s)]
>>> [s[x:y] for x,y in zip(l, l[1:])]
['T', 'he quic', 'k brow', 'n fo', 'x jumps o', 'ver the lazy dog']

Some explanation:

I'm adding 0 to the front and len(s) to the end of the list, such that

>>> zip(l, l[1:])
[(0, 1), (1, 8), (8, 14), (14, 18), (18, 27), (27, 43)]

gives me a sequence of tuples of slice indices. All that's left to do is unpack those indices in a list comprehension and generate the slices you want.
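The same idea wrapped in a function, as a minimal sketch for Python 3. The name cut_string is hypothetical, and the cut points are assumed to be sorted and within the bounds of the string.

```python
def cut_string(s, cuts):
    """Split s at each index in cuts (assumed sorted, within 0..len(s))."""
    # Pad with both endpoints so every slice has a start and an end.
    bounds = [0] + list(cuts) + [len(s)]
    # Pair each boundary with its successor: (0, 1), (1, 8), ...
    return [s[start:end] for start, end in zip(bounds, bounds[1:])]


s = "The quick brown fox jumps over the lazy dog"
print(cut_string(s, [1, 8, 14, 18, 27]))
```

Note that in Python 3 zip returns an iterator, so you would need list(zip(l, l[1:])) to see the pairs printed as a list.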

edit:

If you really care about the memory footprint of this operation, because you often deal with very large strings and lists, use generators all the way and build your list l such that it includes the 0 and len(s) in the first place.

For Python 2:

>>> from itertools import izip, tee
>>> s = "The quick brown fox jumps over the lazy dog"
>>> l = [0, 1, 8, 14, 18, 27, 43]
>>> 
>>> def get_slices(s, l):
...     it1, it2 = tee(l)
...     next(it2)
...     for start, end in izip(it1, it2):
...         yield s[start:end]
... 
>>> list(get_slices(s,l))
['T', 'he quic', 'k brow', 'n fo', 'x jumps o', 'ver the lazy dog']

For Python 3:
zip does what izip did in Python 2, so use the built-in zip in place of izip (see the Python 3.3+ version).

For Python 3.3+ with the yield from syntax:

>>> from itertools import tee
>>> s = "The quick brown fox jumps over the lazy dog"
>>> l = [0, 1, 8, 14, 18, 27, 43]
>>> 
>>> def get_slices(s, l):
...     it1, it2 = tee(l)
...     next(it2)
...     yield from (s[start:end] for start, end in zip(it1, it2))
...     
>>> list(get_slices(s,l))
['T', 'he quic', 'k brow', 'n fo', 'x jumps o', 'ver the lazy dog']
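If the list of cut points does not already contain 0 and len(s), one possible variant adds the endpoints lazily with itertools.chain instead of building a new list. This is only a sketch for Python 3, not part of the functions above, and the name get_slices_padded is hypothetical.

```python
from itertools import chain, tee


def get_slices_padded(s, cuts):
    """Yield slices of s between consecutive cut points, endpoints implied."""
    # chain() prepends 0 and appends len(s) lazily, without copying cuts.
    bounds = chain([0], cuts, [len(s)])
    starts, ends = tee(bounds)
    next(ends)  # advance the second iterator so the pairs are offset by one
    for start, end in zip(starts, ends):
        yield s[start:end]


s = "The quick brown fox jumps over the lazy dog"
print(list(get_slices_padded(s, [1, 8, 14, 18, 27])))
```

Because the endpoints are always present, next(ends) cannot fail here, even when cuts is empty.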
timgeb
  • I don't think creating multiple copies of data just to get the code down to a couple of lines is "pythonic" or efficient. – Padraic Cunningham Feb 28 '16 at 10:23
  • @EddoHintoso I think sanitizing the input data should be the job of the caller, but of course you could build in as many sanity checks as you wish. – timgeb Feb 28 '16 at 15:39
  • Your tee example is flawed as you use it as if there is a 0 at the start which there is not, adding one at the start would mean an O(n) insert or creating a new list which you fail to address, also adding manually to an empty list would change the output. – Padraic Cunningham Feb 28 '16 at 20:34
  • @PadraicCunningham I explained that the list should be built in such a way that it includes 0 and `len(s)` in the first place. The functions are expecting a list which explicitly contains all the slicing points - maybe you did miss that. The reason is that now the function works for any kind of sorted input list, for example when you actually don't want to start at the beginning and the end when generating your slices. – timgeb Feb 28 '16 at 21:03
  • @timgeb, that means building or changing the list which all costs something just so tee can be used, I don't understand people's obsession with sacrificing quality just to reduce code by a couple of lines in python. – Padraic Cunningham Feb 28 '16 at 21:13
  • @PadraicCunningham At the danger of repeating myself: my suggestion is to build the list in a way to contain all the slicing points explicitly *when creating it for the first time* such that a) it does not need to be modified later on and b) the generator which gives you the slices works in a more general way. I don't sacrifice quality for that, I am making a suggestion to improve the quality and applicability of the function. You have written your answer, I have written my answer, just leave it be. – timgeb Feb 28 '16 at 21:16
  • If the list comes from an outside source then you don't have control so you have to alter it to make it work, that is my point. Writing a regular generator function works as is albeit a few more lines of code. – Padraic Cunningham Feb 28 '16 at 21:17
  • @PadraicCunningham The outside source should be able to read the documentation that was written for the function. If you can't trust your input data, you need to sanitize it anyway. Guess what, it also crashes if you call it with two integers. – timgeb Feb 28 '16 at 21:21
1

A recursive method:

def split(cut, s):
    # Note: cut.pop() consumes the caller's list; pass a copy to keep it.
    if cut:
        b = cut.pop()  # take the last cut point
        return split(cut, s[:b]) + [s[b:]]
    return [s]
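Note that cut.pop() empties the list that is passed in. A sketch of a variant that leaves its argument untouched, assuming the cut points are sorted (the name split_at is hypothetical):

```python
def split_at(cuts, s):
    """Recursively split s at the sorted indices in cuts, leaving cuts intact."""
    if not cuts:
        return [s]
    last = cuts[-1]
    # Slicing builds a shorter list instead of popping from the caller's list.
    return split_at(cuts[:-1], s[:last]) + [s[last:]]


cut = [1, 8, 14, 18, 27]
print(split_at(cut, "The quick brown fox jumps over the lazy dog"))
print(cut)  # the caller's list still holds all five cut points
```

Either version recurses once per cut point, so a list of cut points approaching Python's default recursion limit (1000) would raise a RecursionError.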
B. M.
1

You can do it with a generator function:

def sli(s, inds):
    it = iter(inds)
    p = next(it)
    yield s[:p]
    for i in it:
        yield s[p:i]
        p = i
    yield s[p:]

print(list(sli(_str, cut)))
['T', 'he quic', 'k brow', 'n fo', 'x jumps o', 'ver the lazy dog']

That creates a single list of the slices, and the generator itself can be evaluated lazily.

You also need to consider an empty string being passed unless you want a list of empty strings:

def sli(s, inds):
    if not s:
        return
    it = iter(inds)
    p = next(it)
    yield s[:p]
    for i in it:
        yield s[p:i]
        p = i
    yield s[p:]
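One more edge case worth noting: both versions call next(it) on the index iterator, so an empty inds raises StopIteration inside the generator, which Python 3.7 and later report as a RuntimeError. A sketch of a variant that starts from index 0 instead and so needs no next() call (the name sli_safe is hypothetical):

```python
def sli_safe(s, inds):
    """Yield slices of s cut at the sorted indices in inds."""
    if not s:
        return  # empty string: yield nothing, as in the version above
    prev = 0
    for i in inds:
        yield s[prev:i]
        prev = i
    # Whatever remains after the last cut (the whole string if inds is empty).
    yield s[prev:]


print(list(sli_safe("The quick brown fox jumps over the lazy dog", [1, 8, 14, 18, 27])))
print(list(sli_safe("abc", [])))
```

With an empty inds this yields the whole string as a single slice; whether that is the right behaviour depends on what the caller expects.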

On top of being more robust and using less memory, it is also faster:

Python3:

l = sorted(random.sample(list(range(5000)), 1000))

%%timeit
_l = [0] + l + [len(s)]
[s[x:y] for x,y in zip(_l, _l[1:])]
....: 

1000 loops, best of 3: 368 µs per loop

In [39]: timeit list(sli(s, l))
1000 loops, best of 3: 311 µs per loop

Python2:

In [8]: s = "The quick brown fox jumps over the lazy dog"

In [9]: s *= 1000

In [10]: l = sorted(random.sample(list(range(5000)), 1000))

In [11]: %%timeit

_l = [0] + l + [len(s)]
[s[x:y] for x,y in zip(_l, _l[1:])]
....: 
1000 loops, best of 3: 321 µs per loop

In [12]: timeit list(sli(s, l))
1000 loops, best of 3: 204 µs per loop

Writing your own function is perfectly pythonic and in this case more efficient than trying to compress the code to a couple of lines.
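To repeat the comparison outside IPython, something like the following sketch with the standard timeit module should work. The helper name zip_slices and the loop counts are my own choices, and the absolute numbers will differ from machine to machine and between Python versions.

```python
import random
import timeit


def sli(s, inds):
    # Generator version from the answer above.
    it = iter(inds)
    p = next(it)
    yield s[:p]
    for i in it:
        yield s[p:i]
        p = i
    yield s[p:]


def zip_slices(s, l):
    # List-comprehension version being compared against.
    _l = [0] + l + [len(s)]
    return [s[x:y] for x, y in zip(_l, _l[1:])]


s = "The quick brown fox jumps over the lazy dog" * 1000
l = sorted(random.sample(range(5000), 1000))

# Both approaches must agree before their timings are worth comparing.
assert list(sli(s, l)) == zip_slices(s, l)

for name, func in [("generator", lambda: list(sli(s, l))),
                   ("zip + slices", lambda: zip_slices(s, l))]:
    best = min(timeit.repeat(func, number=200, repeat=3))
    print(f"{name}: {best / 200 * 1e6:.0f} us per loop")
```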

Padraic Cunningham