Efficiently partition a string at arbitrary index

Question

Given an arbitrary string (i.e., not based on a pattern), say:

>>> string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

I am trying to partition a string based on a list of indexes.

Here is what I tried, which does work:

import string

def split_at_idx(txt, idx):
    new_li=[None]*2*len(idx)
    new_li[0::2]=idx
    new_li[1::2]=[e for e in idx]
    new_li=[0]+new_li+[len(txt)]
    new_li=[new_li[i:i+2] for i in range(0,len(new_li),2)]  
    print(new_li)
    return [txt[st:end] for st, end in new_li]

print(split_at_idx(string.ascii_letters, [3,10,12,40]))  
# ['abc', 'defghij', 'kl', 'mnopqrstuvwxyzABCDEFGHIJKLMN', 'OPQRSTUVWXYZ']

The split is based on a list of indexes [3,10,12,40]. This list then needs to be transformed into a list of start, end pairs like [[0, 3], [3, 10], [10, 12], [12, 40], [40, 52]]. I used a slice assignment to set the evens and odds, then a list comprehension to group into pairs and a second LC to return the partitions.

This seems a little complex for such a simple function. Is there a better / more efficient / more idiomatic way to do this?

I don't understand: you say "partition", but your code and your example *throws away* characters too. For example, the letters `d` and `k` don't appear in your output. Is that really what you want? — Tim Peters, Dec 20 '13 at 04:10
No. My oversight. See edit. Dopey me; I was using this on something with a lot of whitespace and didn't even notice. Thx. — dawg, Dec 20 '13 at 04:15

score 8 · Accepted Answer · answered Dec 20 '13 at 04:16

I have a feeling someone asked this question very recently, but I can't find it now. Assuming that the dropped letters were an accident, couldn't you just do:

def split_at_idx(s, idx):
    return [s[i:j] for i,j in zip([0]+idx, idx+[None])]

after which we have

>>> split_at_idx(string.ascii_letters, [3, 10, 12, 40])
['abc', 'defghij', 'kl', 'mnopqrstuvwxyzABCDEFGHIJKLMN', 'OPQRSTUVWXYZ']
>>> split_at_idx(string.ascii_letters, [])
['abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ']
>>> split_at_idx(string.ascii_letters, [13, 26, 39])
['abcdefghijklm', 'nopqrstuvwxyz', 'ABCDEFGHIJKLM', 'NOPQRSTUVWXYZ']
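To see why the `None` at the end works, here is a minimal sketch that exposes the (start, end) pairs that `zip` builds. The helper name `boundary_pairs` is made up for illustration and is not part of the answer. It assumes `idx` is a sorted list, as in the examples above.

```python
import string

def boundary_pairs(idx):
    # Hypothetical helper: pair each start with the next split point.
    # The final pair ends in None, which a slice treats as "to the end".
    return list(zip([0] + idx, idx + [None]))

pairs = boundary_pairs([3, 10, 12, 40])
print(pairs)  # [(0, 3), (3, 10), (10, 12), (12, 40), (40, None)]

# s[40:None] is the same slice as s[40:]
print(string.ascii_letters[40:None])  # OPQRSTUVWXYZ
```

Because the last end point is `None`, the function never needs `len(s)`.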

answered Dec 20 '13 at 04:16

DSM


Perfect! I was playing with zip but just didn't connect the dots. THANKS! – dawg Dec 20 '13 at 04:23
[This](https://stackoverflow.com/questions/1198512/split-a-list-into-parts-based-on-a-set-of-indexes-in-python) might be the question that you were talking about, but it is about lists and not strings, although the solutions proposed are the same, apart from the numpy one. That's because numpy.split() function acts upon lists and not strings. – pgmank Mar 17 '18 at 13:36

score 1 · David Z · Answer 2 · 2013-12-20T20:32:19.103

This seems like a job for itertools.groupby.

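from bisect import bisect_right
from itertools import groupby

def split_at_indices(text, indices):
    # Group characters by how many split points are <= their position.
    return [''.join(e[1] for e in g) for k, g in groupby(
        enumerate(text), key=lambda x: bisect_right(indices, x[0])
    )]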

This needs bisect_right from the bisect module and groupby from itertools.

This works the way you'd think an efficient implementation should: for each character in the string, it uses binary search in indices to compute a number representing which string in the final list that character should go in, and then groupby separates the characters by those numbers. It assumes indices is sorted, since bisect_right requires that. In practice, though, it turns out to be slower than plain slicing in most cases, because slicing is so quick.
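As a rough sketch of the grouping step, the snippet below prints the group number that bisect_right assigns to the first 13 character positions, checks that the result matches the slicing approach on the example input, and times both. The function names `split_by_groupby` and `split_by_zip` are made up for this comparison, and the timings will vary by machine and Python version.

```python
import string
import timeit
from bisect import bisect_right
from itertools import groupby

def split_by_groupby(text, indices):
    # Same idea as the groupby answer, with the list returned explicitly.
    return [''.join(ch for _, ch in g) for _, g in groupby(
        enumerate(text), key=lambda x: bisect_right(indices, x[0]))]

def split_by_zip(text, indices):
    # The slicing approach from the accepted answer, for comparison.
    return [text[i:j] for i, j in zip([0] + indices, indices + [None])]

indices = [3, 10, 12, 40]

# Group number per position = number of split points <= that position.
keys = [bisect_right(indices, pos) for pos in range(13)]
print(keys)  # [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3]

# Both approaches agree on this input.
assert split_by_groupby(string.ascii_letters, indices) == \
    split_by_zip(string.ascii_letters, indices)

# Rough timing only; no particular numbers are claimed here.
for fn in (split_by_groupby, split_by_zip):
    elapsed = timeit.timeit(lambda: fn(string.ascii_letters, indices), number=10000)
    print(fn.__name__, elapsed)
```

The groupby version calls a Python-level key function once per character, while the zip version does one slice per output piece, which is the likely reason for the speed difference mentioned above.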

edited Dec 20 '13 at 20:32

answered Dec 20 '13 at 04:24

David Z


You are an evil genius – Suor Dec 20 '13 at 06:12
Why was this down voted? It is also a great solution, but not as straightforward as `zip` +1 from me. – dawg Dec 20 '13 at 20:13
