How to split a list into sublists based on a separator, similar to str.split()?

Question

Given a list like:

[a, SEP, b, c, SEP, SEP, d]

how do I split it into a list of sublists:

[[a], [b, c], [], [d]]

Effectively I need an equivalent of str.split() for lists. I can hack something together, but I can't seem to come up with anything neat and/or pythonic.

I get the input from an iterator, so a generator working on that is acceptable as well.

More examples:

[a, SEP, SEP, SEP] -> [[a], [], [], []]

[a, b, c] -> [[a, b, c]]

[SEP] -> [[], []]
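
For reference, here is a small sketch of the str.split() behavior these examples mirror. The "|" separator and the sample strings are illustrative assumptions, not part of the original examples.

```python
# str.split() with an explicit separator keeps empty strings for
# leading, trailing, and consecutive separators.
examples = ["a|bc||d", "a|||", "abc", "|"]
results = [s.split("|") for s in examples]
for s, r in zip(examples, results):
    print(repr(s), "->", r)
# 'a|bc||d' -> ['a', 'bc', '', 'd']
# 'a|||' -> ['a', '', '', '']
# 'abc' -> ['abc']
# '|' -> ['', '']
```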

Huh, wonder how I failed to find the dupe question. But yeah, I want the empty lists too. — Jani, Jan 25 '19 at 20:29
Then should there be an empty list at each `sep`, or only when `sep` occurs twice in a row? What if `sep` occurs 6 times in a row? Could you clarify? — d_kennetz, Jan 25 '19 at 20:31
I want it to work exactly like `str.split()`, but for lists. Which means sep gets removed, consecutive seps lead to consecutive empty lists in between. (And it doesn't seem trivial to me to get `itertools.groupby` to do this, IMHO, making this distinct enough not to be a dupe.) — Jani, Jan 25 '19 at 20:36
@Jean-FrançoisFabre based on the updated examples, I agree with OP that this is not a dupe. I don't know if there's a simple `itertools.groupby` solution here. — pault, Jan 25 '19 at 22:06
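
On the itertools.groupby point raised in these comments: one possible sketch is to group runs of separators and non-separators, opening a new empty chunk for every separator in a run. The function name split_groupby is hypothetical, and this is only one way to do it, not a method proposed in the thread.

```python
from itertools import groupby

def split_groupby(iterable, sep):
    """Sketch: split on sep with groupby (illustrative name)."""
    chunks = [[]]
    for is_sep, run in groupby(iterable, key=lambda item: item == sep):
        if is_sep:
            # Every separator in the run closes one chunk and opens a new one.
            chunks.extend([] for _ in run)
        else:
            # A run of ordinary items belongs to the currently open chunk.
            chunks[-1].extend(run)
    return chunks

print(split_groupby(["a", "SEP", "b", "c", "SEP", "SEP", "d"], "SEP"))
# [['a'], ['b', 'c'], [], ['d']]
```

Since it only iterates over its input, this sketch should also accept an iterator, although it builds the full result before returning.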

wim · Accepted Answer · 2020-09-23T21:59:37.083

15

A simple generator will work for all of the cases in your question:

def split(sequence, sep):
    chunk = []
    for val in sequence:
        if val == sep:
            yield chunk
            chunk = []
        else:
            chunk.append(val)
    yield chunk
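
A short usage sketch, repeating the generator above so the snippet runs on its own. The string "SEP" stands in for the question's separator, and the sample inputs are assumptions for illustration.

```python
def split(sequence, sep):
    chunk = []
    for val in sequence:
        if val == sep:
            yield chunk
            chunk = []
        else:
            chunk.append(val)
    yield chunk

# The generator only iterates over its input, so a one-shot iterator works too.
tokens = iter(["a", "SEP", "b", "c", "SEP", "SEP", "d"])
print(list(split(tokens, "SEP")))   # [['a'], ['b', 'c'], [], ['d']]
print(list(split(["SEP"], "SEP")))  # [[], []]
print(list(split([], "SEP")))       # [[]]
```

The last call mirrors "".split(","), which returns [''].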

edited Sep 23 '20 at 21:59

answered Jan 25 '19 at 22:27

wim

This is brilliant. – pault Jan 25 '19 at 22:31
Beautiful, simple, easy to understand, pythonic, something I *knew* was possible, but I was unable to come up with myself. Seems to work as I expect. Thanks! – Jani Jan 26 '19 at 12:37

score 2 · Answer 2 · answered Jan 25 '19 at 22:33

2

My first ever Python program :)

from pprint import pprint
my_array = ["a", "SEP", "SEP", "SEP"]
my_temp = []
my_final = []
for item in my_array:
  if item != "SEP":
    my_temp.append(item)
  else:
    my_final.append(my_temp)
    my_temp = []
# Append the last chunk, which has no separator after it.
my_final.append(my_temp)
pprint(my_final)

answered Jan 25 '19 at 22:33

Matthew Page

This seems to have the same elements as in the answer I accepted. It's the genericity and use of a generator that tipped the scales to the other one. Thanks. – Jani Jan 26 '19 at 12:42
Good call, I gave that one an upvote as well, neat code – Matthew Page Jan 26 '19 at 12:44

score 0 · Answer 3 · answered Jan 25 '19 at 22:17

I am not sure if there's an easy itertools.groupby solution, but here is an iterative approach that should work:

def mySplit(iterable, sep):
    output = []
    sepcount = 0
    current_output = []
    for i, elem in enumerate(iterable):
        if elem != sep:
            sepcount = 0
            current_output.append(elem)
            if (i==(len(iterable)-1)):
                output.append(current_output)
        else:
            if current_output: 
                output.append(current_output)
                current_output = []

            sepcount+=1

            if (i==0) or (sepcount > 1):
                output.append([])
            if (i==(len(iterable)-1)):
                output.append([])

    return output

Testing on your examples:

testLists = [
    ['a', 'SEP', 'b', 'c', 'SEP', 'SEP', 'd'],
    ["a", "SEP", "SEP", "SEP"],
    ["SEP"],
    ["a", "b", "c"]
]

for tl in testLists:
    print(mySplit(tl, sep="SEP"))
#[['a'], ['b', 'c'], [], ['d']]
#[['a'], [], [], []]
#[[], []]
#[['a', 'b', 'c']]

This is analogous to the result you would get if examples were actually strings and you used str.split(sep):

for tl in testLists:
    print("".join(tl).split("SEP"))
#['a', 'bc', '', 'd']
#['a', '', '', '']
#['', '']
#['abc']

By the way, if every element other than the separator were guaranteed to be a single-character string, and no run of elements could spell out the separator, you could simply do:

for tl in testLists:
    print([list(x) for x in "".join(tl).split("SEP")])
#[['a'], ['b', 'c'], [], ['d']]
#[['a'], [], [], []]
#[[], []]
#[['a', 'b', 'c']]

But the mySplit() function is more general.
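
To illustrate why the join-based shortcut is less general, here is a small sketch with made-up inputs: one contains a multi-character element, the other contains elements that only spell the separator once joined.

```python
# "ab" is a single element, but joining and re-listing breaks it into characters.
data = ["ab", "SEP", "c"]
via_join = [list(x) for x in "".join(data).split("SEP")]
print(via_join)  # [['a', 'b'], ['c']]

# No element equals "SEP" here, yet the joined string contains it.
accidental = ["S", "EP", "x"]
via_join_accidental = [list(x) for x in "".join(accidental).split("SEP")]
print(via_join_accidental)  # [[], ['x']]
```

An element-wise split would return [['ab'], ['c']] and [['S', 'EP', 'x']] for these inputs.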

Seems correct to me. The implementation is unnecessarily complicated, but I didn't downvote... — wim, Jan 25 '19 at 23:00

a_guest · Answer 4 · 2019-01-26T18:38:45.100

0

For list or tuple objects you can use the following:

def split(seq, sep):
    start, stop = 0, -1
    while start < len(seq):
        try:
            stop = seq.index(sep, start)
        except ValueError:
            yield seq[start:]
            break
        yield seq[start:stop]
        start = stop + 1
    else:
        if stop == len(seq) - 1:
            yield []

It won't work with a generator, since it relies on len(), index() and slicing, but it's fast.
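
The speed difference can be checked locally with a rough timeit sketch like the one below. The names split_generator and split_index, the test data, and the repetition count are all assumptions for illustration; timings vary by machine and input, so no results are shown.

```python
import timeit

def split_generator(sequence, sep):
    # Element-by-element generator, as in the accepted answer.
    chunk = []
    for val in sequence:
        if val == sep:
            yield chunk
            chunk = []
        else:
            chunk.append(val)
    yield chunk

def split_index(seq, sep):
    # Index-and-slice approach from this answer (lists and tuples only).
    start, stop = 0, -1
    while start < len(seq):
        try:
            stop = seq.index(sep, start)
        except ValueError:
            yield seq[start:]
            break
        yield seq[start:stop]
        start = stop + 1
    else:
        if stop == len(seq) - 1:
            yield []

# Illustrative data: 1000 chunks of nine items, each followed by a separator.
data = ([0] * 9 + ["SEP"]) * 1000

for fn in (split_generator, split_index):
    seconds = timeit.timeit(lambda: list(fn(data, "SEP")), number=50)
    print(f"{fn.__name__}: {seconds:.4f} s for 50 runs")
```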

edited Jan 26 '19 at 18:38

answered Jan 25 '19 at 23:26

a_guest

AFAICT this does not produce the desired results. – Jani Jan 26 '19 at 12:31
@Jani You are right. I suppose you're referring to the case where a `SEP` is at the end of the list? It's not too difficult to account for that case, in form of a final if statement (hence no performance degradation). Please see my updated answer. – a_guest Jan 26 '19 at 18:43
Per quick testing, the updated answer does seem to produce the result I want. However, I still think @wim's answer is the more elegant one. Thanks. – Jani Jan 26 '19 at 18:48
@Jani Sure! You should select whichever solution suits you best. However I'd like to point out that, if you already start with a `list`, this approach can give you a significant speedup. Tested on my machine I got ~ 4x speedup compared to the accepted answer for both small and large as well as sparse and dense lists. – a_guest Jan 26 '19 at 22:00

a_guest · Answer 5 · 2019-01-26T22:04:07.057

0

You can use itertools.takewhile:

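import itertools as it

def split(seq, sep):
    last = sep  # most recent item drawn from the input

    def tracked(iterable):
        nonlocal last
        for last in iterable:
            yield last

    seq = tracked(seq)
    while True:
        try:
            peek = next(seq)
        except StopIteration:
            break
        yield list(it.takewhile(sep.__ne__, it.chain((peek,), seq)))
    if last == sep:  # empty input or trailing separator
        yield []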

The it.chain part is to find out when the seq is exhausted. Note that with this approach it's easy to yield generators instead of lists if desired.

edited Jan 26 '19 at 22:04

answered Jan 25 '19 at 23:37

a_guest

AFAICT this does not produce the desired results. – Jani Jan 26 '19 at 12:30
@Jani You are right. I suppose you're referring to the case where a `SEP` is at the end of the list? It's not too difficult to account for that case, in form of a final if statement (hence no performance degradation). Please see my updated answer. – a_guest Jan 26 '19 at 18:43
the best approach IMO, I simplified it: https://stackoverflow.com/a/64804147/1161025 (although returns on first empty subsequence) – maciek Nov 12 '20 at 12:45

score 0 · Answer 6 · answered Sep 20 '22 at 19:03

If you prefer a list comprehension, then you can resort to filtering indices and slicing, using itertools.pairwise:

import itertools  # itertools.pairwise requires Python 3.10+

seq = ['a', 'SEP', 'b', 'c', 'SEP', 'SEP', 'd']
[seq[a + 1 : b]
 for (a, b) in itertools.pairwise(
     [-1] + [i for i in range(len(seq)) if seq[i] == 'SEP'] + [len(seq)])]

→

[['a'], ['b', 'c'], [], ['d']]
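
On Python versions without itertools.pairwise, the same index-and-slice idea can be sketched by zipping the list of cut points with itself shifted by one. The function name split_by_index is hypothetical; like the comprehension above, it needs a sliceable sequence rather than an iterator.

```python
def split_by_index(seq, sep):
    # Cut points: a virtual separator before the start and after the end,
    # plus the index of every real separator.
    cuts = [-1] + [i for i, item in enumerate(seq) if item == sep] + [len(seq)]
    # Each adjacent pair of cut points delimits one chunk.
    return [seq[a + 1 : b] for a, b in zip(cuts, cuts[1:])]

print(split_by_index(["a", "SEP", "b", "c", "SEP", "SEP", "d"], "SEP"))
# [['a'], ['b', 'c'], [], ['d']]
```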

score -1 · Answer 7 · answered Jan 25 '19 at 22:32

I would define the following function to solve that problem.

l = ['a', 'SEP', 'b', 'c', 'SEP', 'SEP', 'd']

def sublist_with_words(word, search_list):
    res = []
    for i in range(search_list.count(word)):
        index = search_list.index(word)
        res.append(search_list[:index])
        search_list = search_list[index+1:]
    res.append(search_list)
    return res

When I try the cases you gave:

print(sublist_with_words(word = 'SEP', search_list=l))
print(sublist_with_words(word = 'SEP', search_list=['a', 'b', 'c']))
print(sublist_with_words(word = 'SEP', search_list=['SEP']))

The output is:

[['a'], ['b', 'c'], [], ['d']]
[['a', 'b', 'c']]
[[], []]

maciek · Answer 8 · 2020-11-12T12:51:19.663

-1

@a_guest's itertools.takewhile approach, simplified:

def split(seq, sep):
    from itertools import takewhile
    iterator = iter(seq)
    while subseq := list(takewhile(lambda x: x != sep, iterator)):
        yield subseq

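Please note that it stops at the first empty subsequence, so a leading separator, a trailing separator, or two separators in a row end the output early. The := operator requires Python 3.8+.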
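
A small sketch of that early stop, with the function repeated under the hypothetical name split_until_empty so the snippet is self-contained. The inputs are made up for illustration.

```python
from itertools import takewhile

def split_until_empty(seq, sep):
    iterator = iter(seq)
    # The loop ends as soon as takewhile produces an empty list.
    while subseq := list(takewhile(lambda x: x != sep, iterator)):
        yield subseq

print(list(split_until_empty(["a", "SEP", "b", "c"], "SEP")))    # [['a'], ['b', 'c']]
print(list(split_until_empty(["a", "SEP", "SEP", "d"], "SEP")))  # [['a']]
print(list(split_until_empty(["SEP", "a"], "SEP")))              # []
```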

edited Nov 12 '20 at 12:51

answered Nov 12 '20 at 12:42

maciek

score -3 · Answer 9 · answered Jan 25 '19 at 22:26

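The following is a non-generic solution that only works on lists of non-negative ints, and only when no other number ends in the separator's digits (e.g. 13 with separator 3) and the separator is not the last element: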

import re

def split_list(nums, n):
    nums_str = str(nums)
    splits = nums_str.split(f"{n},")

    patc = re.compile(r"\d+")
    group = []
    for part in splits:
        group.append([int(v) for v in patc.findall(part)])

    return group

if __name__ == "__main__":
    l = [1, 2, 3, 4, 3, 6, 7, 3, 8, 9, 10]
    n = 3
    split_l = split_list(l, n)
    assert split_l == [[1, 2], [4], [6, 7], [8, 9, 10]]
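
Because this approach splits the list's string representation, a few inputs give surprising results. The sketch below repeats the function in condensed form so it runs on its own; the sample inputs are assumptions chosen to show the edge cases.

```python
import re

def split_list(nums, n):
    # Split the printed form of the list on "<n>," and re-parse the digits.
    parts = str(nums).split(f"{n},")
    digits = re.compile(r"\d+")
    return [[int(v) for v in digits.findall(part)] for part in parts]

print(split_list([1, 2, 3, 4], 3))  # [[1, 2], [4]]
print(split_list([1, 13, 2], 3))    # [[1, 1], [2]]    (13 is cut apart)
print(split_list([1, 2, 3], 3))     # [[1, 2, 3]]      (trailing separator missed)
```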
