Splitting a list by matching a regex to an element

Question

I have a list that has some specific elements in it. I would like to split that list into 'sublists' or different lists based on those elements. For example:

test_list = ['a and b, 123','1','2','x','y','Foo and Bar, gibberish','123','321','June','July','August','Bonnie and Clyde, foobar','today','tomorrow','yesterday']

I would like to split into sublists if an element matches 'something and something':

new_list = [['a and b, 123', '1', '2', 'x', 'y'], ['Foo and Bar, gibberish', '123', '321', 'June', 'July', 'August'], ['Bonnie and Clyde, foobar', 'today', 'tomorrow', 'yesterday']]

So far I can accomplish this if there is a fixed number of items after the specific element. For example:

import re
element_regex = re.compile(r'[A-Z a-z]+ and [A-Z a-z]+')
new_list = [test_list[i:(i+4)] for i, x in enumerate(test_list) if element_regex.match(x)]

This is almost there, but there are not always exactly three elements following the element of interest. Is there a better way than just looping over every single item?

It looks as though you want to split on `'Foo and Bar, gibberish'` but your regex will not match that (it will fail on the comma after Bar). Are you missing single quotation marks anywhere? `'Bonnie and Clyde, foobar'` has the same issue. As for a better method, unless you cannot ever have two matches in a row or there exists some other limitation, you really need to check every entry as it's the potential start of a new list. — adamdc78, Nov 18 '14 at 20:22
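
A quick way to check the concern raised in the comment above is to run the question's regex against one of the header strings. The sketch below assumes only the standard `re` module; the variable names `sample`, `prefix_match` and `whole_match` are made up for illustration. `match()` anchors only at the start of the string, so a prefix such as 'Foo and Bar' is enough for it to succeed, whereas `fullmatch()` would require the whole element to fit the pattern.

```python
import re

# The pattern from the question: letters/spaces, " and ", letters/spaces.
element_regex = re.compile(r'[A-Z a-z]+ and [A-Z a-z]+')
sample = 'Foo and Bar, gibberish'

# match() only requires the pattern to fit at the start of the string.
prefix_match = element_regex.match(sample)
# fullmatch() requires the pattern to consume the entire string.
whole_match = element_regex.fullmatch(sample)

print(prefix_match.group(0))  # Foo and Bar
print(whole_match)            # None
```

So with `match()`, as used in the question, the comma does not prevent these elements from being detected; it would only matter if the whole element had to match.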

Phillip · Accepted Answer · 2014-11-19T09:18:24.530

If you want a one-liner,

from functools import reduce  # reduce is no longer a builtin in Python 3
new_list = reduce(lambda a, b: a[:-1] + [ a[-1] + [ b ] ] if not element_regex.match(b) or not a[0] else a + [ [ b ] ], test_list, [ [] ])

will do. The Pythonic way, however, would be to use a more verbose variant.

I did some speed measurements on a 4-core i7 @ 2.1 GHz. The timeit module ran the one-liner above 1,000,000 times and needed 11.38s for that. Using groupby from the itertools module (Kasra's variant from the other answer) requires 9.92s. The fastest variant is the verbose version I suggested, taking only 5.66s:

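new_list = []
for i in test_list:
    # Start a new sublist at every match, and also for the very first item,
    # so that no empty sublist is left at the front of the result.
    if element_regex.match(i) or not new_list:
        new_list.append([])
    new_list[-1].append(i)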
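
For anyone who wants to repeat the comparison, here is a rough sketch of how the three variants could be timed with `timeit`. The function names (`split_reduce`, `split_groupby`, `split_loop`, `time_variants`) are invented for this example, the groupby variant is adapted to use the regex as its key, and the loop starts from an empty list. The run count is kept small; absolute numbers will differ from machine to machine, so no timings are claimed here.

```python
import re
import timeit
from functools import reduce
from itertools import groupby

test_list = ['a and b, 123', '1', '2', 'x', 'y',
             'Foo and Bar, gibberish', '123', '321', 'June', 'July', 'August',
             'Bonnie and Clyde, foobar', 'today', 'tomorrow', 'yesterday']
element_regex = re.compile(r'[A-Z a-z]+ and [A-Z a-z]+')


def split_reduce(items):
    # The one-liner from the answer, wrapped in a function.
    return reduce(
        lambda a, b: a[:-1] + [a[-1] + [b]]
        if not element_regex.match(b) or not a[0]
        else a + [[b]],
        items,
        [[]],
    )


def split_groupby(items):
    # groupby variant with the regex as key; bool() makes the keys comparable.
    # Assumes header groups and non-header groups strictly alternate.
    groups = [list(g) for _, g in
              groupby(items, lambda s: bool(element_regex.match(s)))]
    return [sum(groups[i:i + 2], []) for i in range(0, len(groups), 2)]


def split_loop(items):
    # Plain loop: open a new sublist at each match (and for the first item).
    result = []
    for item in items:
        if element_regex.match(item) or not result:
            result.append([])
        result[-1].append(item)
    return result


def time_variants(number=10000):
    # Returns a dict mapping function name to total seconds for `number` runs.
    return {fn.__name__: timeit.timeit(lambda: fn(test_list), number=number)
            for fn in (split_reduce, split_groupby, split_loop)}


if __name__ == "__main__":
    for name, seconds in time_variants().items():
        print("%s: %.3fs" % (name, seconds))
```

All three functions return the same nested list for `test_list`, so the timings compare like with like.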

Albeit not very pythonic, this is what I was looking for. — nfmcclure, Nov 19 '14 at 01:00

score 2 · Answer 2 · answered Nov 18 '14 at 20:56

You don't need a regex for that; just use itertools.groupby:

>>> from itertools import groupby
>>> from operator import add
>>> g_list=[list(g) for k,g in groupby(test_list , lambda i : 'and' in i)]
>>> [add(*g_list[i:i+2]) for i in range(0,len(g_list),2)]
[['a and b, 123', '1', '2', 'x', 'y'], ['Foo and Bar, gibberish', '123', '321', 'June', 'July', 'August'], ['Bonnie and Clyde, foobar', 'today', 'tomorrow', 'yesterday']]

First we group the list with the key function `lambda i: 'and' in i`, which flags the elements that contain "and". That gives us:

>>> g_list
[['a and b, 123'], ['1', '2', 'x', 'y'], ['Foo and Bar, gibberish'], ['123', '321', 'June', 'July', 'August'], ['Bonnie and Clyde, foobar'], ['today', 'tomorrow', 'yesterday']]

Then we concatenate each consecutive pair of lists, using the add operator inside a list comprehension.
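
Note that concatenating in pairs relies on the groups strictly alternating between one header and a run of ordinary items, starting with a header. As a sketch of how the same groupby idea could cope with two headers in a row or with items before the first header, consider the following. The function name `split_on_headers` and the `is_header` parameter are hypothetical, not part of itertools.

```python
from itertools import groupby


def split_on_headers(items, is_header):
    """Split items into sublists, each starting at an item flagged by is_header."""
    result = []
    for key, group in groupby(items, key=is_header):
        group = list(group)
        if key:
            # Consecutive headers arrive in one group; each starts its own sublist.
            result.extend([header] for header in group)
        elif result:
            # Ordinary items belong to the most recent header's sublist.
            result[-1].extend(group)
        else:
            # Items that appear before any header get a sublist of their own.
            result.append(group)
    return result


test_list = ['a and b, 123', '1', '2', 'x', 'y',
             'Foo and Bar, gibberish', '123', '321', 'June', 'July', 'August',
             'Bonnie and Clyde, foobar', 'today', 'tomorrow', 'yesterday']
new_list = split_on_headers(test_list, lambda s: ' and ' in s)
```

Testing for ' and ' with surrounding spaces is an assumption made here to avoid flagging words that merely contain the letters "and"; any other predicate, such as the regex from the question, can be passed instead.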

answered Nov 18 '14 at 20:56

Mazdak

Thanks! I went with Phillip's answer for the one-liner. But you've convinced me now to read more into itertools. It seems itertools is the answer to a majority of my Python questions. – nfmcclure Nov 19 '14 at 01:00
Yes, itertools is a legend among Python's modules! But don't be so sure that the one-liner is faster! ;) – Mazdak Nov 19 '14 at 05:44
The two variants don't differ much when it comes to speed. On my PC, timeit needs 11.38s for 1M runs using reduce(), and 9.92s for the itertools variant (if also using a regexp). The reason why I'd prefer your variant is better readability. And I think I'd still prefer a `for` loop over both. I'll add something on that to my answer. – Phillip Nov 19 '14 at 09:10
@Phillip When a problem is about processing lists, I first think of `itertools`! ;) – Mazdak Nov 19 '14 at 12:12
