Split a list based on a list of unique values

Question

Let's say I have a list called split_on_these, which I'd like to use to split another list, text. I first pad the entries of split_on_these with spaces so that occurrences of them inside other words are not removed:

split_on_these = ['iv', 'x', 'v']
text = ["random iv text x hat v", "cat", "dog iv", "random cat x"]
padding = [" " + i + " " for i in split_on_these]

I'm trying to create new_text by splitting text on every item in padding, like so:

["random", "text", "hat", "cat", "dog", "random cat"]

I tried replacing every occurrence of a padding entry in text with some character like ~ and then splitting on that character, but the issue is that when you iterate over the entries in text, sometimes you get word chunks and other times individual letters.

Please note that entire chunks preceding a delimiter should be preserved (e.g. random cat).

It's not clear (to me) how to use the split method to split on a list of separators, rather than a single separator. — Parseltongue, Sep 10 '19 at 21:53
Possible duplicate of [Split Strings into words with multiple word boundary delimiters](https://stackoverflow.com/questions/1059559/split-strings-into-words-with-multiple-word-boundary-delimiters) — Joshua Nixon, Sep 10 '19 at 21:55
Please clarify: if you have a string "random cat x dog", do you want "random cat" as a single string, or as two separate words? — Prune, Sep 10 '19 at 23:19
@Prune Yes. I just edited the question to make this more clear. Apologies, I did not consider this problem in advance, but it is an actual issue. — Parseltongue, Sep 11 '19 at 13:12
Okay; in that case, your better solution will lie with regex. — Prune, Sep 11 '19 at 14:35

Prune · Answer 1 · 2019-09-10T22:50:01.270

2

You've already done the "heavy splitting" by padding the divisional words. What you have left is a split-and-filter sequence:

text = ["random iv text x hat v", "cat", "dog iv"]
[word for sent in text for word in sent.split() if word not in split_on_these]

This splits your padded sentences into individual words and filters out the unwanted words. Result:

['random', 'text', 'hat', 'cat', 'dog']
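If multi-word chunks such as "random cat" need to stay together, one possible variation on the same split-and-filter idea is to group consecutive non-delimiter words and join each group back up. This is only a sketch: the use of itertools.groupby and the extra "random cat x dog" entry are assumptions added for illustration, not part of the original answer.

```python
from itertools import groupby

split_on_these = {'iv', 'x', 'v'}
# The last entry is a made-up example with a multi-word chunk.
text = ["random iv text x hat v", "cat", "dog iv", "random cat x dog"]

new_text = []
for sent in text:
    # groupby clusters consecutive words by whether each one is a delimiter.
    for is_delim, words in groupby(sent.split(), key=lambda w: w in split_on_these):
        if not is_delim:
            # Rejoin the run of ordinary words into a single chunk.
            new_text.append(" ".join(words))

print(new_text)
# ['random', 'text', 'hat', 'cat', 'dog', 'random cat', 'dog']
```

Note that because the words are rejoined with single spaces, any runs of whitespace inside a chunk are collapsed.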

edited Sep 10 '19 at 22:50

answered Sep 10 '19 at 21:56


This won't work for multi-word fragments: processing `['random cat x dog']` will give `['random', 'cat', 'dog']` instead of `['random cat', 'dog']`. – Cireo Sep 10 '19 at 22:34
Ah, good point. Do you have a recommendation for multiword fragments? – Parseltongue Sep 10 '19 at 23:14

score 1 · Answer 2 · answered Sep 10 '19 at 21:54

You can use Python's re library. It has a more powerful split function that lets you split on a regex rather than a single character.

You could create a regex that would match any one of your padding strings, as below:

import re
[re.split("iv|x|v", s) for s in text]  # text is a list, so split each string in turn

The above regex isn't perfect: as written it also matches inside words (for example, the "x" in "text"), so you'd also have to consider when and whether to match the spaces around each delimiter.
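One way to handle the surrounding spaces is sketched below. It builds the alternation from split_on_these and only matches a delimiter when it stands alone as a whole token, including at the end of a string such as "dog iv". The exact pattern and the names alternation and pattern are assumptions for illustration, not something given in the question.

```python
import re

split_on_these = ['iv', 'x', 'v']
text = ["random iv text x hat v", "cat", "dog iv", "random cat x"]

# Escape each delimiter in case it contains regex metacharacters.
alternation = "|".join(re.escape(d) for d in split_on_these)

# A delimiter must be preceded by whitespace (or the start of the string)
# and followed by whitespace (or the end of the string).
pattern = re.compile(rf"(?:^|\s+)(?:{alternation})(?=\s|$)")

new_text = [
    chunk.strip()
    for sentence in text
    for chunk in pattern.split(sentence)
    if chunk.strip()  # drop empty pieces left at the edges
]

print(new_text)
# ['random', 'text', 'hat', 'cat', 'dog', 'random cat']
```

The groups are non-capturing so that re.split does not include the delimiters in its result, and multi-word chunks like "random cat" are preserved because only the delimiters themselves are consumed.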
