
Let's say I have a list called split_on_these which I'd like to use to split the strings in another list, text. I first pad the entries of split_on_these with spaces so that I don't remove naturally occurring instances of them inside other words:

split_on_these = ['iv', 'x', 'v']
text = ["random iv text x hat v", "cat", "dog iv", "random cat x"]
padding = [" " + i + " " for i in split_on_these]

I'm trying to create new_text that splits on all the items contained in padding like so:

["random", "text", "hat", "cat", "dog", "random cat"]

I tried replacing all the entries of text that are contained in padding with some character like ~ and then splitting on that character, but the issue is that when you iterate over the entries in text, sometimes it will be word chunks, and other times it will be individual letters.

Please note that entire chunks preceding a delimiter should be preserved (e.g. random cat).

Parseltongue

2 Answers


You've already done the "heavy splitting" by padding the divisional words. What you have left is a split-and-filter sequence:

text = ["random iv text x hat v", "cat", "dog iv"]
[word for sent in text for word in sent.split() if word not in split_on_these]

This splits your padded sentences into individual words and filters out the unwanted words. Result:

['random', 'text', 'hat', 'cat', 'dog']
Prune
  • This won't work for multi-word fragments: processing `['random cat x dog']` will give `['random', 'cat', 'dog']` instead of `['random cat', 'dog']`. – Cireo Sep 10 '19 at 22:34
  • Ah, good point. Do you have a recommendation for multiword fragments? – Parseltongue Sep 10 '19 at 23:14

You can use Python's re library. It has a more powerful split function that lets you split on a regex rather than a single character.

You could create a regex that would match any one of your padding strings, as below:

import re
[re.split("iv|x|v", sentence) for sentence in text]  # re.split takes a string, so apply it to each entry of text

The above regex isn't perfect - as written it also splits inside ordinary words (the x in "text", for example), so you'd also have to consider when/whether to match spaces around each padding sequence.
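One possible way to tighten the regex above, offered as a sketch rather than a definitive answer: wrap the alternation in `\b` word boundaries so each delimiter only matches as a whole word, then strip the leftover spaces and drop empty pieces. The names `pattern` and `new_text` are illustrative, and the sketch assumes the delimiters are plain words separated from the surrounding text by whitespace (punctuation next to a delimiter, as in "x-ray", would also count as a boundary).

```python
import re

split_on_these = ['iv', 'x', 'v']
text = ["random iv text x hat v", "cat", "dog iv", "random cat x"]

# \b anchors each delimiter at word boundaries, so the "x" inside "text"
# is left alone. re.escape guards against regex metacharacters in a delimiter.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, split_on_these)) + r")\b")

new_text = [
    piece.strip()                         # remove the spaces left around each delimiter
    for sentence in text
    for piece in pattern.split(sentence)  # split one string at a time
    if piece.strip()                      # skip empty pieces, e.g. after a trailing delimiter
]

print(new_text)
# ['random', 'text', 'hat', 'cat', 'dog', 'random cat']
```

Because the split happens on the delimiters rather than on whitespace, multi-word fragments such as "random cat" stay in one piece.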

Peritract