Regex to split on punctuation and spaces, except within brackets

Question

I am trying to tokenize a string, where all punctuation becomes its own token. However, I need to not split up text within brackets.

Example sentence: I want to keep [InsideBrackets], as well as [Inside Brackets], together, while removing other punctuation.

After a while I have come up with this:

re.findall(r"\[?\w+\]?|[^\w\s]",str_here)

Which produces:

['I', 'want', 'to', 'keep', '[InsideBrackets]', ',', 'as', 'well', 'as',
'[Inside', 'Brackets]', ',', 'together', ',', 'while', 'removing', 'other', 'punctuation', '.']

But I haven't figured out how to not split on a space when within brackets. I found several ways to do this but they all broke the punctuation splitting. What change do I need to make?

I gave an example sentence and expected result. Tried the one he marked as duplicate and that doesn't work on this sentence, re.findall just returns a bunch of empty strings. — J.Doe, Jun 30 '19 at 19:56
First of all maybe go on to use `re.split` instead of `findall`. It might be easier to make a pattern for what to take out than what to keep in — Tomerikoo, Jun 30 '19 at 19:59
The top answer in the linked question does keep (Inside Brackets) together, but it removes all punctuation. Is there a way to keep punctuation? — J.Doe, Jun 30 '19 at 20:02
Thank you, that works! Is there a way to combine that with something so I also match all other words and punctuation, such as the code bit I posted? (while keeping [Inside Brackets] together) — J.Doe, Jun 30 '19 at 20:06
Use `re.findall(r"\[[^][]*]|\w+|[^\w\s]",str_here)`, see https://ideone.com/Yx22iX — Wiktor Stribiżew, Jun 30 '19 at 21:53
Nice! You can try this: `(?<=[^\[])\b([a-zA-Z]+\b)(?=[^\]])` for words not within brackets. The first word is still missing... Feel free to upvote my comment, will be really appreciated! — sashaboulouds, Jun 30 '19 at 21:54
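
For anyone landing here, a minimal sketch of the `findall` pattern from Wiktor Stribiżew's comment, run against the example sentence from the question. The pattern is the same one, only written with the inner brackets escaped for readability; the names `TOKEN_RE` and `tokenize` are made up for illustration. It assumes brackets are never nested.

```python
import re

# Alternatives are tried left to right, so a complete [ ... ] group
# is matched before the word and punctuation alternatives get a chance.
TOKEN_RE = re.compile(
    r"\[[^\]\[]*\]"  # a whole bracketed group (no nested brackets)
    r"|\w+"          # or a run of word characters
    r"|[^\w\s]"      # or a single non-word, non-space character
)

def tokenize(text):
    # No capture groups in the pattern, so findall returns full matches.
    return TOKEN_RE.findall(text)

sentence = ("I want to keep [InsideBrackets], as well as [Inside Brackets], "
            "together, while removing other punctuation.")
print(tokenize(sentence))
```

With this, `[Inside Brackets]` comes back as a single token, while the commas and the final period are still separate tokens. An opening bracket that is never closed will not match the first alternative, so it falls through and is returned as a lone punctuation token.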
