I am trying to tokenize a string so that every punctuation mark becomes its own token. However, text inside square brackets must stay together as a single token.
Example sentence: I want to keep [InsideBrackets], as well as [Inside Brackets], together, while removing other punctuation.
After a while I have come up with this:
re.findall(r"\[?\w+\]?|[^\w\s]", str_here)
Which produces:
['I', 'want', 'to', 'keep', '[InsideBrackets]', ',', 'as', 'well', 'as',
'[Inside', 'Brackets]', ',', 'together', ',', 'while', 'removing', 'other', 'punctuation', '.']
However, I haven't figured out how to avoid splitting on a space inside brackets: '[Inside Brackets]' should come out as a single token. I found several ways to do that, but they all broke the punctuation splitting. What change do I need to make to the pattern?
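One direction that might work, offered only as a sketch: since regex alternatives are tried left to right, a pattern for a whole bracketed group could be placed first so that it consumes everything up to the closing bracket, spaces included. This assumes brackets are never nested and are always closed; the `tokenize` helper and `TOKEN_RE` names are made up for illustration.

```python
import re

# Alternatives are tried left to right at each position:
#   1. \[[^\]]*\]  a whole bracketed group, spaces included
#   2. \w+         a run of word characters
#   3. [^\w\s]     any single punctuation character
# Assumption: brackets are not nested and are always closed.
TOKEN_RE = re.compile(r"\[[^\]]*\]|\w+|[^\w\s]")


def tokenize(text):
    """Hypothetical helper: return bracketed groups, words, and punctuation marks."""
    return TOKEN_RE.findall(text)


sentence = ("I want to keep [InsideBrackets], as well as [Inside Brackets], "
            "together, while removing other punctuation.")
tokens = tokenize(sentence)
print(tokens)
```

With this ordering, an unmatched `[` has no closing bracket to pair with, so it falls through to the punctuation alternative and becomes its own token.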