
I am trying to tokenize a string so that every punctuation mark becomes its own token. However, text inside square brackets must not be split up.

Example sentence: I want to keep [InsideBrackets], as well as [Inside Brackets], together, while removing other punctuation.

After a while I have come up with this:

re.findall(r"\[?\w+\]?|[^\w\s]",str_here)

Which produces:

['I', 'want', 'to', 'keep', '[InsideBrackets]', ',', 'as', 'well', 'as',
'[Inside', 'Brackets]', ',', 'together', ',', 'while', 'removing', 'other', 'punctuation', '.']

But I haven't figured out how to avoid splitting on a space inside brackets. I found several ways to do that, but they all broke the punctuation splitting. What change do I need to make?

J.Doe
  • Can you give an example string? – sashaboulouds Jun 30 '19 at 19:36
  • I gave an example sentence and expected result. Tried the one he marked as duplicate and that doesn't work on this sentence, re.findall just returns a bunch of empty strings. – J.Doe Jun 30 '19 at 19:56
  • First of all maybe go on to use `re.split` instead of `findall`. It might be easier to make a pattern for what to take out than what to keep in – Tomerikoo Jun 30 '19 at 19:59
  • The top answer in the linked question does keep (Inside Brackets) together, but it removes all punctuation. Is there a way to keep punctuation? – J.Doe Jun 30 '19 at 20:02
  • Try this: `re.findall(r"(\[[^\]]+\])",str_here)` – sashaboulouds Jun 30 '19 at 20:05
  • Thank you, that works! Is there a way to combine that with something so I also match all other words and punctuation, such as the code bit I posted? (while keeping [Inside Brackets] together) – J.Doe Jun 30 '19 at 20:06
  • Use `re.findall(r"\[[^][]*]|\w+|[^\w\s]",str_here)`, see https://ideone.com/Yx22iX – Wiktor Stribiżew Jun 30 '19 at 21:53
  • Nice! You can try this: `(?<=[^\[])\b([a-zA-Z]+\b)(?=[^\]])` for words not within brackets. The first word is still missing... Feel free to upvote my comment, will be really appreciated! – sashaboulouds Jun 30 '19 at 21:54
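
For reference, here is a minimal sketch of the alternation suggested in Wiktor Stribiżew's comment, applied to the example sentence from the question. The names `TOKEN_RE` and `tokenize` are made up for illustration, and the bracket class is written as `[^\[\]]` instead of `[^][]`, which should be equivalent but is a bit easier to read. It assumes brackets are never nested.

```python
import re

# Alternation order matters: the bracketed alternative is tried first,
# so "[Inside Brackets]" is consumed whole before \w+ can match "Inside".
TOKEN_RE = re.compile(
    r"\[[^\[\]]*\]"   # a whole [ ... ] group containing no other brackets
    r"|\w+"           # a run of word characters
    r"|[^\w\s]"       # any single non-word, non-space character
)

def tokenize(text):
    """Return bracketed groups, words, and punctuation marks as tokens."""
    return TOKEN_RE.findall(text)

sentence = ("I want to keep [InsideBrackets], as well as [Inside Brackets], "
            "together, while removing other punctuation.")
tokens = tokenize(sentence)
print(tokens)
```

With this ordering, `[Inside Brackets]` should come back as one token while the commas and the final period stay separate. A `[` with no matching `]` fails the first alternative and falls through to the last one, so it is returned as a plain punctuation token.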
