
I was wondering whether it is possible to merge two tokens when one specific token follows another. For example:

This is a test token and it is a test to see if it works.

For the sentence above, let's say we get the tokens as:

token = ['This', 'is', 'a', 'test', 'token', 'and', 'it', 'is', 'a', 'test', 'to', 'see', ...]

What I want is this: whenever there is a token called `token`, it should be merged with the token before it, so that `test token` becomes a single token.

I have looked around and tried everything I could find, but I couldn't get it to work.

jonrsharpe
Sam

1 Answer


I think you mean this:

>>> import re
>>> s = "This is a test token and it is a test to see if it works."
>>> re.findall(r'\btest token\b|\S+', s)
['This', 'is', 'a', 'test token', 'and', 'it', 'is', 'a', 'test', 'to', 'see', 'if', 'it', 'works.']
Avinash Raj
  • or `re.findall(r'(?<!\S)test token(?!\S)|\S+', s)` – Avinash Raj Oct 09 '15 at 14:54
  • This regex works if we know the word `test` comes before every `token`, but we don't know in advance which word comes before it, right? – Sam Oct 09 '15 at 14:54
  • @Sam But is this the right approach? You want the word immediately preceding the target to be part of the same token? – jonrsharpe Oct 09 '15 at 14:57
  • In this case, you might try the same approach with regex `r'\b(\w+ token|\w+)\b'` – tobias_k Oct 09 '15 at 14:58
  • I think whether to use `\w` or `\S` depends on whether OP wants to strip or include punctuation, which is not clear from the question, but I would strip it. – tobias_k Oct 09 '15 at 15:01
  • Thank you AvinashRaj and @tobias_k for your help – Sam Oct 09 '15 at 15:47
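
Putting the answer and the comments together, here is a minimal sketch of the generalised version, where the word before `token` is not known in advance. The function name `tokenize_merging` and its parameters are made up for illustration. It also assumes the two words are separated by exactly one space. The `keep_punctuation` switch picks between the `\S`-style pattern (punctuation stays attached) and the `\w`-style pattern (punctuation is dropped).

```python
import re


def tokenize_merging(text, target="token", keep_punctuation=False):
    """Tokenize `text`, merging `target` with the word right before it.

    Hypothetical helper: assumes a single space between the two words.
    """
    escaped = re.escape(target)  # treat the target as literal text
    if keep_punctuation:
        # Whitespace-delimited tokens; punctuation stays attached.
        pattern = rf"(?<!\S)\S+ {escaped}(?!\S)|\S+"
    else:
        # Word-character tokens; punctuation is stripped.
        pattern = rf"\b\w+ {escaped}\b|\w+"
    # The merged alternative comes first so it wins over a plain word.
    return re.findall(pattern, text)


s = "This is a test token and it is a test to see if it works."
print(tokenize_merging(s))
print(tokenize_merging(s, keep_punctuation=True))
```

The order of the alternatives matters: the two-word pattern has to be tried before the single-word fallback, otherwise `test` would be consumed on its own and never merged.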