Is there a way to combine two tokens from a list of tokens?

Question

I was wondering whether two tokens can be merged into one when a specific token follows another. For example:

This is a test token and it is a test to see if it works.

Say that tokenizing the sentence above gives:

token = 'This', 'is', 'a', 'test', 'token', 'and', 'it', 'is', 'a', 'test', 'to', 'see', ....

What I want is this: if there is a token called 'token', I want 'test token' to be a single token.

I have looked around and tried everything, but I couldn't get it to work.

@vaultah, I was trying to learn regex as a tokenizer. – Sam Oct 09 '15 at 14:50

score 2 · Accepted Answer · answered Oct 09 '15 at 14:53


I think you mean this:

>>> import re
>>> s = "This is a test token and it is a test to see if it works."
>>> re.findall(r'\btest token\b|\S+', s)
['This', 'is', 'a', 'test token', 'and', 'it', 'is', 'a', 'test', 'to', 'see', 'if', 'it', 'works.']
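The alternation order matters here: the regex engine tries `\btest token\b` first at each position and only falls back to `\S+` when the phrase does not match. As a rough sketch, the same idea can be wrapped in a function that takes the phrase as an argument. The name `tokenize_with_phrase` is hypothetical, and `re.escape` is added on the assumption that the phrase may contain regex metacharacters.

```python
import re

def tokenize_with_phrase(text, phrase):
    """Hypothetical helper: keep `phrase` as one token, split the rest on whitespace."""
    # The phrase alternative comes first so it wins over the generic \S+.
    # re.escape guards against metacharacters in the phrase (an assumption,
    # not part of the original answer).
    pattern = r'\b' + re.escape(phrase) + r'\b|\S+'
    return re.findall(pattern, text)

s = "This is a test token and it is a test to see if it works."
print(tokenize_with_phrase(s, "test token"))
```

Because the fallback is `\S+`, trailing punctuation stays attached to its word, as with 'works.' in the output above.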


Avinash Raj


or `re.findall(r'(?<!\S)test token(?!\S)|\S+', s)` – Avinash Raj Oct 09 '15 at 14:54
This regex works if we know the word "test" comes in front of every "token", but we don't know whether the word in front will be "test", right? – Sam Oct 09 '15 at 14:54
@Sam but is this the appropriate approach? Do you want the word immediately preceding the target to be part of the same token? – jonrsharpe Oct 09 '15 at 14:57
In this case, you might try the same approach with regex `r'\b(\w+ token|\w+)\b'` – tobias_k Oct 09 '15 at 14:58
I think whether to use `\w` or `\S` depends on whether OP wants to strip or include punctuation, which is not clear from the question, but I would strip it. – tobias_k Oct 09 '15 at 15:01
Thank you AvinashRaj and @tobias_k for your help – Sam Oct 09 '15 at 15:47
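For the case raised in the comments, where the word in front of 'token' is not known in advance, here is a minimal sketch of two options. The first applies the `\w+ token` pattern suggested above. The second is a different technique that does not appear in the thread: it merges tokens in a list that has already been split. The function name `merge_with_previous` is hypothetical.

```python
import re

s = "This is a test token and it is a test to see if it works."

# Pattern from the comment above: any word followed by " token" becomes one
# token. findall returns the single capturing group, which spans the whole
# match. Using \w+ instead of \S+ drops punctuation such as the final ".".
regex_tokens = re.findall(r'\b(\w+ token|\w+)\b', s)
print(regex_tokens)

def merge_with_previous(tokens, target):
    """Hypothetical helper: glue each `target` token onto the token before it."""
    merged = []
    for tok in tokens:
        if tok == target and merged:
            # Join the target onto whatever token preceded it.
            merged[-1] = merged[-1] + ' ' + tok
        else:
            merged.append(tok)
    return merged

tokens = ['This', 'is', 'a', 'test', 'token', 'and', 'it', 'is', 'a', 'test']
print(merge_with_previous(tokens, 'token'))
```

The list-based version leaves the original tokenizer untouched, which may suit cases where the tokens come from somewhere other than a regex. Note that consecutive occurrences of the target are all folded into the same merged token.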
