Given a list of tokens, I want to replace every occurrence of those tokens in a tokenized text with whitespace.

For example, given ['a', 'is'] and 'this is a test', the result should be 'this test'.

I tried the code from "How can I do multiple substitutions using regex in python?", but the output is 'th test'.

Also, the list is long (about 1k tokens) and the text file is large, so speed is important.

shmulvad
  • You are not actually replacing with a space; in your sample output you have deleted the tokens, not replaced them with `space` – Zain Arshad Jun 10 '20 at 15:49

1 Answer

This should solve your problem and be reasonably fast. The token list is converted to a set so that each lookup takes O(1) time on average:

tokens = ['a', 'is']
tokenized_text = 'this is a test'

token_set = set(tokens)  # build the set once, not once per word
val = ' '.join(word for word in tokenized_text.split(' ')
               if word not in token_set)
print(val)

Prints

this test
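
As for why the regex attempt gave 'th test': a plain alternation such as `a|is` also matches the 'is' inside 'this'. If you would rather stay with `re`, the following is a minimal sketch that only matches whole, whitespace-delimited tokens. The function name `remove_tokens` is made up for illustration, and it assumes tokens are separated by whitespace and that collapsing runs of whitespace in the result is acceptable. I have not measured whether it is faster than the set-based version above.

```python
import re


def remove_tokens(text, tokens):
    """Remove whole whitespace-delimited tokens from text (hypothetical helper)."""
    if not tokens:
        # Nothing to remove; only normalize whitespace.
        return ' '.join(text.split())
    # re.escape guards against tokens that contain regex metacharacters.
    alternation = '|'.join(re.escape(token) for token in tokens)
    # The lookarounds require whitespace (or the string boundary) on both
    # sides, so the 'is' inside 'this' is left alone.
    pattern = re.compile(r'(?<!\S)(?:' + alternation + r')(?!\S)')
    stripped = pattern.sub('', text)
    # Collapse the leftover runs of whitespace into single spaces.
    return ' '.join(stripped.split())


print(remove_tokens('this is a test', ['a', 'is']))  # this test
```

For a large file, the compiled pattern (or the set from the first version) can be built once and reused for every line.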
shmulvad