Python novice here.

I have a list of documents, and another list of search terms. I would now like to iterate over each document, and replace all occurrences of any of the search terms with something like <placeholder>. It should, however, only match full words, so text.replace probably does not work?

So, something like this:

document_list =  ['I like apples', 'I like bananas', 'I like apples and bananas and pineapples', 'I like oranges, but not blood oranges.']
search_list = ['apples', 'bananas', 'blood oranges']

Out: ['I like <placeholder>', 'I like <placeholder>', 'I like <placeholder> and <placeholder> and pineapples', 'I like oranges, but not <placeholder>.']

Right now, I have something like

import re
for i, document in enumerate(document_list):
    for term in search_list:
        document = re.sub(r'\b{}\b'.format(term), '<placeholder>', document)
    document_list[i] = document  # store the result back in the list

This seems to work, but it is really (and I mean really) slow. If I were to run this on my full dataset of ~10k documents, with a search_list of probably ~5k terms, it would take several days to finish. Is there a better way to approach this problem and make it faster?

Thanks a lot in advance!

Edit1: Maybe it's worth mentioning that the terms in search_list can also consist of multiple words. I edited the example accordingly.

Edit2: Thanks for pointing to the other thread; I had not found that one before. Sorry about that. As mentioned below, I'd still be curious to hear about other, non-regex solutions, just to learn about them. The actual problem has been solved through the other thread, though. =)

1 Answer

This is one possibility:

import re

document_list =  ['I like apples', 'I like bananas', 'I like apples and bananas and pineapples']
search_list = ['apples', 'bananas']

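# Build one alternation of all terms, so each document is scanned only once.
# re.escape keeps terms containing regex metacharacters from breaking the pattern.
search_re = re.compile(r'\b(' + '|'.join(map(re.escape, search_list)) + r')\b')
replacement = r'<placeholder>'
document_replaced = [search_re.sub(replacement, doc) for doc in document_list]
print(*document_replaced, sep='\n')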

Output:

I like <placeholder>
I like <placeholder>
I like <placeholder> and <placeholder> and pineapples
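
Since the question also asks about non-regex approaches, here is a minimal sketch of one, not a tuned implementation. It splits each document into alternating word and non-word chunks, then looks up each chunk in a dictionary keyed by the first word of every search term, so the cost per document does not grow with the number of terms. The helper names (is_word_char, tokenize, build_index, replace_terms) are made up for this illustration. It assumes that every term starts and ends with a word character and that multi-word terms are separated in the documents exactly as in the term (for example, a single space). Its notion of a "word character" (str.isalnum plus underscore) may differ slightly from the regex \w.

```python
from itertools import groupby


def is_word_char(ch):
    # Rough stand-in for the regex \w class (an assumption of this sketch).
    return ch.isalnum() or ch == '_'


def tokenize(text):
    # Split text into alternating runs of word and non-word characters.
    return [''.join(group) for _, group in groupby(text, key=is_word_char)]


def build_index(terms):
    # Map the first token of each term to the token tuples starting with it.
    index = {}
    for term in terms:
        tokens = tuple(tokenize(term))
        if tokens:
            index.setdefault(tokens[0], []).append(tokens)
    # Try longer terms first so 'blood oranges' wins over a shorter 'blood'.
    for candidates in index.values():
        candidates.sort(key=len, reverse=True)
    return index


def replace_terms(text, index, placeholder='<placeholder>'):
    tokens = tokenize(text)
    out = []
    i = 0
    while i < len(tokens):
        for candidate in index.get(tokens[i], ()):
            if tuple(tokens[i:i + len(candidate)]) == candidate:
                out.append(placeholder)
                i += len(candidate)
                break
        else:
            # No term starts at this token: copy it through unchanged.
            out.append(tokens[i])
            i += 1
    return ''.join(out)


document_list = ['I like apples', 'I like bananas',
                 'I like apples and bananas and pineapples',
                 'I like oranges, but not blood oranges.']
search_list = ['apples', 'bananas', 'blood oranges']

index = build_index(search_list)
print(*(replace_terms(doc, index) for doc in document_list), sep='\n')
```

Because matching happens on whole tokens, 'pineapples' is left alone without any explicit word-boundary check, and the multi-word term 'blood oranges' is handled by comparing several consecutive tokens at once.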
jdehesa