Python novice here.
I have a list of documents, and another list of search terms. I would now like to iterate over each document, and replace all occurrences of any of the search terms with something like <placeholder>. It should, however, only match full words, so text.replace probably does not work?
So, something like this:
document_list = ['I like apples', 'I like bananas', 'I like apples and bananas and pineapples', 'I like oranges, but not blood oranges.']
search_list = ['apples', 'bananas', 'blood oranges']
Out: ['I like <placeholder>', 'I like <placeholder>', 'I like <placeholder> and <placeholder> and pineapples', 'I like oranges, but not <placeholder>.']
Right now, I have something like
import re

for document in document_list:
    for term in search_list:
        document = re.sub(r'\b{}\b'.format(term), '<placeholder>', document)
This seems to work, but is really (and I mean really) slow. If I were to run this on my full dataset of ~10k documents, with a search_list of probably ~5k terms, it would take several days to finish. Is there any better way to approach this problem and make it faster?
Thanks a lot in advance!
Edit1: Maybe it's worth mentioning that the terms in search_list can also consist of multiple words. Edited the example accordingly.
Edit2: Thanks for pointing to the other thread, I had not found that one before. Sorry about that. As mentioned below, I'd still be curious to hear other, non-regex solutions just to learn about them. The actual problem has been solved through the other thread, though. =)
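For anyone landing here later, this is a minimal sketch of the single-pattern idea, using the example data from above. It is my own assumption of how that could look, not necessarily what the other thread proposes: all terms are escaped and joined into one alternation, so each document is scanned once instead of once per term. The names terms, pattern and result are made up for this illustration.

```python
import re

document_list = ['I like apples', 'I like bananas',
                 'I like apples and bananas and pineapples',
                 'I like oranges, but not blood oranges.']
search_list = ['apples', 'bananas', 'blood oranges']

# Sort longest first so a multi-word term is tried before any shorter
# term that it happens to start with (assumption: this is the wanted priority).
terms = sorted(search_list, key=len, reverse=True)

# re.escape protects terms containing regex metacharacters;
# \b on both sides restricts matches to whole words.
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(t) for t in terms) + r')\b')

# One pass per document; the results are collected in a new list.
result = [pattern.sub('<placeholder>', doc) for doc in document_list]
print(result)
```

Note that this builds a new list rather than reassigning the loop variable, so the replaced documents are actually kept. I have not timed it on the full dataset.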