I have two list objects: wiki_text and corpus. wiki_text is made up of small phrases and corpus is made up of long sentences.
wiki_text = ['never ending song of love - ns.jpg',
'ecclesiological society',
"1955-56 michigan wolverines men's basketball team",
'sphinx strix',
'petlas',
'1966 mlb draft',
...]
corpus = ['Substantial progress has been made in the last twenty years',
'Patients are at risk for prostate cancer.',...]
My goal is to create a filter that keeps the elements of wiki_text that appear as a substring of at least one element in corpus. For example, if 'ecclesiological society' occurs inside a sentence in corpus, it should be kept in the final result. The final result should be a subset of the original wiki_text. The following is the code I used before:
def wiki_filter(wiki_text, corpus):
    result = []
    for phrase in wiki_text:
        for sentence in corpus:
            if phrase in sentence:
                result.append(phrase)
                break
    return result
However, wiki_text and corpus are both very long (each has more than 10 million elements), so this function takes many hours to run. Is there a better way to solve this problem?
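For context on what I have tried since: the nested loop is O(len(wiki_text) × len(corpus)) substring searches, and one standard way to cut that down is multi-pattern matching with an Aho-Corasick automaton, which builds a trie over all the phrases once and then scans each corpus sentence a single time. Below is a minimal stdlib-only sketch of that idea (the function name wiki_filter_fast and the simplified automaton are my own; in practice a C-backed library such as pyahocorasick would be much faster than this pure-Python version):

```python
from collections import deque

def build_automaton(patterns):
    """Build a simple Aho-Corasick automaton over the given phrases."""
    goto = [{}]        # goto[node] maps a character to the next node
    out = [set()]      # out[node] holds phrases that end at this node
    fail = [0]         # fail[node] is the fallback node on a mismatch
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                out.append(set())
                fail.append(0)
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(pat)
    # BFS from the root to fill in failure links
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, out, fail

def wiki_filter_fast(wiki_text, corpus):
    """Keep phrases from wiki_text that occur inside any corpus sentence."""
    goto, out, fail = build_automaton(set(wiki_text))
    found = set()
    for sentence in corpus:
        node = 0
        for ch in sentence:
            while node and ch not in goto[node]:
                node = fail[node]
            node = goto[node].get(ch, 0)
            if out[node]:
                found |= out[node]
    # Preserve the original wiki_text order in the result
    return [p for p in wiki_text if p in found]
```

The total work is roughly proportional to the combined length of the phrases plus the combined length of the corpus, instead of their product, so it should scale far better than the nested loop; memory for the trie over 10M phrases is the main cost to watch.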