I want to check a set of sentences and see whether some seed words occurs in the sentences. but i want to avoid using for seed in line
because that would have say that a seed word ring
would have appeared in a doc with the word bring
.
I also want to check whether multiword expressions (MWE) like word with spaces
appears in the document.
I've tried this but this is uber slow, is there a faster way of doing this?
seed = ['words with spaces', 'words', 'foo', 'bar',
'bar bar', 'foo foo foo bar', 'ring']
docs = ['these are words with spaces but the drinks are the bar is also good',
'another sentence at the foo bar is here',
'then a bar bar black sheep,
'but i dont want this sentence because there is just nothing that matches my list',
'i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too']
docs_seed = []
for d in docs:
toAdd = False
for s in seeds:
if " " in s:
if s in d:
toAdd = True
if s in d.split(" "):
toAdd = True
if toAdd == True:
docs_seed.append((s,d))
break
print docs_seed
The desired output should be this:
[('words with spaces','these are words with spaces but the drinks are the bar is also good')
('foo','another sentence at the foo bar is here'),
('bar', 'then a bar bar black sheep')]