I want to extract sentences that containing a drug and gene name from 10,000 articles. and my code is
import re
import glob
import fnmatch
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
flist= glob.glob ("C:/Users/Emma Belladona/Desktop/drug working/*.txt")
print (flist)
for txt in flist:
#print (txt)
fr = open (txt, "r")
tmp = fr.read().strip()
a = (sent_tokenize(tmp))
b = (word_tokenize(tmp))
for c, value in enumerate(a, 1):
if value.find("SLC22A1") != -1 and value.find("Metformin"):
print ("Result", value)
re.findall("\w+\s?[gene]+", a)
else:
if value.find("Metformin") != -1 and value.find("SLC22A1"):
print ("Results", value)
if value.find("SLC29B2") != -1 and value.find("Metformin"):
print ("Result", value)
I want to extract sentences that have gene and drug name from the whole body of article. For example "Metformin decreased logarithmically converted SLC22A1 excretion (from 1.5860.47 to 1.0060.52, p¼0.001)." "In conclusion, we could not demonstrate striking associations of the studied polymorphisms of SLC22A1, ACE, AGTR1, and ADD1 with antidiabetic responses to metformin in this well-controlled study."
This code return a lot of sentences i.e if one word of above came into the sentence that get printed out...! Help me making the code for this