I want to extract sentences that contain a drug and gene name from 10,000 articles

Question

I want to extract sentences that contain both a drug name and a gene name from 10,000 articles. My code is:

import re
import glob
import fnmatch
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize


flist= glob.glob ("C:/Users/Emma Belladona/Desktop/drug working/*.txt")
print (flist)
for txt in flist:
    #print (txt)
    fr = open (txt, "r")
    tmp = fr.read().strip()
    a = (sent_tokenize(tmp))
    b = (word_tokenize(tmp))
    for c, value in enumerate(a, 1):
        if value.find("SLC22A1") != -1 and value.find("Metformin"):
            print ("Result", value)
            re.findall("\w+\s?[gene]+", a)
        else:
            if value.find("Metformin") != -1 and value.find("SLC22A1"):
                print ("Results", value)
        if value.find("SLC29B2") != -1 and value.find("Metformin"):
            print ("Result", value)

I want to extract sentences that contain both a gene name and a drug name from the whole body of each article. For example: "Metformin decreased logarithmically converted SLC22A1 excretion (from 1.5860.47 to 1.0060.52, p¼0.001)." "In conclusion, we could not demonstrate striking associations of the studied polymorphisms of SLC22A1, ACE, AGTR1, and ADD1 with antidiabetic responses to metformin in this well-controlled study."

This code returns far too many sentences: if just one of the words above appears in a sentence, that sentence gets printed. Please help me fix the code.

Please describe exactly what is going wrong. What should happen, and what happens instead? — lenz, Nov 14 '16 at 08:39
With the statement `print(tmp)`, you print everything you read in -- regardless of your searches. If your problem is that you have more `Result` lines in the output than you want, then clarify your question. — alexis, Nov 14 '16 at 09:52
`if value.find("SLC22A1") != -1 and value.find("Metformin")`: what do you want to do here? Check whether "SLC22A1" and "Metformin" are both in the value? Because if that's it, then it's wrong. — Jean-François Fabre, Nov 14 '16 at 09:54
@Jean-François Fabre Actually, I want to extract sentences that contain both a gene name and a drug name from the whole body of each article. For example: "Metformin decreased logarithmically converted SLC22A1 excretion (from 1.5860.47 to 1.0060.52, p¼0.001)." "In conclusion, we could not demonstrate striking associations of the studied polymorphisms of SLC22A1, ACE, AGTR1, and ADD1 with antidiabetic responses to metformin in this well-controlled study." — Emma Belladonna, Nov 16 '16 at 07:19
So this value.find statement returns a huge number of sentences, because it prints every sentence containing just one of the words (gene or drug), but I only want the sentences that contain both words to be printed. — Emma Belladonna, Nov 16 '16 at 07:19
Good that you fixed the `print(tmp)` line, but I still don't believe this is the code you are really running. This line will give you an error since `a` is a list, not a string: `re.findall("\w+\s?[gene]+", a)` — alexis, Nov 16 '16 at 09:30
Yes @alexis, this code is not working, which is why I asked the question here; I don't have any programming background, as I am from the medical field. Anyhow, I have sorted out this issue by using sentence boundary detection (\b) in the re.findall statement. :) — Emma Belladonna, Nov 22 '16 at 07:13

score 1 · Answer 1 · answered Nov 16 '16 at 09:37

You don't show your real code, but the code you have now has at least one mistake that would lead to lots of spurious output. It's on this line:

re.findall("\w+\s?[gene]+", a)

This regexp does not match strings containing gene, as you clearly intended. Because [gene] is a character class, it matches (almost) any string that contains one of the letters g, e or n.

This cannot be your real code, since a is a list and you would get an error on this line-- plus you ignore the results of the findall()! Sort out your question so it reflects reality. If your problem is still not solved, edit your question and include at least one sentence that is part of the output but you do NOT want to be seeing.
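
To make the character-class point concrete, here is a small sketch using only the standard `re` module. The sample strings and the variable names are invented for the demonstration, and the `\bgene\b` pattern is just one possible way to match the literal word, not necessarily what the question needs.

```python
import re

# Pattern from the question: [gene] is a character class, so [gene]+
# matches any run of the letters g, e or n, not the word "gene".
loose = re.compile(r"\w+\s?[gene]+")

# One way to match the literal word "gene" (illustrative assumption).
strict = re.compile(r"\bgene\b")

# Made-up sample strings.
samples = ["the SLC22A1 gene", "one", "dry"]

loose_hits = [s for s in samples if loose.search(s)]
strict_hits = [s for s in samples if strict.search(s)]
print(loose_hits)
print(strict_hits)

# Passing a list instead of a string, as in the question, raises TypeError.
try:
    re.findall(r"\w+\s?[gene]+", samples)
    list_raises = False
except TypeError:
    list_raises = True
print(list_raises)
```

The loose pattern accepts "one" even though it has nothing to do with genes, while "dry" is rejected only because it happens to contain none of the three letters.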

— alexis, Nov 16 '16 at 09:37

Yes, that one is a mistake. Please help me write proper statements that fix the issue, since you understand what I want to extract. Are there any other approaches you would suggest? – Emma Belladonna Nov 16 '16 at 10:17

If you can't even copy-paste and proofread your own code, what kind of help do you expect from me? Get your question in order and someone can try to help you. Nobody here likes guessing at what question-askers are actually up to. – alexis Nov 16 '16 at 10:48

score 0 · Answer 2 · answered Nov 16 '16 at 09:17

When you do this:

if value.find("SLC22A1") != -1 and value.find("Metformin"):

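You're testing for "SLC22A1" being in the string and "Metformin" not being at the start of the string: find returns 0 (falsy) only for a match at index 0, and -1 (truthy) when there is no match at all. The second part is probably not what you want.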

You probably wanted this:

if value.find("SLC22A1") != -1 and value.find("Metformin") != -1:

The find method is error-prone here because of its return value, and you don't care about the position, so you'd be better off with `in`.

To test for two words in a sentence (case-insensitively for the second term), do it like this:

if "SLC22A1" in value and "metformin" in value.lower():
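
A short sketch of why the original test misfires and how the `in` version behaves. The two sentences are invented for the demonstration, and `has_both` is a hypothetical helper name, not something from the question.

```python
def has_both(sentence, gene="SLC22A1", drug="metformin"):
    """Return True only if both terms occur in the sentence.

    The gene symbol is matched case-sensitively and the drug name
    case-insensitively, mirroring the condition suggested above.
    """
    return gene in sentence and drug in sentence.lower()

# Made-up example sentences.
only_gene = "SLC22A1 expression was measured in liver tissue."
both = "Response to metformin varied with SLC22A1 genotype."

# str.find returns -1 when the substring is absent, and -1 is truthy,
# so the original condition accepts a sentence with no drug name at all.
original_test = only_gene.find("SLC22A1") != -1 and only_gene.find("Metformin")
print(bool(original_test))

print(has_both(only_gene))
print(has_both(both))
```

Only the sentence that mentions both terms passes `has_both`, whereas the original condition also lets the gene-only sentence through.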

Can I try any other options, like regular expressions? — Emma Belladonna, Nov 16 '16 at 10:32

score 0 · Answer 3 · edited May 23 '17 at 12:24

I'd take a different approach:

1. Read in the text file.
2. Split the text into sentences. Check out https://stackoverflow.com/a/28093215/223543 for a hand-rolled approach to do this. Or you could use the nltk.tokenize.punkt module. (Edited after Alexis pointed me in the right direction in the comments below.)
3. Check whether your key terms appear in each sentence, and print the sentence if they do.

As long as your text files are well formatted, this should work.
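
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `naive_split` is a hypothetical stand-in splitter that breaks after sentence-ending punctuation followed by whitespace, `extract_sentences` is an invented name, and both terms are compared case-insensitively.

```python
import re

def naive_split(text):
    """Hypothetical stand-in splitter: break after ., ! or ? plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def extract_sentences(text, gene, drug, split=naive_split):
    """Return the sentences that mention both the gene and the drug.

    Both terms are compared case-insensitively (an assumption).
    """
    gene, drug = gene.lower(), drug.lower()
    hits = []
    for sentence in split(text):
        low = sentence.lower()
        if gene in low and drug in low:
            hits.append(sentence)
    return hits

# Made-up sample text standing in for the contents of one article file.
sample = (
    "Metformin is widely prescribed. "
    "SLC22A1 encodes a transporter. "
    "Metformin uptake depends on SLC22A1 activity."
)
result = extract_sentences(sample, "SLC22A1", "Metformin")
print(result)
```

For real articles, nltk's `sent_tokenize` (already used in the question) can be passed as the `split` argument in place of the naive splitter, with the function called once per file returned by `glob.glob`.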

— Narendra Nag, Nov 16 '16 at 09:48
The OP already uses the nltk, which provides sentence-splitting. What's "brilliant" about hacking your own inferior solution? – alexis Nov 16 '16 at 11:35
You're right. I'm about two weeks in to learning all the Python libraries. Just read up on the nltk.tokenize.punkt module. – Narendra Nag Nov 16 '16 at 14:41
Glad you made the correction, but still your "answer" just tells the OP to do what she is already doing. Next time, ensure you can actually solve the problem before you write an answer. Otherwise it's not an answer. – alexis Nov 16 '16 at 15:51
