extract certain paragraph from text

Question

I'm new to Python and I have a question. I have several text files, and I would like to extract the CONCLUSION part of each file.
The text files look like this:

RESULTS: In adjusted analyses, doubling the hourly PAC count was associated with a significant increase in AF risk (hazard ratio, 1.17 [95% CI, 1.13 to 1.22]
LIMITATION: This study does not establish a causal link between PACs and AF.
CONCLUSION: The addition of PAC count to a validated AF risk algorithm provides superior AF risk discrimination and significantly improves risk reclassification. Further study is needed to determine whether PAC modification can prospectively reduce AF risk.
PRIMARY FUNDING SOURCE: American Heart Association, Joseph Drown Foundation, and National Institutes of Health.

I also have multiple files in the same folder. How can I do the same for all the files in this folder?
Thank you in advance!

Is the CONCLUSION always a single paragraph, or might there be more than one newline in it? – MattDMo, Mar 19 '14 at 18:55
It's a single paragraph, but with more than one newline in it. In my example, it has three newlines. @MattDMo – lgxqzz, Mar 19 '14 at 19:23

Taxellool · Accepted Answer · 2014-03-19T20:20:17.007

2

I'm not good at regex, and not so sure if it's the best way, but it works :)

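import os
import re

path = 'path/to/your/files/'
# Compile once; the raw string keeps the regex backslashes intact.
pattern = re.compile(r'CONCLUSION:\s*([\s\w.]*)\n[A-Z\s]*:')
for name in os.listdir(path):
    with open(os.path.join(path, name)) as f:
        content = f.read()
    matches = pattern.findall(content)
    if matches:  # skip files without a CONCLUSION section
        print(matches[0])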

edited Mar 19 '14 at 20:20

answered Mar 19 '14 at 19:12

Taxellool


Thanks. But the CONCLUSION part has multiple lines, and this code only returns the first line. Do you know how to include all the lines in CONCLUSION? @Taxellool – lgxqzz Mar 19 '14 at 19:26
Does the 'PRIMARY FUNDING SOURCE:' paragraph always come after 'CONCLUSION:'? – Taxellool Mar 19 '14 at 19:36
No, sometimes it is just followed by a new line with PMID: 23629735 [PubMed - indexed for MEDLINE] @Taxellool – lgxqzz Mar 19 '14 at 19:41
Not sure why the answer received a -1. It was close to what was expected, and the newline misunderstanding was more a confusion in the question – Izaaz Yunus Mar 19 '14 at 19:44
Thanks, and the new line always starts with an uppercase word. @Taxellool – lgxqzz Mar 19 '14 at 19:44

anon582847382 · Answer 2 · 2014-10-31T21:36:22.670

You should use regular expressions to extract the data that you need:

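import os
import re

PATH = 'path/to/your/files/'

# A heading is taken to be capital letters (and spaces) at the start of a
# line, followed by a colon. re.DOTALL lets '.' span the newlines inside
# the CONCLUSION paragraph.
pattern = re.compile(r'CONCLUSION:\s*(.*?)(?=\n[A-Z][A-Z ]*:|\Z)', re.DOTALL)

conclusions = []
for name in os.listdir(PATH):
    with open(os.path.join(PATH, name)) as f:
        data = f.read()

    match = pattern.search(data)
    if match:  # skip files without a CONCLUSION section
        conclusions.append(match.group(1).strip())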

This looks for the 'CONCLUSION:' header and captures the text after it, stopping at the next heading, which you said always begins with an uppercase word.
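To try the idea without touching the file system, here is a minimal sketch that runs on an in-memory sample. The function name `extract_conclusion` is hypothetical, and the heading rule (capital letters and spaces at the start of a line, followed by a colon) is an assumption based on the example and comments in the question, so adjust it if your headings look different.

```python
import re

# Assumed heading rule: uppercase letters/spaces at the start of a line,
# then a colon (covers "PRIMARY FUNDING SOURCE:" and "PMID:").
CONCLUSION_RE = re.compile(
    r'^CONCLUSION:\s*(.*?)(?=^[A-Z][A-Z ]*:|\Z)',
    re.DOTALL | re.MULTILINE,
)

def extract_conclusion(text):
    """Return the CONCLUSION paragraph of text, or None if there is none."""
    match = CONCLUSION_RE.search(text)
    if match is None:
        return None
    # Collapse internal newlines so a wrapped paragraph becomes one line.
    return ' '.join(match.group(1).split())

sample = (
    "LIMITATION: This study does not establish a causal link.\n"
    "CONCLUSION: The addition of PAC count improves\n"
    "risk reclassification. Further study is needed.\n"
    "PMID: 23629735 [PubMed - indexed for MEDLINE]\n"
)
print(extract_conclusion(sample))
```

Because the terminator requires a colon after the capitals, abbreviations such as PAC or AF inside the paragraph should not cut the match short.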

score 0 · Answer 3 · edited May 23 '17 at 12:11


This will help you to list all the files in the directory.

Then for each file,

Iterate through all the lines.
See if the current line starts with CONCLUSION:
Take the substring of that line after CONCLUSION: to get its contents.
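The steps above can be sketched roughly as follows. Since the asker says the CONCLUSION spans several lines, this version also keeps collecting continuation lines until the next heading. The helper names `conclusion_from_lines` and `conclusions_in_folder` are hypothetical, and the heading test (capital letters and spaces followed by a colon at the start of a line) is an assumption taken from the sample in the question.

```python
import os
import re

# Assumed heading rule: a line starting with capital letters/spaces and a colon.
HEADING_RE = re.compile(r'^[A-Z][A-Z ]*:')

def conclusion_from_lines(lines):
    """Collect the CONCLUSION line plus continuation lines up to the next heading."""
    collected = []
    inside = False
    for line in lines:
        if line.startswith('CONCLUSION:'):
            inside = True
            # Substring after the header itself.
            collected.append(line[len('CONCLUSION:'):].strip())
        elif inside:
            if HEADING_RE.match(line):
                break  # next section reached
            collected.append(line.strip())
    return ' '.join(part for part in collected if part)

def conclusions_in_folder(folder):
    """Map each file name in folder to its CONCLUSION text (empty if absent)."""
    results = {}
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        if os.path.isfile(full_path):
            with open(full_path) as f:
                results[name] = conclusion_from_lines(f)
    return results
```

Calling `conclusions_in_folder('path/to/your/files/')` would then give one entry per file, without needing a regular expression for the paragraph body.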

edited May 23 '17 at 12:11

Community


answered Mar 19 '14 at 19:02

Izaaz Yunus

