Repeatedly extract a line between two delimiters in a text file, Python

Question

I have a text file in the following format:

DELIMITER1
extract me
extract me
extract me
DELIMITER2

I'd like to extract every block of "extract me" lines between DELIMITER1 and DELIMITER2 in the .txt file.

This is my current, non-working code:

import re
def GetTheSentences(file):
     fileContents =  open(file)
     start_rx = re.compile('DELIMITER')
     end_rx = re.compile('DELIMITER2')

     line_iterator = iter(fileContents)
     start = False
     for line in line_iterator:
           if re.findall(start_rx, line):

                start = True
                break
      while start:
           next_line = next(line_iterator)
           if re.findall(end_rx, next_line):
                break

           print next_line

           continue
      line_iterator.next()

Any ideas?

Brent Newey · Accepted Answer · 2011-08-19T13:19:01.133

29

You can simplify this to one regular expression using re.S, the DOTALL flag.

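import re
def GetTheSentences(infile):
    with open(infile) as fp:
        # re.S (DOTALL) lets "." match newlines, so a block can span several lines
        for result in re.findall(r'DELIMITER1(.*?)DELIMITER2', fp.read(), re.S):
            print(result.strip())  # strip() drops the newlines next to the delimiters
# extract me
# extract me
# extract me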

This also uses the non-greedy quantifier .*?, so each match stops at the nearest DELIMITER2 and multiple non-overlapping DELIMITER1-DELIMITER2 blocks will all be found.
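To illustrate the difference, here is a small sketch that compares the non-greedy pattern with a greedy one. The sample text (including the "noise" line) is made up for the demonstration and is not from the question.

```python
import re

# Made-up sample with two delimited blocks and an unrelated line between them.
text = (
    "DELIMITER1\nfirst block\nDELIMITER2\n"
    "noise\n"
    "DELIMITER1\nsecond block\nDELIMITER2\n"
)

# Non-greedy: each match stops at the nearest DELIMITER2.
lazy = [m.strip() for m in re.findall(r"DELIMITER1(.*?)DELIMITER2", text, re.S)]

# Greedy: a single match running from the first DELIMITER1 to the last DELIMITER2.
greedy = [m.strip() for m in re.findall(r"DELIMITER1(.*)DELIMITER2", text, re.S)]

print(lazy)         # ['first block', 'second block']
print(len(greedy))  # 1
```

With the greedy version, the single match also swallows the inner delimiters and the "noise" line, which is usually not what you want here.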

edited Aug 19 '11 at 13:19

answered Aug 17 '11 at 19:59

Brent Newey

tip: use this with a memory-mapped file object (via the `mmap` module) if your file is too large to read in all at once. – Steven Aug 17 '11 at 20:58
@Brent Tried this out and it functions nicely...Thanks! – Renklauf Aug 19 '11 at 13:08
Glad I could help. Don't forget to mark an answer as accepted if it is the best answer to your question. – Brent Newey Aug 19 '11 at 13:19
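The mmap tip in the comments could look roughly like the sketch below. The function name, its parameters and the UTF-8 default are assumptions for illustration, not part of the answer. Note that a memory-mapped file is searched as bytes, so the pattern has to be a bytes pattern too.

```python
import mmap
import re


def extract_blocks_mmap(path, start=b"DELIMITER1", end=b"DELIMITER2",
                        encoding="utf-8"):
    """Return the text between each start/end pair without calling fp.read()."""
    # Same non-greedy idea as the answer, built from bytes because mmap is bytes-like.
    pattern = re.compile(re.escape(start) + b"(.*?)" + re.escape(end), re.S)
    with open(path, "rb") as fp:
        # Length 0 maps the whole file; mmap raises ValueError for an empty file.
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            raw_blocks = pattern.findall(mm)
    # Decode after the map is closed; the encoding is an assumption about the file.
    return [block.strip().decode(encoding) for block in raw_blocks]
```

The matched blocks themselves are still copied into memory, so this mainly helps when the file is large but the delimited parts are small.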

agf · Answer 2 · 2013-08-29T16:01:22.720

5

If the delimiters are within a line:

def get_sentences(filename):
    with open(filename) as file_contents:
        d1, d2 = '.', ',' # just example delimiters
        for line in file_contents:
            i1 = line.find(d1)
            i2 = line.find(d2, i1 + len(d1))  # look for d2 only after d1
            if -1 < i1 < i2:
                yield line[i1+len(d1):i2]  # also works for multi-character delimiters


sentences = list(get_sentences('path/to/my/file'))

If they are on their own lines:

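def get_sentences(filename):
    with open(filename) as file_contents:
        d1, d2 = '.', ',' # just example delimiters
        results = None  # None means we are outside a block
        for line in file_contents:
            if d1 in line:
                results = []  # start a new block
            elif d2 in line and results is not None:
                yield results
                results = None  # ignore lines until the next d1
            elif results is not None:
                results.append(line)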

sentences = list(get_sentences('path/to/my/file'))

edited Aug 29 '13 at 16:01

answered Aug 17 '11 at 19:55

agf

Traceback (most recent call last): File "", line 1, in File "", line 10, in get_sentences UnboundLocalError: local variable 'results' referenced before assignment – amadain Aug 29 '13 at 09:51
@amadain I added a line to initialize results, but looking at this I'm not sure it's correct anyway. – agf Aug 29 '13 at 16:01

score 2 · Answer 3 · answered Aug 17 '11 at 19:54

This should do what you want:

import re
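def GetTheSentences(file):
    # 'DELIMITER' alone would also match the DELIMITER2 line, so use the full name
    start_rx = re.compile('DELIMITER1')
    end_rx = re.compile('DELIMITER2')

    start = False
    output = []
    with open(file) as datafile:  # text mode, so lines are str and match the patterns
        for line in datafile.readlines():
            if start_rx.match(line):
                start = True
            elif end_rx.match(line):
                start = False
            elif start:  # elif keeps the delimiter lines out of the output
                output.append(line)
    return output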

Your previous version looks like it's supposed to be an iterator function. Do you want your output returned one item at a time? That's slightly different.

answered Aug 17 '11 at 19:54

Spencer Rathbun

There is no need to read the whole file into memory. You also don't need regular expressions if it's something as simple as finding specific substring in a line. – agf Aug 17 '11 at 19:56
@agf Of course not, but his simplistic example may not exactly correspond with his data. I've done a very similar thing over a postscript file, and I absolutely had to have regular expressions for my start and end points. – Spencer Rathbun Aug 17 '11 at 20:11
@Renklauf no problem, that's what we're here for. You may want to pick one as the answer though... – Spencer Rathbun Aug 19 '11 at 13:25
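For the point raised in the comments (no full read into memory, no regular expressions), a minimal sketch of a line-by-line generator is shown below. It assumes each delimiter sits alone on its own line, as in the question; the name iter_blocks and the sample data are made up for illustration.

```python
import io


def iter_blocks(lines, start="DELIMITER1", end="DELIMITER2"):
    """Yield one list of lines per start/end pair, reading lazily."""
    block = None  # None while we are outside a block
    for line in lines:
        text = line.rstrip("\r\n")
        if text == start:
            block = []  # open a new block
        elif text == end and block is not None:
            yield block
            block = None  # ignore everything until the next start
        elif block is not None:
            block.append(text)


# io.StringIO stands in for an open text file in this example.
sample = io.StringIO(
    "header\nDELIMITER1\na\nb\nDELIMITER2\nnoise\nDELIMITER1\nc\nDELIMITER2\n"
)
blocks = list(iter_blocks(sample))
print(blocks)  # [['a', 'b'], ['c']]
```

Because it accepts any iterable of lines, an open file object can be passed in directly and only one block is held in memory at a time. A block that is never closed by an end delimiter is silently dropped in this sketch.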

score 0 · Answer 4 · answered May 10 '15 at 05:00

This is a good job for list comprehensions, no regex required. The first list comprehension strips the trailing \n from each line read from the text file. The second uses the not in operator to filter out the delimiter lines.

def extract_lines(file):
    scrubbed = [x.strip('\n') for x in open(file, 'r')]
    return [x for x in scrubbed if x not in ('DELIMITER1','DELIMITER2')]

answered May 10 '15 at 05:00

cheekybastard

This returns the entire file except those two lines, not the lines between the delimiters. – tripleee Feb 09 '21 at 08:24
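One way to keep the regex-free, list-based spirit of this answer while returning only the lines between the delimiters is sketched below. It swaps the second list comprehension for itertools.takewhile; the function name and the sample lines are assumptions for illustration.

```python
from itertools import takewhile


def blocks_between(lines, start="DELIMITER1", end="DELIMITER2"):
    """Return a list of blocks, each a list of the lines between start and end."""
    # Same scrubbing step as the answer: drop the trailing newline from every line.
    scrubbed = iter([line.rstrip("\n") for line in lines])
    blocks = []
    for line in scrubbed:
        if line == start:
            # takewhile consumes lines up to and including the end delimiter,
            # but only returns the ones that come before it.
            blocks.append(list(takewhile(lambda s: s != end, scrubbed)))
    return blocks


lines = ["intro\n", "DELIMITER1\n", "extract me\n", "extract me\n",
         "DELIMITER2\n", "outro\n"]
print(blocks_between(lines))  # [['extract me', 'extract me']]
```

An open file can be passed as lines. If the last block has no closing delimiter, this sketch still returns whatever followed the opening one.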
