14

I have a text file in the following format:

DELIMITER1
extract me
extract me
extract me
DELIMITER2

I'd like to extract every block of "extract me" lines between DELIMITER1 and DELIMITER2 in the .txt file.

This is my current, non-working code:

import re
def GetTheSentences(file):
     fileContents =  open(file)
     start_rx = re.compile('DELIMITER')
     end_rx = re.compile('DELIMITER2')

     line_iterator = iter(fileContents)
     start = False
     for line in line_iterator:
           if re.findall(start_rx, line):

                start = True
                break
      while start:
           next_line = next(line_iterator)
           if re.findall(end_rx, next_line):
                break

           print next_line

           continue
      line_iterator.next()

Any ideas?

Brent Newey
Renklauf

4 Answers

29

You can simplify this to one regular expression using re.S, the DOTALL flag.

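import re
def GetTheSentences(infile):
     with open(infile) as fp:
         # re.S (DOTALL) lets '.' match newlines, so one match can span several lines
         for result in re.findall('DELIMITER1(.*?)DELIMITER2', fp.read(), re.S):
             # strip the newlines that follow DELIMITER1 and precede DELIMITER2
             print(result.strip())
# extract me
# extract me
# extract me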

This also uses the non-greedy quantifier .*?, so each match stops at the nearest DELIMITER2 and multiple non-overlapping DELIMITER1-DELIMITER2 blocks will all be found.
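As a quick check of the non-greedy behaviour, here is a minimal sketch (Python 3) that applies the same pattern to an in-memory string rather than a file. The helper name extract_blocks and the sample text are assumptions made up for illustration; they do not come from the question.

```python
import re

# Same pattern as in the answer; re.S lets '.' match newlines.
BLOCK_RX = re.compile(r'DELIMITER1(.*?)DELIMITER2', re.S)

def extract_blocks(text):
    """Return the stripped text of each DELIMITER1...DELIMITER2 block."""
    return [block.strip() for block in BLOCK_RX.findall(text)]

# Hypothetical sample containing two blocks and one line outside any block.
sample = (
    "DELIMITER1\nfirst a\nfirst b\nDELIMITER2\n"
    "ignored\n"
    "DELIMITER1\nsecond a\nDELIMITER2\n"
)

blocks = extract_blocks(sample)
print(blocks)  # ['first a\nfirst b', 'second a']
```

With a greedy .* in place of .*?, the same sample would produce a single match running from the first DELIMITER1 to the last DELIMITER2.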

Brent Newey
5

If the delimiters are within a line:

def get_sentences(filename):
    with open(filename) as file_contents:
        d1, d2 = '.', ',' # just example delimiters
        for line in file_contents:
            i1 = line.find(d1)
            if i1 == -1:
                continue
            # look for d2 only after d1, and skip the full length of d1
            i2 = line.find(d2, i1 + len(d1))
            if i2 != -1:
                yield line[i1 + len(d1):i2]


sentences = list(get_sentences('path/to/my/file'))

If they are on their own lines:

def get_sentences(filename):
    with open(filename) as file_contents:
        d1, d2 = '.', ',' # just example delimiters
        results = None # None means we are outside a block
        for line in file_contents:
            if d1 in line:
                results = []
            elif results is not None:
                if d2 in line:
                    yield results
                    results = None # do not collect lines between blocks
                else:
                    results.append(line)

sentences = list(get_sentences('path/to/my/file'))
agf
  • Traceback (most recent call last): File "", line 1, in File "", line 10, in get_sentences UnboundLocalError: local variable 'results' referenced before assignment – amadain Aug 29 '13 at 09:51
  • @amadain I added a line to initialize results, but looking at this I'm not sure it's correct anyway. – agf Aug 29 '13 at 16:01
2

This should do what you want:

import re
def GetTheSentences(file):
    # \b keeps the patterns exact, so the start pattern cannot also match DELIMITER2
    start_rx = re.compile(r'DELIMITER1\b')
    end_rx = re.compile(r'DELIMITER2\b')

    start = False
    output = []
    with open(file) as datafile:
         for line in datafile.readlines():
             if re.match(start_rx, line):
                 start = True
             elif re.match(end_rx, line):
                 start = False
             elif start:
                  # only lines strictly between the delimiters are collected
                  output.append(line)
    return output

Your previous version looks like it's supposed to be an iterator function. Do you want your output returned one item at a time? That's slightly different.
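If you do want one block at a time, a generator is one way to do it. The following is only a sketch (Python 3): the name iter_blocks, its start/end parameters, and the assumption that each delimiter sits alone on its own line are illustrative choices, not something taken from the question.

```python
def iter_blocks(lines, start='DELIMITER1', end='DELIMITER2'):
    """Yield each block as a list of the lines found between start and end."""
    block = None  # None means we are currently outside a block
    for line in lines:
        stripped = line.rstrip('\n')
        if stripped == start:
            block = []  # a new block begins
        elif stripped == end and block is not None:
            yield block  # hand back the finished block
            block = None
        elif block is not None:
            block.append(stripped)

# Hypothetical sample standing in for the lines of a file.
sample = [
    "DELIMITER1\n", "a\n", "b\n", "DELIMITER2\n",
    "skip\n",
    "DELIMITER1\n", "c\n", "DELIMITER2\n",
]

for block in iter_blocks(sample):
    print(block)
```

Because it only iterates over lines, the same function accepts an open file object, so the file never has to be read into memory in one go.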

Spencer Rathbun
  • 1
    There is no need to read the whole file into memory. You also don't need regular expressions if it's something as simple as finding specific substring in a line. – agf Aug 17 '11 at 19:56
  • @agf Of course not, but his simplistic example may not exactly correspond with his data. I've done a very similar thing over a postscript file, and I absolutely had to have regular expressions for my start and end points. – Spencer Rathbun Aug 17 '11 at 20:11
  • @Renklauf no problem, that's what we're here for. You may want to pick one as the answer though... – Spencer Rathbun Aug 19 '11 at 13:25
0

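This is a good job for list comprehensions, no regex required. The first list comprehension strips the trailing \n from each line read from the text file. The second uses the in operator to filter out the delimiter lines themselves. Note that this keeps every non-delimiter line, so it assumes the file contains nothing outside the delimited blocks.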

def extract_lines(file):
    # the with block closes the file once the lines have been read
    with open(file, 'r') as fp:
        scrubbed = [x.strip('\n') for x in fp]
    return [x for x in scrubbed if x not in ('DELIMITER1','DELIMITER2')]
cheekybastard