Finding data in-between two strings in python

Question

I have a text file whose contents look like this:

PAGE(leave) 'Data1'
line 1
line 2 
line 2
...
...
...
PAGE(enter) 'Data1'

I need to get all the lines between the two keywords and save them to a text file. This is what I have come up with so far, but I have an issue with the single quotes: they seem to be treated as the quotes delimiting the expression rather than as part of the keyword.

My code so far:

log_file = open('messages','r')
data = log_file.read()
block = re.compile(ur'PAGE\(leave\) \'Data1\'[\S ]+\s((?:(?![^\n]+PAGE\(enter\) \'Data1\').)*)', re.IGNORECASE | re.DOTALL)
data_in_home_block=re.findall(block, data)
file = 0
make_directory("home_to_home_data",1)
for line in data_in_home_block:
    file = file + 1
    with open("home_to_home_" + str(file) , "a") as data_in_home_to_home:
        data_in_home_to_home.write(str(line))

It would be great if someone could show me how to implement this.

so your file actually contains a backslash before the parenthesis? Like `\(`? — Savir, Dec 07 '14 at 23:55
Why use regex at all if the keywords are not variable? Just look for them, get their locations in the text, then retrieve what's between. — Joan Charmant, Dec 07 '14 at 23:55

score 1 · Answer 1 · answered Dec 08 '14 at 01:28

As pointed out by @JoanCharmant, it is not necessary to use regex for this task, because the records are delimited by fixed strings.

Something like this should be enough:

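# str.split and str.rpartition match literal text, not regex syntax, so the
# markers are written exactly as they appear in the question's sample.
with open('messages') as log_file:
    messages = log_file.read()

blocks = [block.rpartition("PAGE(enter) 'Data1'")[0]
          for block in messages.split("PAGE(leave) 'Data1'")
          if block and not block.isspace()]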

for count, block in enumerate(blocks, 1):
    with open('home_to_home_%d' % count, 'a') as stream:
        stream.write(block)
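
The suggestion from the comments (find the markers, then take what lies between them) can also be sketched with `str.find`. This is only an illustration, assuming Python 3 and markers written without backslashes as in the question's sample; the helper name `extract_blocks` and the sample text are made up for the example.

```python
def extract_blocks(text, start_marker, end_marker):
    """Return the text found between each start/end marker pair."""
    blocks = []
    pos = 0
    while True:
        start = text.find(start_marker, pos)
        if start == -1:
            break  # no further opening marker
        start += len(start_marker)
        end = text.find(end_marker, start)
        if end == -1:
            break  # opening marker without a closing one: ignore it
        blocks.append(text[start:end])
        pos = end + len(end_marker)
    return blocks


# Hypothetical sample data in the format shown in the question.
sample = (
    "PAGE(leave) 'Data1'\n"
    "line 1\n"
    "line 2\n"
    "PAGE(enter) 'Data1'\n"
    "noise\n"
    "PAGE(leave) 'Data1'\n"
    "line 3\n"
    "PAGE(enter) 'Data1'\n"
)

# The newline is part of the start marker so blocks begin on their first line.
blocks = extract_blocks(sample, "PAGE(leave) 'Data1'\n", "PAGE(enter) 'Data1'")
print(blocks)
```

Unlike the split-based version, text that sits outside a leave/enter pair (the "noise" line here) never ends up in a block.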

score 0 · Accepted Answer · edited May 23 '17 at 11:57

If it is the single quotes that worry you, you can delimit the regular expression string with double quotes:

'hello "howdy"'  # Correct
"hello 'howdy'"  # Correct

Now, there are more issues here. Even when the string is declared as raw (with the `r` prefix), you still must escape every backslash that the regular expression should match literally, because the regex engine treats the backslash as an escape character too (see What does the "r" in pythons re.compile(r' pattern flags') mean?). Without the `r` prefix, you would need even more backslashes.
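
To make the quoting and escaping points concrete, here is a small sketch, assuming Python 3. The marker strings come from the question; the variable names and sample texts are only illustrative. `re.escape` is a standard-library helper that builds the escaped pattern for you.

```python
import re

start_marker = "PAGE(leave) 'Data1'"

# Double quotes outside mean the single quotes need no escaping;
# the r prefix keeps the backslashes intact for the regex engine.
manual = re.compile(r"PAGE\(leave\) 'Data1'")

# re.escape produces an equivalent pattern from the literal marker.
escaped = re.compile(re.escape(start_marker))

text = "PAGE(leave) 'Data1'\nline 1\nPAGE(enter) 'Data1'\n"
manual_hit = manual.search(text).group()
escaped_hit = escaped.search(text).group()

# If the text itself contains backslashes, each one must be doubled
# in the pattern, even inside a raw string.
text_with_backslashes = r"PAGE\(leave\) 'Data1'"
backslash_pattern = re.compile(r"PAGE\\\(leave\\\) 'Data1'")
backslash_hit = backslash_pattern.fullmatch(text_with_backslashes)
```

Both `manual` and `escaped` match the marker as it appears in the question, while `backslash_pattern` is the form needed for a file that really contains `\(` and `\)`.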

I've created a test file with two "sections":

PAGE\(leave\) 'Data1'
line 1
line 2 
line 3
PAGE\(enter\) 'Data1'

PAGE\(leave\) 'Data1'
line 4
line 5 
line 6
PAGE\(enter\) 'Data1'

The code below should do what you want (I think):

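import re

# Read the whole file; the with block closes it automatically.
with open('test.txt', 'r') as log_file:
    data = log_file.read()

# Every fragment needs its own r prefix, because adjacent string literals
# are parsed separately. \\ matches a literal backslash in the file.
block = re.compile(
    r"(PAGE\\\(leave\\\) 'Data1'\n)"
    r"(.*?)"
    r"(PAGE\\\(enter\\\) 'Data1')",
    re.IGNORECASE | re.DOTALL | re.MULTILINE
)
# findall returns one tuple per match; index 1 is the text between the markers.
data_in_home_block = [result[1] for result in block.findall(data)]
for data_block in data_in_home_block:
    print("Found data_block: %s" % (data_block,))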

Outputs:

Found data_block: line 1
line 2 
line 3

Found data_block: line 4
line 5 
line 6
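
The question also asks for each block to be saved to its own file. A minimal sketch of that step, assuming Python 3 and a file whose markers contain no backslashes (as in the question's sample, unlike the test file above). The `home_to_home_` prefix is taken from the question; the function name `save_blocks`, the sample text and the temporary output directory are hypothetical.

```python
import os
import re
import tempfile

BLOCK_RE = re.compile(
    r"PAGE\(leave\) 'Data1'\n"  # opening marker and its line break
    r"(.*?)"                     # block body, as short as possible
    r"PAGE\(enter\) 'Data1'",    # closing marker
    re.DOTALL,
)


def save_blocks(text, out_dir, prefix="home_to_home_"):
    """Write each block to its own numbered file and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    # With a single capture group, findall returns the bodies as plain strings.
    for count, body in enumerate(BLOCK_RE.findall(text), 1):
        path = os.path.join(out_dir, "%s%d" % (prefix, count))
        with open(path, "w") as stream:
            stream.write(body)
        paths.append(path)
    return paths


# Hypothetical sample data in the format shown in the question.
sample = (
    "PAGE(leave) 'Data1'\nline 1\nline 2\nPAGE(enter) 'Data1'\n"
    "PAGE(leave) 'Data1'\nline 3\nPAGE(enter) 'Data1'\n"
)
out_dir = tempfile.mkdtemp()
paths = save_blocks(sample, out_dir)
```

Opening the files with mode `"w"` rather than `"a"` means a second run replaces the earlier output instead of appending to it.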
