
I have a file test.txt with this content:

CAR
one a. , z.
two b.
three c.
AIRPLANE
one a. , z.
two b.
three c.
BOAT
one a. , z.
two b.

I want to extract everything from CAR up to but not including AIRPLANE, and write that into output.txt. This regex gives me everything I need in the capture group:

r"(CAR.*)AIRPLANE"s (link: https://regex101.com/r/QJMJFh/1)

To check that test.txt is actually being read into the program, I do this:

s = open('test.txt')
s_content = s.read()
print(s_content)

It succeeds and produces this:

CAR
one a. , z.
two b.
three c.
AIRPLANE
one a. , z.
two b.
three c.
BOAT
one a. , z.
two b.

However, when I run this:

s_output = re.search(r"(CAR.*)AIRPLANE"s, s_content).group(1)
print(s_output)

It fails and says

  Cell In[85], line 4
    s_output = re.search(r"(CAR.*)AIRPLANE"s, s_content).group(1)
                         ^
SyntaxError: invalid syntax. Perhaps you forgot a comma?

How else can I extract a capture group from this file using the re module?

This question is very similar and in fact I used it as the basis of my code. However, my regex was different than what is in that example, and required different flags on re.search.

  • Does this answer your question? [Match text between two strings with regular expression](https://stackoverflow.com/questions/32680030/match-text-between-two-strings-with-regular-expression) – InSync Jul 15 '23 at 04:00
  • (That might not be the best duplicate target.) Use either `flags=re.S` or the inline modifier `(?s)`. See [this answer](https://stackoverflow.com/a/45981809) for more details. TLDR: [`(?s)CAR.*?(?=AIRPLANE)`](https://regex101.com/r/KTfmh6/2). – InSync Jul 15 '23 at 04:17
  • Thanks for sharing. I checked the first link before posting my question but it would fail because I was missing the flag. I actually used that example as the basis of the code I wrote. – student123456 Jul 15 '23 at 19:02

1 Answer

If you want to run your regex in dot-all mode, where . also matches newlines, pass the flag through the flags argument of re.search:

s_output = re.search(r'(CAR.*?)AIRPLANE', s_content, flags=re.S).group(1)
print(s_output)

Note also that I am using lazy dot here, to stop at the first occurrence of AIRPLANE. More generally, you might want to use this version, that stops at the nearest occurrence of AIRPLANE or the end of the input, whichever happens first:

s_output = re.search(r'(CAR.*?)(?=\bAIRPLANE|$)', s_content, flags=re.S).group(1)
print(s_output)
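Calling .group(1) directly raises AttributeError when nothing matches, because re.search then returns None. Below is a minimal sketch of a more defensive variant built on the second pattern above. The helper name extract_section is hypothetical, and the sample text is inlined instead of being read from test.txt so the snippet runs on its own:

```python
import re

# Sample text mirroring test.txt from the question, inlined for self-containment.
s_content = (
    "CAR\none a. , z.\ntwo b.\nthree c.\n"
    "AIRPLANE\none a. , z.\ntwo b.\nthree c.\n"
    "BOAT\none a. , z.\ntwo b.\n"
)


def extract_section(text, start, stop):
    """Hypothetical helper: return the text from `start` up to, but not
    including, the nearest `stop` (or the end of the input), else None."""
    # re.escape guards against regex metacharacters in the marker strings.
    pattern = "(" + re.escape(start) + r".*?)(?=\b" + re.escape(stop) + r"|$)"
    match = re.search(pattern, text, flags=re.S)
    # re.search returns None when there is no match, so check before .group().
    return match.group(1) if match else None


car_section = extract_section(s_content, "CAR", "AIRPLANE")
missing_section = extract_section(s_content, "TRAIN", "AIRPLANE")
print(repr(car_section))
print(missing_section)
```

To finish the task from the question, the result could then be written out with something like open('output.txt', 'w').write(car_section), after checking that it is not None.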
Tim Biegeleisen
  • This worked! But I'd like to better understand why. What does the flags option do? Is flags=re.S equivalent to the little s at the end of r'(CAR.*?)AIRPLANE's? – student123456 Jul 15 '23 at 03:50
  • The syntax you are trying to use looks more like JavaScript or maybe PHP, where the flag is added after the regex pattern. But Python's regex flavor doesn't support this. – Tim Biegeleisen Jul 15 '23 at 03:51
  • Could you elaborate on why it looks more like JavaScript or PHP? The tool I used to design the regex https://regex101.com/r/QJMJFh/1 is set to Python's version of regex. Thanks – student123456 Jul 15 '23 at 19:06
  • A regex tool using Python's flavor of regex is not the same thing as actual Python code. – Tim Biegeleisen Jul 16 '23 at 00:42
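To illustrate the point made in these comments: the trailing s shown by regex101 is a display convention for the flag, not Python syntax. In Python code the same behaviour is requested through flags=re.S (an alias of re.DOTALL) or through the inline modifier (?s). A small sketch, using a made-up shortened sample string, comparing the spellings:

```python
import re

# Shortened, made-up sample in the same shape as the question's file.
text = "CAR\none\nAIRPLANE\ntwo\n"

# Without dot-all mode, "." does not match newlines, so the search fails.
no_flag = re.search(r"(CAR.*?)AIRPLANE", text)

# Three equivalent ways to let "." match newlines as well.
with_short = re.search(r"(CAR.*?)AIRPLANE", text, flags=re.S)
with_long = re.search(r"(CAR.*?)AIRPLANE", text, flags=re.DOTALL)
with_inline = re.search(r"(?s)(CAR.*?)AIRPLANE", text)

print(no_flag)  # None
print(with_short.group(1) == with_long.group(1) == with_inline.group(1))  # True
```

The inline (?s) form is convenient when the pattern has to travel as a single string, for example in a config file, since the flag stays attached to the pattern itself.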