-4

Can someone explain to me why one matches but two does not?

Example 1

>>> a = 'Prompt: \n'
>>> b = re.compile('Prompt:[ \t]?(?!\n)')
>>> re.search(b, a)
<_sre.SRE_Match object; span=(0, 7), match='Prompt:'>

Example 2

>>> a = 'Prompt: \n'
>>> b = re.compile('Prompt:[ \t]+(?!\n)')
>>> re.search(b, a)
>>>
lpiner
  • `?` makes the string optional. The first one matches because it's not actually going to match the space/tab if `\n` follows it. **Regex *wants* to match**. Some flavours of regex allow the possessive quantifier `?+` such that your pattern becomes `Prompt:[ \t]?+(?!\n)`. Unfortunately, python does not, but this would mitigate this issue. Just change your pattern to `Prompt:(?![ \t]*\n)` – ctwheels Jan 02 '18 at 15:08
  • How enlightening. – lpiner Jan 02 '18 at 15:21
  • Possible duplicate of [Reference - What does this regex mean?](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) – Jean-François Corbett Jan 02 '18 at 15:59

2 Answers

1

Brief

As others have mentioned, ? makes the preceding character class optional. The first pattern matches because the engine is free to skip the space/tab when \n follows it: it first tries to consume the space, the lookahead then fails, and it backtracks to matching zero whitespace. A regex engine tries every way of applying a pattern until one succeeds, and returns that match. The second pattern requires at least one space or tab, which leaves the engine no way out.
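
A minimal sketch of that behaviour, using the two patterns from the question (written here as raw strings, which does not change what they match). The variable names are illustrative only:

```python
import re

text = 'Prompt: \n'

# Optional whitespace: the engine first consumes the space, the lookahead
# then fails on '\n', so it backtracks and matches zero whitespace instead.
optional = re.compile(r'Prompt:[ \t]?(?!\n)')
m = optional.search(text)
print(m.span(), repr(m.group(0)))   # (0, 7) 'Prompt:'

# Mandatory whitespace: the space must be consumed, the lookahead then
# always sees '\n', and there is nothing left to backtrack to.
mandatory = re.compile(r'Prompt:[ \t]+(?!\n)')
print(mandatory.search(text))       # None
```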

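Some regex flavours support the possessive quantifier ?+, so the pattern could be written Prompt:[ \t]?+(?!\n): a possessive quantifier never gives back what it matched, so the engine cannot backtrack to the zero-space alternative. Python's re module only supports this from version 3.11 onward; on older versions, use the lookahead-based pattern below.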
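
As a hedged sketch: Python's re module accepts possessive quantifiers from version 3.11 onward, and on older versions the same no-backtracking effect can be emulated with a lookahead plus a backreference. That emulation is a common workaround and an assumption of this example, not something taken from the answer itself. The names emulated and possessive are illustrative.

```python
import re
import sys

text = 'Prompt: \n'

# Emulation that works on any Python 3: the lookahead captures the optional
# whitespace, and \1 then consumes exactly what was captured. A lookahead is
# not re-entered on backtracking, so the space cannot be given back.
emulated = re.compile(r'Prompt:(?=([ \t]?))\1(?!\n)')
print(emulated.search(text))                           # None
print(repr(emulated.search('Prompt: text').group(0)))  # 'Prompt: '

# Native possessive quantifier, only compiled where the syntax is supported.
if sys.version_info >= (3, 11):
    possessive = re.compile(r'Prompt:[ \t]?+(?!\n)')
    print(possessive.search(text))                     # None
```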


Code

Just change your pattern to the following:

Prompt:(?![ \t]*\n)

Usage

import re

r = re.compile(r"Prompt:(?![ \t]*\n)")

# Doesn't match because no text between Prompt: and \n
s = 'Prompt: \n'
m = r.search(s)
if m:
    print("m: " + m.group(0))

# Matches because text exists between Prompt: and \n
s2 = 'Prompt: Something\n'
m2 = r.search(s2)
if m2:
    print("m2: " + m2.group(0))

The code above prints only "m2: Prompt:", which is correct: only the second string has text (Something) between Prompt: and the newline character.

ctwheels
  • 1
    Seems like I triggered some people. I did not have a very good understanding of lookahead when I posted this. But this is exactly what I needed, thank you. – lpiner Jan 02 '18 at 15:20
0

Your regular expression contains a negative lookahead which specifically rejects any match where the matched string "Prompt: " is followed by a newline.

With [ \t]? there is a way to find a match by not matching the space, so the regex engine chooses that, in its desperate quest to return a match if there is a way to produce one. With [ \t]+ you don't offer a way out, so no match can be found.

It's not entirely clear why you put the assertion there; but removing it certainly allows the string to match as expected and apparently required.

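It doesn't really matter here, but common practice is to use raw Python strings r'...' for regular expressions. In your example, having Python replace \t with a literal tab and \n with a literal newline is odd but technically harmless, since those are the actual characters you want to match (or, for the newline, not match). It breaks with other backslash sequences, though: \b becomes a backspace character and \1 becomes the character with code 1 before the regex engine ever sees them, and sequences such as \s and \d, which only work because Python leaves unknown escapes alone, trigger an invalid-escape warning in recent Python versions.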
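
To illustrate the raw-string point with a sequence that Python itself interprets, here is a small sketch; the sample text and variable names are made up for the example:

```python
import re

# In a normal string literal '\b' is a backspace character (0x08),
# so this pattern looks for backspace + 'cat' + backspace.
plain = re.compile('\bcat\b')
# In a raw literal the backslash reaches the regex engine, where \b
# means "word boundary".
raw = re.compile(r'\bcat\b')

text = 'a cat sat'
print(plain.search(text))          # None
print(raw.search(text).group(0))   # cat

# '\t' works either way: the literal tab character and the regex
# escape \t both match a tab.
print(re.search('Prompt:[ \t]', 'Prompt:\t') is not None)    # True
print(re.search(r'Prompt:[ \t]', 'Prompt:\t') is not None)   # True
```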

To say "there may be whitespace but it cannot be followed by a newline", try something like

re.compile(r'Prompt:(?![ \t]*\n)')

If you want the space(s) to be included in the match, you can put \s* after the assertion.
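
A short sketch of that variant, assuming the whitespace should simply be appended to the match; the test strings are invented for illustration:

```python
import re

# \s* after the assertion pulls any following whitespace into the match.
pattern = re.compile(r'Prompt:(?![ \t]*\n)\s*')

print(repr(pattern.search('Prompt:   Something\n').group(0)))  # 'Prompt:   '
print(pattern.search('Prompt: \n'))                            # None
print(repr(pattern.search('Prompt:Something\n').group(0)))     # 'Prompt:'
```

Because the assertion has already ruled out "optional spaces or tabs, then a newline", the \s* here can only pick up whitespace that is followed by real content.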

tripleee
  • I need it to only match if there is no newline. The space may or may not be there depending on the weather that day, so I can't use + and must allow for an optional space or multiple spaces. – lpiner Jan 02 '18 at 15:11
  • You should probably [edit] your question to explain what you actually want to accomplish. Having us play quiz games by guessing why you wrote things this way wastes everybody's time. – tripleee Jan 02 '18 at 15:13