
I have some text data as follows.

{"Timestamp": "Tue Apr 07 00:32:29 EDT 2015",Title: Indian Herald: India's Latest News, Business, Sport, Weather, Travel, Technology, Entertainment, Politics, Finance <br><br>Product: Gecko<br>CPUs: 8<br>Language: en-GB"}

From the above text, I am extracting only the title (Indian Herald: India's Latest News, Business, Sport, Weather, Travel, Technology, Entertainment, Politics, Finance) using the following regular expression:

appcodename = re.search(r'Title: ((?:(?!<br>).)+)', message).group(1)

I am trying to understand how the above regular expression works.

(?!<br>) is a negative lookahead for <br>

((?:(?!<br>).)+) - what does this mean? Can someone break it down for me? Also, how many capture groups are there in the regular expression?

Wiktor Stribiżew
liv2hak

3 Answers


You do not need such a complicated regex to get the title. Use

Title:\s*(.*?)(?=\s*<br/?>)

See demo

We match Title:, then any whitespace with \s*, then any characters up to <br> or <br/> with (.*?)(?=\s*<br/?>).
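As a quick check, here is a minimal sketch that runs this pattern against the sample text from the question. The variable names (`message`, `title`) are only for illustration.

```python
import re

# Sample text from the question, split across lines only for readability.
message = ('{"Timestamp": "Tue Apr 07 00:32:29 EDT 2015",Title: Indian Herald: '
           "India's Latest News, Business, Sport, Weather, Travel, Technology, "
           'Entertainment, Politics, Finance <br><br>Product: Gecko<br>CPUs: 8'
           '<br>Language: en-GB"}')

# Lazy (.*?) grows one character at a time until the lookahead
# (?=\s*<br/?>) succeeds, so the match stops right before the tag.
match = re.search(r'Title:\s*(.*?)(?=\s*<br/?>)', message)
title = match.group(1)
print(title)
```

Because the lookahead also allows optional whitespace before the tag, the captured title does not include the space that precedes `<br>` in the input.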

As for (?:(?!<br>).)+, it matches 1 or more characters, each of which is not the start of a <br> sequence. There is an SO post where this construction is explained in detail.

Here is an image from regex101 (go to the Regex Debugger tab, then click + on the right) visualizing what that construction does (it checks whether <br> starts at the current position, and if not, consumes the character, backtracks when needed, etc.):

[Image: regex101 Regex Debugger screenshot]

As for the question regarding how many capture groups are there in the regular expression, Title: ((?:(?!<br>).)+) has 1 capturing (((?:(?!<br>).)+)) and 1 non-capturing ((?:(?!<br>).)) groups.
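The group count can be checked from Python itself. The snippet below is a small sketch using the question's sample text; `group_count` and `captured` are illustrative names, not part of the original code.

```python
import re

# Sample text from the question, split across lines only for readability.
message = ('{"Timestamp": "Tue Apr 07 00:32:29 EDT 2015",Title: Indian Herald: '
           "India's Latest News, Business, Sport, Weather, Travel, Technology, "
           'Entertainment, Politics, Finance <br><br>Product: Gecko<br>CPUs: 8'
           '<br>Language: en-GB"}')

pattern = re.compile(r'Title: ((?:(?!<br>).)+)')

# Pattern.groups counts capturing groups only; (?:...) is not counted.
group_count = pattern.groups
captured = pattern.search(message).group(1)

print(group_count)
print(repr(captured))
```

Note that with this pattern the captured text keeps the space that sits before `<br>` in the input, so calling `.strip()` on the result may be desirable.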

Community
Wiktor Stribiżew
  • thanks :) But I am trying to learn, so I would like to know why my regex works as well :) – liv2hak May 21 '15 at 08:19
  • I added the explanation. – Wiktor Stribiżew May 21 '15 at 08:19
  • Also, please check the link I added. – Wiktor Stribiżew May 21 '15 at 08:22
  • You can also check the visualization of what `(?:(?!<br>).)+` is doing. – Wiktor Stribiżew May 21 '15 at 08:30
  • @liv2hak: `.` in `(?:(?!<br>).)+` does not refer to anything, it tells the regex engine to match any character but a newline. However, before matching ("consuming") that character, the regex engine checks if that character is `<`, and if yes, if the next is `b`, and if yes, checks if the next is `r`... and if it finds `<br>`, the `.` pattern does not trigger, the character is not consumed. If it is not found, the character is consumed, the engine goes on matching the substring in our input. – Wiktor Stribiżew May 21 '15 at 22:12

First of all you don't need lookahead here. What you're doing can be done using this simple regex also:

>>> re.search(r'Title: *(.+?) *<br>', message).group(1)
"Indian Herald: India's Latest News, Business, Sport, Weather, Travel, Technology, Entertainment, Politics, Finance"

btw your regex:

Title: ((?:(?!<br>).)+)

uses a negative lookahead (?!<br>), which checks that <br> does not start at the current position before matching each character after the literal text Title: .
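To make that per-character check concrete, here is a rough hand-written imitation of `(?:(?!<br>).)+` as a loop. It is only a sketch of the idea, not how the regex engine is implemented; the helper name `consume_until` and the shortened `sample` string are made up for this illustration.

```python
import re

def consume_until(text, start, stop="<br>"):
    """Imitate (?:(?!<br>).)+ beginning at index `start` of `text`."""
    i = start
    while i < len(text):
        if text.startswith(stop, i):  # the lookahead (?!<br>) would fail here
            break
        if text[i] == "\n":           # "." does not match a newline by default
            break
        i += 1                        # "." consumes one character
    # "+" requires at least one consumed character
    return text[start:i] if i > start else None

# Shortened, hypothetical version of the question's input.
sample = "Title: Indian Herald: Finance <br><br>Product: Gecko"

begin = sample.index("Title: ") + len("Title: ")
by_hand = consume_until(sample, begin)
by_regex = re.search(r'Title: ((?:(?!<br>).)+)', sample).group(1)

print(repr(by_hand))
print(by_hand == by_regex)
```

Both approaches stop at the first position where `<br>` begins, which is why the space just before the tag is still part of the result.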

anubhava

What ((?:(?!<br>).)+) means is:

((?:(?!<br>).)+)
^... Match the regex and capture its match into backreference 1

((?:(?!<br>).)+)
 ^... Match the regex (non capturing group)

((?:(?!<br>).)+)
    ^... Assert that it is not possible to match the regex <br> starting at the current position

((?:(?!<br>).)+)
            ^... Match a single character, that is not a line break character 

((?:(?!<br>).)+)
              ^... Between one and unlimited times
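The same breakdown can be written directly into the pattern with Python's `re.VERBOSE` flag, which allows whitespace and comments inside a regex. This is a sketch with illustrative names (`annotated`, `compact`, `message`); the full pattern from the question, including the leading `Title: `, is used.

```python
import re

# Sample text from the question, split across lines only for readability.
message = ('{"Timestamp": "Tue Apr 07 00:32:29 EDT 2015",Title: Indian Herald: '
           "India's Latest News, Business, Sport, Weather, Travel, Technology, "
           'Entertainment, Politics, Finance <br><br>Product: Gecko<br>CPUs: 8'
           '<br>Language: en-GB"}')

annotated = re.compile(r"""
    Title:[ ]        # literal text, the space is written as a class because VERBOSE skips bare spaces
    (                # capturing group 1 starts
      (?:            # non-capturing group starts
        (?!<br>)     # fail if the tag begins at the current position
        .            # otherwise consume one character (any except a newline)
      )+             # repeat the non-capturing group one or more times
    )                # capturing group 1 ends
""", re.VERBOSE)

compact = re.compile(r'Title: ((?:(?!<br>).)+)')

annotated_title = annotated.search(message).group(1)
compact_title = compact.search(message).group(1)
print(annotated_title == compact_title)
```

The annotated and compact patterns are meant to be equivalent; the verbose form only trades brevity for readability.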
Andie2302