
I'd like to use a Python regular expression to extract the substring between two different characters, > and <.

Here are my example strings:

  1. <h4 id="Foobar:">Foobar:</h4>
  2. <h1 id="Monty">Python<a href="https://..."></a></h1>

My current regular expression is \>(.*)\< and it matches:

  1. Foobar
  2. Python<a href="https://..."></a>

My regex matches the first example correctly but not the second one. I want it to return "Python". What am I missing?

Astrodude11
  • You could try with this: `\>(.*[^>])\<`. Though if you are trying to parse HTML code, then I would recommend using an HTML-parsing library instead, like [`bs4`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) ... – game0ver Aug 13 '18 at 22:10
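
The parser suggestion in the comment above can be sketched without installing anything. This is a minimal example that swaps in the standard library's `html.parser` in place of `bs4`; the class `TextCollector` and the function `first_text` are hypothetical names introduced here for illustration, not part of any library.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; ignore whitespace-only runs.
        if data.strip():
            self.chunks.append(data)


def first_text(fragment):
    """Return the first piece of text in an HTML fragment, or None."""
    parser = TextCollector()
    parser.feed(fragment)
    parser.close()
    return parser.chunks[0] if parser.chunks else None


print(first_text('<h4 id="Foobar:">Foobar:</h4>'))  # Foobar:
print(first_text('<h1 id="Monty">Python<a href="https://..."></a></h1>'))  # Python
```

The parser returns the text exactly as written, so the trailing colon of the first example is kept; if it is unwanted, something like `.rstrip(':')` on the result would drop it.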

2 Answers


Use the expression:

(?<=>)[^<:]+(?=:?<)
  • (?<=>) Positive lookbehind: the match must be immediately preceded by >.
  • [^<:]+ One or more characters other than < or :.
  • (?=:?<) Positive lookahead: the match must be followed by an optional colon : and then <.

In Python:

import re
first_string = '<h4 id="Foobar:">Foobar:</h4>'
second_string = '<h1 id="Monty">Python<a href="https://..."></a></h1>'

print(re.findall(r'(?<=>)[^<:]+(?=:?<)', first_string)[0])
print(re.findall(r'(?<=>)[^<:]+(?=:?<)', second_string)[0])

Prints:

Foobar
Python

Alternatively, you could use the expression:

(?<=>)[a-zA-Z]+(?=\W*<)
  • (?<=>) Positive lookbehind: the match must be immediately preceded by >.
  • [a-zA-Z]+ One or more ASCII letters, lower or upper case.
  • (?=\W*<) Positive lookahead for zero or more non-word characters followed by <.

print(re.findall(r'(?<=>)[a-zA-Z]+(?=\W*<)', first_string)[0])
print(re.findall(r'(?<=>)[a-zA-Z]+(?=\W*<)', second_string)[0])

Prints:

Foobar
Python
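
The two expressions agree on the example strings but not in general, and indexing `findall(...)[0]` raises `IndexError` when nothing matches. The sketch below illustrates both points; `first_match` is a hypothetical helper and `sample` is a made-up input, neither comes from the question.

```python
import re

# The two patterns from this answer, compiled once.
ANY_TEXT = re.compile(r'(?<=>)[^<:]+(?=:?<)')
LETTERS_ONLY = re.compile(r'(?<=>)[a-zA-Z]+(?=\W*<)')


def first_match(pattern, text):
    """Return the first match as a string, or None if there is none."""
    match = pattern.search(text)
    return match.group(0) if match else None


sample = '<p>Hello World 42</p>'  # made-up input for illustration

print(first_match(ANY_TEXT, sample))      # Hello World 42
print(first_match(LETTERS_ONLY, sample))  # None
```

The letters-only pattern finds nothing here because the text between the tags contains spaces and digits, so which expression fits depends on what the tag contents can look like.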
Paolo

What you are missing is that the * quantifier is greedy: combined with ., it matches as many characters as it can. To make the quantifier non-greedy, add ? after it:

\>(.*?)\<

The Python re documentation covers this in the section on *?, +?, ??.
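
As a quick check of the difference, here is a minimal sketch that runs the greedy and non-greedy versions against the second example string from the question, using `re.search` to look at the first match only:

```python
import re

second_string = '<h1 id="Monty">Python<a href="https://..."></a></h1>'

# Greedy: .* runs to the LAST < in the string.
greedy = re.search(r'\>(.*)\<', second_string).group(1)

# Non-greedy: .*? stops at the FIRST < after the opening >.
lazy = re.search(r'\>(.*?)\<', second_string).group(1)

print(greedy)  # Python<a href="https://..."></a>
print(lazy)    # Python
```

On the first example string both versions capture `Foobar:` including the colon, since that string has only one candidate span.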

Roman Yakubovich
  • This is incorrect. The `:` is matched after the `Foobar` substring. It also matches `><`. – Paolo Aug 13 '18 at 22:17
  • Therefore your solution will not get the desired output, see [here](https://regex101.com/r/oTPv8q/1). – Paolo Aug 13 '18 at 22:18
  • @UnbearableLightness the author set his task very clearly in the first sentence - to extract the substring between `>` and `<`, which my solution helps to do in a way the author wanted. So either he made a mistake in the first sentence or in the desired output of the first test input. – Roman Yakubovich Aug 13 '18 at 22:23
  • Your code is still incorrect. `print(re.findall(r'\>(.*?)\<','<h1 id="Monty">Python<a href="https://..."></a></h1>'))` prints `['Python', '', '']`, hence my comment regarding the matching of empty strings. – Paolo Aug 13 '18 at 22:30
  • @UnbearableLightness the author provided his code, I pointed out why it did not work as he was expecting. Dealing with empty strings was not in the question. – Roman Yakubovich Aug 13 '18 at 22:57
  • OP states : `I want it to return "Python"`. Your expression does not. – Paolo Aug 13 '18 at 23:05
  • @UnbearableLightness your previous comment shows that it does. All depends on how this expression is used that is out of the author's question, which is emphasizing the problem with the regular expression itself. – Roman Yakubovich Aug 13 '18 at 23:10
  • It does, but it also matches empty strings because `.*?` is allowed to match zero characters, i.e. your expression is incorrect. – Paolo Aug 13 '18 at 23:15
  • @UnbearableLightness It matches what was requested, it solves (and the answer explains) the issue the author complained about. Dealing with empty strings and other stuff was not in the question. – Roman Yakubovich Aug 13 '18 at 23:22
  • The correct version of your expression is `\>([^<>\n]+)\<`. *Dealing with empty strings and other stuff was not in the question.* The whole point of regular expressions is to match some strings instead of other strings, you are matching empty strings too, therefore your pattern is incorrect/incomplete. – Paolo Aug 14 '18 at 09:48
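
The disagreement in this thread comes down to what `re.findall` returns for each pattern. A minimal sketch comparing the non-greedy expression from the answer with the character-class version from the last comment, on the question's second example string:

```python
import re

second_string = '<h1 id="Monty">Python<a href="https://..."></a></h1>'

# Non-greedy version from the answer: also reports the empty spans in "><".
lazy_matches = re.findall(r'\>(.*?)\<', second_string)

# Character-class version from the comment: + requires at least one character.
strict_matches = re.findall(r'\>([^<>\n]+)\<', second_string)

print(lazy_matches)    # ['Python', '', '']
print(strict_matches)  # ['Python']
```

Neither pattern strips the trailing colon from the first example; both return `Foobar:` for it.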