Extract substring between two different characters using a python regular expression

Question

I'd like to use a Python regular expression to extract the substring between two different characters, > and <.

Here are my example strings:

<h4 id="Foobar:">Foobar:</h4>
<h1 id="Monty">Python<a href="https://..."></a></h1>

My current regular expression is \>(.*)\< and matches:

Foobar
Python<a href="https://..."></a>

My regular expression matches the first example correctly but not the second one. I want it to return "Python". What am I missing?

You could try with this: `\>(.*[^>])\<`. Though if you are trying to parse html code, then I would recommend using an html-parsing library instead, like [`bs4`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) ... — game0ver, Aug 13 '18 at 22:10
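
The parser route suggested in this comment can be sketched roughly as follows. To keep the example free of third-party dependencies it uses the standard library's `html.parser` rather than `bs4`, which is a substitution made here, not what the comment names. `FirstTextExtractor` and `first_text` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class FirstTextExtractor(HTMLParser):
    """Remember the first non-empty text node seen in the fragment."""

    def __init__(self):
        super().__init__()
        self.first_text = None

    def handle_data(self, data):
        # handle_data is called for the text found between tags
        if self.first_text is None and data.strip():
            self.first_text = data.strip()


def first_text(fragment):
    """Hypothetical helper: return the first text node of an HTML fragment."""
    parser = FirstTextExtractor()
    parser.feed(fragment)
    parser.close()
    return parser.first_text


print(first_text('<h4 id="Foobar:">Foobar:</h4>'))
print(first_text('<h1 id="Monty">Python<a href="https://..."></a></h1>'))
```

The parser returns the text node as it is, so the first example keeps its trailing colon (`Foobar:`); something like `rstrip(':')` would be needed to get the output the question asks for.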

Paolo · Accepted Answer · 2018-08-13T22:50:35.770

Use the expression:

(?<=>)[^<:]+(?=:?<)

(?<=>) Positive lookbehind for >.
[^<:]+ Match anything other than < or :.
(?=:?<) Positive lookahead for optional colon :, and <.

You can try the expression live here.

In Python:

import re

first_string = '<h4 id="Foobar:">Foobar:</h4>'
second_string = '<h1 id="Monty">Python<a href="https://..."></a></h1>'

# findall returns a list of all matches; [0] picks the first one
print(re.findall(r'(?<=>)[^<:]+(?=:?<)', first_string)[0])
print(re.findall(r'(?<=>)[^<:]+(?=:?<)', second_string)[0])

Prints:

Foobar
Python
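
Indexing the `findall` result with `[0]` raises `IndexError` when nothing matches. A minimal sketch of a more defensive variant, assuming only the first match is wanted, could use `re.search` instead; `PATTERN` and `first_match` are hypothetical names introduced for illustration.

```python
import re

# Same expression as above, compiled once so it can be reused
PATTERN = re.compile(r'(?<=>)[^<:]+(?=:?<)')


def first_match(text, default=None):
    """Return the first match, or `default` when nothing matches."""
    match = PATTERN.search(text)
    return match.group(0) if match else default


print(first_match('<h4 id="Foobar:">Foobar:</h4>'))  # Foobar
print(first_match('<br><br>'))                        # None
```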

Alternatively, you could use the expression:

(?<=>)[a-zA-Z]+(?=\W*<)

(?<=>) Positive lookbehind for >.
[a-zA-Z]+ One or more lower or upper case letters.
(?=\W*<) Positive lookahead for any non word characters followed by <.

You can test this expression here.

print(re.findall(r'(?<=>)[a-zA-Z]+(?=\W*<)', first_string)[0])
print(re.findall(r'(?<=>)[a-zA-Z]+(?=\W*<)', second_string)[0])

Prints:

Foobar
Python
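
The two expressions agree on the question's examples but not in general. The sketch below compares them on a heading whose text contains a space; the `sample` string is a made-up input, not one from the question.

```python
import re

sample = '<h1>Monty Python</h1>'  # hypothetical input, not from the question

# First expression: anything except < and : may appear in the match
negated_class = re.findall(r'(?<=>)[^<:]+(?=:?<)', sample)

# Second expression: letters only, and only non-word characters may follow up to <
letters_only = re.findall(r'(?<=>)[a-zA-Z]+(?=\W*<)', sample)

print(negated_class)  # ['Monty Python']
print(letters_only)   # []
```

The letters-only version finds nothing here because the run of letters after `>` is followed by a space and then more letters, not by `<`. Which expression fits better depends on what the text between the tags may contain.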

score 0 · Answer 2 · answered Aug 13 '18 at 22:12

What you are missing is the greediness of the * quantifier: combined with . it matches as many characters as it can. To switch the quantifier to non-greedy mode, add ? after it:

\>(.*?)\<

You can read more in the documentation in the section *?, +?, ??.
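
The difference between the two modes can be seen on the question's own strings. This is a small sketch using `re.search`, which returns only the first match.

```python
import re

first_string = '<h4 id="Foobar:">Foobar:</h4>'
second_string = '<h1 id="Monty">Python<a href="https://..."></a></h1>'

# Greedy: .* runs to the end and backtracks to the LAST <
greedy = re.search(r'>(.*)<', second_string).group(1)

# Non-greedy: .*? stops at the FIRST < it can reach
lazy = re.search(r'>(.*?)<', second_string).group(1)

print(greedy)  # Python<a href="https://..."></a>
print(lazy)    # Python

# The non-greedy form still keeps the colon in the first example
print(re.search(r'>(.*?)<', first_string).group(1))  # Foobar:
```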

Roman Yakubovich


This is incorrect. The `:` is matched after the `Foobar` substring. It also matches `><`. – Paolo Aug 13 '18 at 22:17
Therefore your solution will not get the desired output, see [here](https://regex101.com/r/oTPv8q/1). – Paolo Aug 13 '18 at 22:18
@UnbearableLightness the author set his task very clearly in the first sentence - to extract the substring between `>` and `<`, which my solution helps to do in a way the author wanted. So either he made a mistake in the first sentence or in the desired output of the first test input. – Roman Yakubovich Aug 13 '18 at 22:23
Your code is still incorrect. `print(re.findall(r'\>(.*?)\<','<h1 id="Monty">Python<a href="https://..."></a></h1>'))` prints `['Python', '', '']`, hence my comment regarding the matching of empty strings. – Paolo Aug 13 '18 at 22:30
@UnbearableLightness the author provided his code, I pointed out why it did not work as he was expecting. Dealing with empty strings was not in the question. – Roman Yakubovich Aug 13 '18 at 22:57
OP states : `I want it to return "Python"`. Your expression does not. – Paolo Aug 13 '18 at 23:05
@UnbearableLightness your previous comment shows that it does. All depends on how this expression is used that is out of the author's question, which is emphasizing the problem with the regular expression itself. – Roman Yakubovich Aug 13 '18 at 23:10
It does, but it also matches empty strings because `.*?` can match zero characters, i.e. your expression is incorrect. – Paolo Aug 13 '18 at 23:15
@UnbearableLightness It matches what was requested, it solves (and the answer explains) the issue the author complained about. Dealing with empty strings and other stuff was not in the question. – Roman Yakubovich Aug 13 '18 at 23:22
The correct version of your expression is `\>([^<>\n]+)\<`. *Dealing with empty strings and other stuff was not in the question.* The whole point of regular expressions is to match some strings instead of other strings, you are matching empty strings too, therefore your pattern is incorrect/incomplete. – Paolo Aug 14 '18 at 09:48
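
The disagreement in these comments comes down to what `findall` returns for each pattern. A short sketch on the question's two strings, comparing the non-greedy expression from the answer with the negated-class expression from the last comment:

```python
import re

first_string = '<h4 id="Foobar:">Foobar:</h4>'
second_string = '<h1 id="Monty">Python<a href="https://..."></a></h1>'

# Non-greedy: .*? may match zero characters, so adjacent >< pairs yield ''
lazy_all = re.findall(r'\>(.*?)\<', second_string)

# Negated class with +: at least one character is required between > and <
strict_all = re.findall(r'\>([^<>\n]+)\<', second_string)

print(lazy_all)    # ['Python', '', '']
print(strict_all)  # ['Python']

# Neither pattern strips the colon from the first example
print(re.findall(r'\>([^<>\n]+)\<', first_string))  # ['Foobar:']
```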
