Find substrings between two strings

Question

I have a string like this:


string = r'''<img height="233" src="monline/" title="email example" width="500" ..
title="second example title"  width="600"...
title="one more title"...> '''

I am trying to extract everything that appears as a title value (title="Anything here"). I have already tried this, but it does not work correctly:

re.findall(r'title=\"(.*)\"',string)

The requests library using xpath is probably the way to go: https://pypi.org/project/requests-html/ — Plato77, Feb 12 '20 at 15:50
[Parsing HTML with regex is a hard job](https://stackoverflow.com/a/4234491/372239) HTML and regex are not good friends. Use a parser, it is simpler, faster and much more maintainable. — Toto, Feb 12 '20 at 18:03

lwileczek · Accepted Answer · 2020-02-12T17:35:38.057

2

I think your regex is too greedy. You can try something like this:

re.findall(r'title=\"(?P<title>[\w\s]+)\"', string)
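To illustrate the greediness point, here is a minimal sketch comparing the character class above with two other common ways of stopping at the first closing quote: a lazy quantifier and a negated character class. The `sample` string and the variable names are made up for this illustration and are not from the question.

```python
import re

# Hypothetical sample; the second title contains punctuation on purpose.
sample = 'title="email example" width="500" title="50% off, today!" width="600"'

# Character class from the answer: word characters and whitespace only,
# so a title containing punctuation is skipped.
word_only = re.findall(r'title="([\w\s]+)"', sample)

# Lazy quantifier: match as little as possible up to the next quote.
lazy = re.findall(r'title="(.*?)"', sample)

# Negated class: anything except a double quote.
negated = re.findall(r'title="([^"]*)"', sample)

print(word_only)  # ['email example']
print(lazy)       # ['email example', '50% off, today!']
print(negated)    # ['email example', '50% off, today!']
```

The `[\w\s]+` version is enough for the titles in the question; the other two are worth considering if titles may contain punctuation.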

As @Austin and @Plato77 said in the comments, there is a better way to parse HTML in Python. See other SO answers for more context. There are a few common tools for this like:

If you would like to read more about performance testing of different Python HTML parsers, you can learn more here
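As a rough sketch of the parser approach, the example below uses `html.parser` from the Python standard library. That module is my choice for a dependency-free illustration and is not necessarily one of the tools meant above. The `TitleCollector` class and the `html` sample are hypothetical, and the sample assumes well-formed markup with one title per tag, unlike the fragment in the question.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the value of every title attribute found on a start tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name == "title" and value is not None:
                self.titles.append(value)


# Hypothetical well-formed markup, one title attribute per tag.
html = (
    '<img src="a.png" title="email example" width="500">'
    '<img src="b.png" title="second example title" width="600">'
)

collector = TitleCollector()
collector.feed(html)
collector.close()
print(collector.titles)  # ['email example', 'second example title']
```

A parser handles quoting and attribute order itself, so the extraction does not depend on how the titles are written.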

edited Feb 12 '20 at 17:35

answered Feb 12 '20 at 16:00


Thanks, this works fine! – Mahhos Feb 12 '20 at 16:16
@mahhos, I'm glad this answer was useful. Please accept answers as correct once your issue has been solved. [Learn how](https://meta.stackexchange.com/questions/23138/how-to-accept-the-answer-on-stack-overflow) – lwileczek Feb 12 '20 at 17:37

score 0 · Answer 2 · answered Feb 12 '20 at 16:39

As @Austin and @Plato77 said in the comments, there is a better way to parse HTML in Python. I stand by this too, but if you want to get it done with a regex, this may help:

c = re.finditer(r'title="([a-zA-Z0-9\s]+)"', string)

for i in c:
    print(i.group(1))

score 0 · Answer 3 · answered Feb 12 '20 at 16:42

The problem here is that `.*` is greedy: the next " symbol is treated as an ordinary character and becomes part of the (.*) group, so the match only ends at the last " it can reach. For your use case, you can restrict the group to letters, digits, and spaces.
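As a rough illustration of this behaviour, the sketch below runs the greedy pattern on the string from the question and then a restricted character class. The names `greedy` and `restricted` are just for this illustration. Since `.` does not match a newline by default, each greedy match runs to the last quote on its own line.

```python
import re

# The string from the question, copied as-is.
string = r'''<img height="233" src="monline/" title="email example" width="500" ..
title="second example title"  width="600"...
title="one more title"...> '''

# Greedy: .* runs to the last double quote on each line.
greedy = re.findall(r'title="(.*)"', string)
print(greedy[0])  # email example" width="500

# Restricted: letters, digits and spaces only, so the match
# has to stop at the first closing quote.
restricted = re.findall(r'title="([A-Za-z0-9 ]+)"', string)
print(restricted)  # ['email example', 'second example title', 'one more title']
```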

Almostapha
