0

I have a string like this:


string = r'''<img height="233" src="monline/" title="email example" width="500" ..
title="second example title"  width="600"...
title="one more title"...> '''

I am trying to get everything that appears as a title (title="Anything here"). I have already tried this, but it does not work correctly:

re.findall(r'title=\"(.*)\"',string)
arne
Mahhos
  • Regex is not a nice way to parse HTML. Use an HTML parser. – Austin Feb 12 '20 at 15:47
  • The requests library using xpath is probably the way to go: https://pypi.org/project/requests-html/ – Plato77 Feb 12 '20 at 15:50
  • [Parsing HTML with regex is a hard job](https://stackoverflow.com/a/4234491/372239) HTML and regex are not good friends. Use a parser, it is simpler, faster and much more maintainable. – Toto Feb 12 '20 at 18:03

3 Answers

2

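I think your regex is too greedy: .* matches as much as it can, so the capture runs to the last " on the line instead of stopping at the end of the title. You can try something like this: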

re.findall(r'title=\"(?P<title>[\w\s]+)\"', string)
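To make the difference concrete, here is a minimal sketch comparing the greedy pattern from the question with the character-class pattern above. The sample string is an assumption: it is the string from the question with the elided parts dropped, not the asker's real data.

```python
import re

# Hypothetical sample: the question's string with the elided parts removed.
sample = '''<img height="233" src="monline/" title="email example" width="500"
title="second example title"  width="600"
title="one more title">'''

# Greedy: .* keeps going up to the LAST double quote on each line.
greedy = re.findall(r'title="(.*)"', sample)

# Character class: the capture ends at the first character that is
# neither a word character nor whitespace, i.e. at the closing quote.
restricted = re.findall(r'title="(?P<title>[\w\s]+)"', sample)

print(greedy)      # first entry also contains: " width="500
print(restricted)  # ['email example', 'second example title', 'one more title']
```

Note that [\w\s]+ only works while titles consist of letters, digits, underscores, and whitespace; a title containing punctuation would not be matched.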

As @Austin and @Plato77 said in the comments, there are better ways to parse HTML in Python than regex. See other SO answers for more context. There are a few common tools for this.

If you would like to read more on performance testing of different Python HTML parsers, you can learn more here.
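As an illustration of the parser approach, here is a minimal sketch using html.parser from the Python standard library. This is one possible choice, not necessarily one of the tools the answer had in mind. The class name TitleCollector and the sample markup are hypothetical; the sample assumes well-formed tags, unlike the truncated string in the question.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the value of every title attribute found on a start tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        for name, value in attrs:
            if name == "title" and value is not None:
                self.titles.append(value)


# Hypothetical, well-formed sample markup (an assumption for this sketch).
html = '''<img height="233" src="monline/" title="email example" width="500">
<img title="second example title" width="600">
<img title="one more title">'''

collector = TitleCollector()
collector.feed(html)
collector.close()
print(collector.titles)
```

Because the parser handles attribute quoting itself, titles containing punctuation are picked up without any changes to the code.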

lwileczek
  • Thanks this works fine! – Mahhos Feb 12 '20 at 16:16
  • @mahhos, I'm glad this answer was useful. Please accept answers as correct once your issue has been solved. [Learn how](https://meta.stackexchange.com/questions/23138/how-to-accept-the-answer-on-stack-overflow) – lwileczek Feb 12 '20 at 17:37
0

As @Austin and @Plato77 said in the comments, there is a better way to parse HTML in Python. I agree with that, but if you want to get it done with regex, this may help:

c = re.finditer(r'title=[\"]([a-zA-Z0-9\s]+)[\" ]', string)

for i in c:
    print(i.group(1))
ibrahim
0

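The problem here is that . also matches the " character, so the closing quote of the title is treated as an ordinary character and becomes part of the (.*) group of your RE, which then runs on to the last " on the line. For your use case, you can restrict the group to the characters you expect, such as letters and numbers.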
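If titles may contain punctuation, two common alternatives (not taken from the answer above) are a negated character class, which matches anything except a double quote, and a non-greedy quantifier. A minimal sketch, with a made-up sample string:

```python
import re

# Hypothetical sample with punctuation inside one of the titles.
sample = 'title="email example" width="500" title="Q&A: part 2"'

# Negated class: capture everything up to the next double quote.
negated = re.findall(r'title="([^"]*)"', sample)

# Non-greedy quantifier: .*? stops at the first closing quote it can.
lazy = re.findall(r'title="(.*?)"', sample)

print(negated)  # ['email example', 'Q&A: part 2']
print(lazy)     # ['email example', 'Q&A: part 2']
```

Both variants assume the attribute value is wrapped in double quotes and contains no double quotes itself.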

Almostapha