I'd like to retrieve the content and href link from an HTML tag in Python.

I'm a beginner with regex and can retrieve the href value this way:

urls = re.findall('<a class="title" href="(.*?)" title', page)

When I try to extract the tag's content as well, I get nothing:

urls = re.findall('<a class="title" href="(.*?)" title>(.*?)</a>', page)

What is the right way to do this?

Thanks in advance.

thatWiseGuy
aajjbb

2 Answers

The right way to do this is to use a parser like Beautiful Soup. Trying to parse HTML with regexes is hell, and you won't get very far before you hit a wall.
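
As a rough illustration of the parser approach, here is a minimal sketch. Beautiful Soup is a third-party package, so this sketch swaps in the standard library's html.parser instead; it is not Beautiful Soup's API. The class name TitleLinkParser and the sample markup are hypothetical, since the asker's real page was not shown. With Beautiful Soup installed, something like soup.find_all("a", class_="title") should give a shorter equivalent.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a> elements whose class includes "title"."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._in_link = False  # True while inside a matching <a> element
        self._href = None      # href of the link currently being read
        self._text = []        # text chunks seen inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "title" in classes:
            self._in_link = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> is collected as well.
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_link = False


# Hypothetical sample markup, assumed for illustration only.
page = '''
<div>
  <a class="title" href="/first" title="First post">First <b>post</b></a>
  <a href="/other">Not a title link</a>
  <a class="title" href="/second"
     title="Second post">Second</a>
</div>
'''

parser = TitleLinkParser()
parser.feed(page)
parser.close()
links = parser.links
print(links)
```

Unlike a fixed regex, this does not depend on attribute order, on line breaks inside the tag, or on nested markup inside the link text.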

Turn

This worked for me to get the URLs from heise.de:

urls = re.findall('<a .*?href="(.*?)".*?>', page)

There may be a simpler way to express this.

To also retrieve the tag content:

urls = re.findall('<a .*?href="(.*?)".*?>(.*?)</a>', page)

I'm not sure what the second "title" in your regex is for. Perhaps you could post a sample of the text that fails to match; then I could explain more precisely why your regex does not work.
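
To compare the two patterns above, here is a small sketch run against made-up markup. The sample page is an assumption (the real page was not posted); it supposes that the title attribute carries a value, which would explain why the literal text ' title>' in the stricter pattern never matches.

```python
import re

# Hypothetical sample markup; the asker's real page was not shown.
page = (
    '<a class="title" href="/first" title="First post">First</a>\n'
    '<a class="title" href="/second" title="Second post">Second</a>\n'
)

# Pattern from the question: it needs the literal text ' title>' right after
# the href value, but here the attribute continues as title="...">.
strict = re.findall('<a class="title" href="(.*?)" title>(.*?)</a>', page)

# Pattern from this answer: '.*?>' lazily skips any remaining attributes
# up to the end of the opening tag.
loose = re.findall('<a .*?href="(.*?)".*?>(.*?)</a>', page)

print(strict)  # []
print(loose)   # [('/first', 'First'), ('/second', 'Second')]
```

Note that '.' does not match newlines by default, so a tag or link text that spans several lines would still be missed unless re.DOTALL is passed.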

Klaus