Python URL data extraction with regex

Question

I'd like to retrieve both the text content and the href link of an HTML tag in Python.

I'm a beginner with regex and can already retrieve the href value this way:

urls = re.findall('<a class="title" href="(.*?)" title', page)

When I try to extract the tag's content as well, I get nothing:

urls = re.findall('<a class="title" href="(.*?)" title>(.*?)</a>', page)

What is the right way to do this?

Thanks in advance.

Did you try using BeautifulSoup? https://pypi.python.org/pypi/BeautifulSoup - Iron Fist, Dec 29 '15 at 22:29
@KamyarGhasemlou It's not, because there it doesn't care about the tag's content. - aajjbb, Dec 29 '15 at 22:31
Is using an HTML parser feasible for a small snippet like this one? - aajjbb, Dec 29 '15 at 22:31
Do you mean the URL together with the tag's content? (Normally, is the tag, so I got a bit confused by your answer.) - Kamyar Ghasemlou, Dec 29 '15 at 22:35

score 4 · Accepted Answer · answered Dec 29 '15 at 22:29

The right way to do this is to use a parser like Beautiful Soup. Trying to parse HTML with regexes is hell, and you won't get very far before you hit a wall.
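
As a rough illustration of the parser approach, here is a minimal sketch that uses the standard library's html.parser instead of Beautiful Soup, so it runs without installing anything. The LinkCollector class and the sample page string are made up for this example; the real page from the question is not shown.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside an <a> that had an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical markup standing in for the real page.
page = '<p><a class="title" href="/news/1" title="First">First <b>story</b></a></p>'

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

Unlike a regex, this keeps working when attributes change order or the link text contains nested tags. With Beautiful Soup installed, the equivalent would be along the lines of BeautifulSoup(page, "html.parser").find_all("a", class_="title"), reading a["href"] and a.get_text() from each result.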

Turn

Klaus · Answer 2 · 2015-12-29T22:49:01.460

This worked for me to get the URLs from heise.de:

urls = re.findall('<a .*?href="(.*?)".*?>', page)

It can probably be expressed more simply.

To retrieve the tag content as well:

urls = re.findall('<a .*?href="(.*?)".*?>(.*?)</a>', page)
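
To see what the two patterns above return, here is a small self-contained run. The page string is invented for the example (it is not the heise.de markup), and the variable names hrefs and links are only illustrative.

```python
import re

# Hypothetical markup; the real page is not shown in the thread.
page = (
    '<a class="title" href="/news/1" title="First">First story</a>\n'
    '<a href="/about">About</a>'
)

# First pattern: one capture group, so findall returns a list of href strings.
hrefs = re.findall(r'<a .*?href="(.*?)".*?>', page)

# Second pattern: two capture groups, so findall returns (href, content) tuples.
links = re.findall(r'<a .*?href="(.*?)".*?>(.*?)</a>', page)

print(hrefs)
print(links)
```

Note that `.` does not match a newline by default, so a link whose text spans several lines would be missed unless re.DOTALL is passed, and any tags nested inside the link text end up in the captured content unchanged.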

I really do not know what the second title is doing in your regex. Perhaps you can also give an example of text that does not match; then I can give you a better answer as to why your regex does not work.
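
One possible explanation, sketched under the assumption that the real links carry a title attribute with a value (the question does not show the markup, so this is a guess): the pattern from the question requires the literal text title> right after the href, which never occurs if the tag actually reads title="...">. The sample page string and the adjusted pattern below are hypothetical.

```python
import re

# Assumed shape of a link on the page; not taken from the question.
page = '<a class="title" href="/news/1" title="First">First story</a>'

# Pattern from the question: needs the literal text ' title>' after the href.
original = re.findall('<a class="title" href="(.*?)" title>(.*?)</a>', page)

# Adjusted pattern: lets the title attribute carry a quoted value.
adjusted = re.findall(r'<a class="title" href="(.*?)" title="[^"]*">(.*?)</a>', page)

print(original)
print(adjusted)
```

Under this assumption, the href-only pattern from the question still matches, because it stops at title and never reaches the closing >.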
