I need to get the title of links from a webpage. The links may look like
< a href="http://xxxx">Some text< /a>
or
< a href="http://xxxx"><div> < image> < /image> < div> < /a>
There may be other kinds of links you can imagine, but these two are the most common ones I see. I added some spaces so that the page does not render them as links.
I need to get all the "Some text" parts.
msg is the source code of a webpage. I have written the code as
import re
titleregex = re.compile(r'<a\s*href="http.*?[\'"].*?>(.+?)</a>')
titles = titleregex.findall(str(msg))
The code successfully deals with the first type of link but not the second: for those, the captured text still contains the inner tags. Can anyone help me delete all the <xxx> tags?
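
To illustrate the result I am after, here is a minimal sketch and not a finished solution. The sample string, the second link's "Other text", and the names link_regex, tag_regex and strip_tags are all made up for illustration; the real input would be msg. It assumes that removing everything of the form <xxx> from the captured link body is good enough, which may not hold for messy real-world HTML.

```python
import re

# Hypothetical sample input; in my real code this would be str(msg).
sample = (
    '<a href="http://xxxx">Some text</a>'
    '<a href="http://xxxx"><div><img src="x.png">Other text</div></a>'
)

# Capture everything between the opening <a ...> and the closing </a>.
# DOTALL lets the link body span several lines.
link_regex = re.compile(r'<a\s+href=["\']http[^"\']*["\'][^>]*>(.+?)</a>', re.DOTALL)

# Anything that looks like <xxx>.
tag_regex = re.compile(r'<[^>]+>')

def strip_tags(fragment):
    """Remove every <xxx> from the fragment and trim surrounding whitespace."""
    return tag_regex.sub('', fragment).strip()

titles = [strip_tags(body) for body in link_regex.findall(sample)]
print(titles)
```

With this approach a link that contains only an image and no text would give an empty string, so those entries would probably need to be filtered out afterwards.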