I need to get the titles of the links on a webpage. The links may look like

< a href="http://xxxx">Some text< /a>

or

< a href="http://xxxx"><div> < image> < /image> < div> < /a>

There may be other kinds of links, but these two are the most common ones I see. I added spaces inside the tags so that the page does not render them as links.

I need to get the "Some text" part of every link. msg is the source of a webpage. I have written the following code:

titleregex=re.compile('<a\s*href="http.*?[\'"].*?>(.+?)</a>')
titles = titleregex.findall(str(msg))

The code successfully handles the first type of link but not the second. Can anyone help me remove all the <xxx> tags?

Alan Moore

2 Answers

Use this pattern:

href\s*=\s*\"*[^\">]*

And these flags:

re.IGNORECASE, re.I  
re.MULTILINE, re.M

Refer to this URL; it should help.
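
As a rough illustration of how the pattern and flags above could be applied, here is a minimal sketch. The sample string `msg` and the name `href_re` are made up for the example, and the pattern is written as a raw string with a plain `"` in place of `\"`, which matches the same text. Note that this pattern matches the href attribute itself, not the link text, and that re.MULTILINE has no visible effect here because the pattern contains no `^` or `$`.

```python
import re

# Pattern from the answer above, written as a raw string.
href_re = re.compile(r'href\s*=\s*"*[^">]*', re.IGNORECASE | re.MULTILINE)

# Hypothetical sample markup (without the extra spaces used in the question).
msg = '<a HREF="http://xxxx">Some text</a>'

# re.IGNORECASE lets the lowercase pattern match the uppercase attribute name.
matches = href_re.findall(msg)
print(matches)  # ['HREF="http://xxxx']
```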

Jaykumar Patel

You need to escape the quotes properly.

>>> import re
>>> s = """< a href="http://xxxx"><div> < image> < /image> < div> < /a>
... < a href="http://xxxx">Some text< /a>"""
>>> re.findall(r"< a\s*href=['\"]http.*?['\"][^<>]*>([^<>]*)<\s*/a>", s)
['Some text']

OR

It seems like you're trying to remove all the tags.

>>> s = '< a href="http://xxxx">Some text< /a>'
>>> re.sub(r'<[^<>]*>', r'', s)
'Some text'
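
The two ideas above can also be combined to cover the second type of link: first capture everything between the opening and closing anchor tags, then strip any nested tags from what was captured. The following is only a sketch under some assumptions: the sample `msg`, the URLs, and the names `anchor_re` and `tag_re` are invented for the example, and the markup is assumed to have no spaces inside the tags, as on a real page. Regular expressions are fragile on arbitrary HTML, so treat this as a starting point rather than a general solution.

```python
import re

# Hypothetical sample page covering both link types from the question.
msg = '''<a href="http://xxxx">Some text</a>
<a href="http://yyyy"><div><img src="pic.png">Other text</div></a>'''

# Capture everything between <a ...> and </a>; re.S lets '.' span newlines.
anchor_re = re.compile(
    r'<a\s[^>]*href=["\']http[^"\']*["\'][^>]*>(.*?)</a>',
    re.I | re.S,
)

# Same tag-stripping pattern as in the answer above.
tag_re = re.compile(r'<[^<>]*>')

# Strip nested tags from each captured body and trim surrounding whitespace.
titles = [tag_re.sub('', inner).strip() for inner in anchor_re.findall(msg)]
print(titles)  # ['Some text', 'Other text']
```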
Avinash Raj