How to get the link title from an <a> element and delete all tags?

Question

I need to get the title of links from a webpage. The links may look like

< a href="http://xxxx">Some text< /a>

or

< a href="http://xxxx"><div> < image> < /image> < div> < /a>

There may be other kinds of links you can imagine, but these two are the most common ones I see. I added some spaces so that the page does not render them as links.

I need to get the "Some text" part of every link. msg is the HTML source of a webpage. I have written this code:

titleregex=re.compile('<a\s*href="http.*?[\'"].*?>(.+?)</a>')
titles = titleregex.findall(str(msg))

The code successfully handles the first type of link but not the second. Can anyone help me delete all the <xxx> tags?

What's your expected output? Did you want to delete or retrieve? — Avinash Raj, Oct 31 '14 at 05:37
How about this logic: if after `>` you find `<`, ignore it; but if after `>` you find something else, take the characters until you come across `<`. — codePG, Oct 31 '14 at 05:42
@AvinashRaj as I mentioned, I want to get all the "Some text" parts, which are the titles of the URLs. But sometimes when I find an href there is no "Some text", only an image or something else, and I do not know how to get rid of those. — 3414314341, Oct 31 '14 at 05:45
I really like this answer http://stackoverflow.com/a/1732454/4091324 — Darth Kotik, Nov 05 '14 at 15:05
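
The answer linked in the last comment makes the point that regular expressions are a poor fit for parsing HTML. A minimal sketch of the parser-based alternative, using Python's standard html.parser module, might look like the following. The class name LinkTextParser and the sample page are made up for illustration, and the sample uses real tags without the extra spaces added in the question.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the visible text inside every <a href=...> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.titles = []        # non-empty link texts, in document order
        self._in_link = False   # True while inside an <a href=...> element
        self._chunks = []       # text fragments of the current link

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self._in_link = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_link:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            text = "".join(self._chunks).strip()
            if text:  # skip links that contain only images or empty markup
                self.titles.append(text)
            self._in_link = False


page = ('<a href="http://xxxx">Some text</a>'
        '<a href="http://yyyy"><div><img src="p.png"></div></a>')
parser = LinkTextParser()
parser.feed(page)
print(parser.titles)
```

Links that contain only an image produce no text, so they are simply left out of the result.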

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

Use this pattern:

href\s*=\s*\"*[^\">]*

with these flags:

re.IGNORECASE, re.I
re.MULTILINE, re.M

Refer to this URL; it should help you.
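
As a rough sketch of how that pattern and those flags could be used together: the capture group is an addition (an assumption, not part of the pattern above) so that only the URL is returned, and the sample string is made up. Note that this extracts the href value, not the link text, and that re.MULTILINE only changes how ^ and $ behave, so it has no effect on this particular pattern.

```python
import re

# Pattern from the answer, with a capture group added (assumption) around the URL part.
href_regex = re.compile(r'href\s*=\s*"*([^">]*)', re.IGNORECASE | re.MULTILINE)

# Made-up sample: mixed case and spaces around "=" are both accepted.
sample = '<a HREF="http://xxxx">Some text</a>\n<a href = "http://yyyy"><div></div></a>'

urls = href_regex.findall(sample)
print(urls)  # ['http://xxxx', 'http://yyyy']
```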

answered Oct 31 '14 at 05:51

Jaykumar Patel

Avinash Raj · Accepted Answer · 2014-10-31T06:26:16.973

You need to escape the quotes properly.

>>> import re
>>> s = """< a href="http://xxxx"><div> < image> < /image> < div> < /a>
... < a href="http://xxxx">Some text< /a>"""
>>> re.findall(r"< a\s*href=['\"]http.*?['\"][^<>]*>([^<>]*)<\s*/a>", s)
['Some text']

OR

Seems like you're trying to remove all the tags.

>>> s = '< a href="http://xxxx">Some text< /a>'
>>> re.sub(r'<[^<>]*>', r'', s)
'Some text'
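
The two ideas above can also be combined: capture everything between the opening and closing anchor tags, then strip any nested tags from it. The following is only a sketch under some assumptions: the helper name link_titles and the sample page are invented, and the sample uses real tags without the extra spaces from the question.

```python
import re

# Capture the inner HTML of each <a ... href="http..."> element.
anchor_regex = re.compile(
    r"""<a\s[^<>]*?href=['"]http[^'"]*['"][^<>]*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)
# Same tag-stripping pattern as in the re.sub call above.
tag_regex = re.compile(r'<[^<>]*>')


def link_titles(html):
    """Return the non-empty text of every matched link, with nested tags removed."""
    titles = []
    for inner in anchor_regex.findall(html):
        text = tag_regex.sub('', inner).strip()
        if text:  # links holding only an image end up empty and are skipped
            titles.append(text)
    return titles


page = ('<a href="http://xxxx">Some text</a>\n'
        '<a href="http://yyyy"><div><img src="pic.png"></div></a>\n'
        '<a href="http://zzzz"><span>More</span> text</a>')
result = link_titles(page)
print(result)  # ['Some text', 'More text']
```

Stripping the tags after the match, instead of excluding them inside the capture group, lets the same pattern handle both link types from the question.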

edited Oct 31 '14 at 06:26

answered Oct 31 '14 at 06:11

Avinash Raj

It does not work on my computer. Could you just tell me how to delete all the tags, as I asked? – 3414314341 Oct 31 '14 at 06:22
Use this regex `<[^<>]*>` and replace every match with an empty string. – Avinash Raj Oct 31 '14 at 06:24
