How to extract href starting with

Question

I need to extract the href from HTML documents. Most of them contain a single href, so the regex I have works, but when there is more than one (as in the example below), I get the wrong one (the email address). Is there a way to extract only the href that starts with 'http://...' and does not contain an email address?

The regex I'm using is:

<a\s+(?:[^>]*?\s+)?href={"}([^ {"}]*){"}

The two hrefs I have are these (I need the first one):

<a style='color: black; text-decoration: none; border: 2px solid black; padding: 13px; width: 220px; display: block; text-align: center; margin: 20px 0; font-size: 15px; font-weight: bold;' href='http://ggg.gggg.com/ls/click?upn=ggg'>Verify my account</a>

<a href="mailto:noreply@ggg.com">noreply@ggg.com</a>

Do not use regex to extract data from HTML. Use a proper HTML/XML parser and get your data. — Aleks G, Apr 30 '20 at 11:11
Obligatory link: [**H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ**](https://stackoverflow.com/a/1732454/1954610) — Tom Lord, Apr 30 '20 at 11:18
**Lazy** solution: Don't bother trying to cleverly parse the HTML; just use regex to look for "things that look like URLs". For example: [this](https://stackoverflow.com/a/3809435/1954610). **Proper** solution: Don't use regex. Use an HTML parser. — Tom Lord, Apr 30 '20 at 11:21
Thanks, where can I find a guide to doing this with an HTML parser? — kobika, Apr 30 '20 at 11:36

score 0 · Answer 1 · answered Apr 30 '20 at 11:34

Can you try this regex:

/(?!.*\@)http:\/\/.{1,}(?=.\.com).{1,}$/

It basically excludes @ and requires a .com in the match.

Filipe Costa

Thanks @Filipe, but what if the required link contains an @? Maybe I should exclude .com instead, since that is what I don't want? – kobika Apr 30 '20 at 13:12
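
One way to sidestep the concern about @ appearing in a wanted link is to filter on the URL scheme rather than on the @ character. The following is only a minimal sketch using Python's standard `urllib.parse`; the helper name `is_web_link` and the first sample href (with an @ in its query string) are hypothetical and were not part of the question.

```python
from urllib.parse import urlsplit

def is_web_link(href):
    """Return True when the href uses the http or https scheme."""
    return urlsplit(href).scheme in ("http", "https")

hrefs = [
    "http://ggg.gggg.com/ls/click?upn=user@example.com",  # hypothetical link containing @
    "mailto:noreply@ggg.com",                             # mailto link from the question
]

# Keep only hrefs whose scheme is http or https; mailto: links are dropped.
web_links = [h for h in hrefs if is_web_link(h)]
print(web_links)
```

With this approach a mailto: link is rejected because of its scheme, so an @ inside an http link does not matter.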

score 0 · Answer 2 · answered May 01 '20 at 01:24

This extracts links that start with http. Note that some links are relative paths, which do not start with http, so they will not be matched.

reg = r'<a[\s]+[^>]*?href[\s]*=[\s\'"]*(?P<url>http.*?)[\'"\s>]'
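
As a rough usage sketch, the pattern can be applied with Python's `re` module as shown here. The pattern is written as a raw string, the names `LINK_RE` and `urls` are illustrative, and the style attribute in the sample HTML is shortened from the question's example.

```python
import re

# The pattern from this answer, as a raw string so the backslashes
# reach the regex engine unchanged.
LINK_RE = re.compile(r'<a[\s]+[^>]*?href[\s]*=[\s\'"]*(?P<url>http.*?)[\'"\s>]')

html = '''
<a style='color: black; text-decoration: none;' href='http://ggg.gggg.com/ls/click?upn=ggg'>Verify my account</a>
<a href="mailto:noreply@ggg.com">noreply@ggg.com</a>
'''

# Collect the named group "url" from every match.
urls = [m.group('url') for m in LINK_RE.finditer(html)]
print(urls)  # ['http://ggg.gggg.com/ls/click?upn=ggg']
```

The mailto: link is skipped because the captured value must begin with http.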

You can also use lxml, BeautifulSoup, SimplifiedDoc and other libraries to extract data. Here is an example.

from simplified_scrapy import SimplifiedDoc
html = '''
<a style='color: black; text-decoration: none; border: 2px solid black; padding: 13px; width: 220px; display: block; text-align: center; margin: 20px 0; font-size: 15px; font-weight: bold;' href='http://ggg.gggg.com/ls/click?upn=ggg'>Verify my account</a>
<a href="mailto:noreply@ggg.com">noreply@ggg.com</a>
'''
doc = SimplifiedDoc(html)
lst = doc.selects('a').notContains('mailto:',attr='href').href
print(lst)

Result:

['http://ggg.gggg.com/ls/click?upn=ggg']

Here are more examples. https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
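
If installing a third-party library is not an option, a similar result can probably be obtained with the standard library's `html.parser`. This is only a sketch: the class name `HttpLinkCollector` is made up for illustration, and it assumes that only hrefs starting with http:// or https:// are wanted.

```python
from html.parser import HTMLParser

class HttpLinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with http:// or https://."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes that have no value at all
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)

html = '''
<a style='color: black; text-decoration: none;' href='http://ggg.gggg.com/ls/click?upn=ggg'>Verify my account</a>
<a href="mailto:noreply@ggg.com">noreply@ggg.com</a>
'''

collector = HttpLinkCollector()
collector.feed(html)
print(collector.links)  # ['http://ggg.gggg.com/ls/click?upn=ggg']
```

Because the parser handles both single-quoted and double-quoted attributes, no special casing of quote styles is needed.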
