0

I need to extract an href from HTML documents. Most of them have a single href, so the regex I have works, but when there is more than one (as in the example below), I get the wrong one (the email address). Is there a way to extract only the href that starts with 'http://...' and is not an email address template?

The regex I'm using is:

<a\s+(?:[^>]*?\s+)?href={"}([^ {"}]*){"}

The two hrefs I have are (I need the first one):

<a style='color: black; text-decoration: none; border: 2px solid black; padding: 13px; width: 220px; display: block; text-align: center; margin: 20px 0; font-size: 15px; font-weight: bold;' href='http://ggg.gggg.com/ls/click?upn=ggg'>Verify my account</a>

<a href="mailto:noreply@ggg.com">noreply@ggg.com</a>
Wiktor Stribiżew
kobika
  • 4
    Do not use regex to extract data from HTML. Use a proper HTML/XML parser and get your data. – Aleks G Apr 30 '20 at 11:11
  • 2
    Obligatory link: [**H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ**](https://stackoverflow.com/a/1732454/1954610) – Tom Lord Apr 30 '20 at 11:18
  • 1
    **Lazy** solution: Don't bother trying to cleverly parse the HTML; just use regex to look for "things that look like URLs". For example: [this](https://stackoverflow.com/a/3809435/1954610). **Proper** solution: Don't use regex. Use an HTML parser. – Tom Lord Apr 30 '20 at 11:21
  • Thanks, where can I find a guide on doing this with an HTML parser? – kobika Apr 30 '20 at 11:36

2 Answers

0

Can you try this regex:

/(?!.*\@)http:\/\/.{1,}(?=.\.com).{1,}$/

It basically rejects any text containing @ and requires a .com to be present for the match to succeed.
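To see how this pattern behaves, here is a small Python check. It is only a sketch: the `/` delimiters and the escaping of `/` and `@` are dropped because Python does not need them, and the pattern is applied to individual href values rather than to the whole document, since `$` anchors the match to the end of the string. The third candidate is a hypothetical URL added to illustrate the case where the wanted link itself contains @.

```python
import re

# The pattern from this answer, without the JavaScript-style / delimiters.
pattern = re.compile(r'(?!.*@)http://.{1,}(?=.\.com).{1,}$')

candidates = [
    "http://ggg.gggg.com/ls/click?upn=ggg",   # the wanted link
    "mailto:noreply@ggg.com",                 # no http:// prefix
    "http://ggg.gggg.com/ls/click?user=a@b",  # hypothetical link containing @
]

# Keep only the candidates the pattern accepts.
matches = [c for c in candidates if pattern.search(c)]
print(matches)  # ['http://ggg.gggg.com/ls/click?upn=ggg']
```

Note that the hypothetical third link is rejected as well, because the negative lookahead refuses any string with an @ after the start of the match.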

  • Thanks @Filipe, but what if the required link contains an @? Maybe I should exclude .com instead, since that is what I don't want? – kobika Apr 30 '20 at 13:12
0

This regex extracts links that start with http. Note that some links are relative paths, which do not start with http and therefore will not be matched.

reg = r'<a[\s]+[^>]*?href[\s]*=[\s\'"]*(?P<url>http.*?)[\'"\s>]'
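A minimal sketch of how this pattern could be applied with Python's `re` module. The sample HTML is the one from the question with the long style attribute shortened, and the pattern is written as a raw string so that `\s` is not treated as a string escape; the variable names `html` and `urls` are just illustrative.

```python
import re

# Sample from the question (style attribute shortened for readability).
html = '''
<a style='color: black; text-decoration: none;' href='http://ggg.gggg.com/ls/click?upn=ggg'>Verify my account</a>
<a href="mailto:noreply@ggg.com">noreply@ggg.com</a>
'''

reg = r'<a[\s]+[^>]*?href[\s]*=[\s\'"]*(?P<url>http.*?)[\'"\s>]'

# Collect the named group "url" from every match.
urls = [m.group('url') for m in re.finditer(reg, html)]
print(urls)  # ['http://ggg.gggg.com/ls/click?upn=ggg']
```

The mailto link is skipped because the named group requires the value to begin with http.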

You can also use lxml, BeautifulSoup, SimplifiedDoc and other libraries to extract data. Here is an example.

from simplified_scrapy import SimplifiedDoc
html = '''
<a style='color: black; text-decoration: none; border: 2px solid black; padding: 13px; width: 220px; display: block; text-align: center; margin: 20px 0; font-size: 15px; font-weight: bold;' href='http://ggg.gggg.com/ls/click?upn=ggg'>Verify my account</a>
<a href="mailto:noreply@ggg.com">noreply@ggg.com</a>
'''
doc = SimplifiedDoc(html)
lst = doc.selects('a').notContains('mailto:',attr='href').href
print(lst)

Result:

['http://ggg.gggg.com/ls/click?upn=ggg']

More examples are available here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
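If installing a third-party library is not an option, a similar result can probably be had with the standard library's `html.parser`. The following is only a sketch: `HttpLinkCollector` is a hypothetical helper name, the sample HTML is shortened from the question, and accepting `https://` in addition to `http://` is an assumption that goes slightly beyond what was asked.

```python
from html.parser import HTMLParser


class HttpLinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with http:// or https://."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


html = '''
<a style='color: black; text-decoration: none;' href='http://ggg.gggg.com/ls/click?upn=ggg'>Verify my account</a>
<a href="mailto:noreply@ggg.com">noreply@ggg.com</a>
'''

collector = HttpLinkCollector()
collector.feed(html)
print(collector.links)  # ['http://ggg.gggg.com/ls/click?upn=ggg']
```

Filtering on the scheme rather than on the presence of @ means a wanted link that happens to contain @ in its query string is still kept.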

dabingsou