
The purpose of my project is to scrape a search engine (I chose DuckDuckGo): get all the links on the first page, then visit each of those links, take the HTML source code, and run a regular expression that filters out all the .onion websites inside the HTML code.

I will assume from here that we have already scraped the search engine and collected all the websites on the first page (my search terms on DuckDuckGo were: dark web ".onion").

From here, this is how the code goes (I will detail things in the code comments):

import requests
from bs4 import BeautifulSoup
import urllib.parse
import re

html_data=[] 

#This will be the list that will contain the HTML code of
#each website I visit. For example, html_data[0]
#will contain the HTML source code of the first website,
#html_data[1] of the second website, and so on.

for x in links: #links is the list that contains all the websites that I got from web scraping DuckDuckGo.
    data = requests.get(str(x))
    html_data.append(data.text)

#Now html_data contains all the html source code of all the websites in links
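As an aside, a more defensive version of this fetch loop might look like the following sketch (the 10-second timeout is an arbitrary assumption, and unreachable sites are skipped instead of crashing the whole scrape):

```python
import requests

def fetch_pages(links, timeout=10):
    """Return the HTML of each reachable URL in links, skipping failures."""
    pages = []
    for url in links:
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()  # treat HTTP errors (404, 500, ...) as failures
            pages.append(resp.text)
        except requests.RequestException:
            continue  # skip unreachable or malformed URLs
    return pages
```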

print("")
print("============================ONIONS================================")
print("")


#Here I run a regex over each entry of the list (so that I get only .onion links)

for x in html_data:
    for m in re.finditer(r'(?:https?://)?(?:www)?(\S*?\.onion)\b', x, re.M | re.IGNORECASE):
        print(m.group(0))

So my code is working. But there is one problem: the regular expression is not filtering everything correctly. Some of the surrounding HTML gets captured along with my .onion websites, and I also often get .onion alone in the output.

Here is a sample of the output:

href="http://jv7aqstbyhd5hqki.onion
class="external_link">http://jv7aqstbyhd5hqki.onion
href="http://xdagknwjc7aaytzh.onion
data-qt-tooltip="xdagknwjc7aaytzh.onion
">http://xdagknwjc7aaytzh.onion
href="http://sbforumaz7v3v6my.onion
class="external_link">http://sbforumaz7v3v6my.onion
href="http://kpmp444tubeirwan.onion
class="external_link">http://kpmp444tubeirwan.onion
href="http://r5c2ch4h5rogigqi.onion
class="external_link">http://r5c2ch4h5rogigqi.onion
href="http://hbjw7wjeoltskhol.onion
class="external_link">http://hbjw7wjeoltskhol.onion
href="http://khqtqnhwvd476kez.onion
class="external_link">http://khqtqnhwvd476kez.onion
href="http://jahfuffnfmytotlv.onion
class="external_link">http://jahfuffnfmytotlv.onion
href="http://ocu3errhpxppmwpr.onion
class="external_link">http://ocu3errhpxppmwpr.onion
href="http://germanyhusicaysx.onion
data-qt-tooltip="germanyhusicaysx.onion
">http://germanyhusicaysx.onion
href="http://qm3monarchzifkwa.onion
class="external_link">http://qm3monarchzifkwa.onion
href="http://qm3monarchzifkwa.onion
class="external_link">http://qm3monarchzifkwa.onion
href="http://spofoh4ucwlc7zr6.onion
data-qt-tooltip="spofoh4ucwlc7zr6.onion
">http://spofoh4ucwlc7zr6.onion
href="http://nifgk5szbodg7qbo.onion
class="external_link">http://nifgk5szbodg7qbo.onion
href="http://t4is3dhdc2jd4yhw.onion
class="external_link">http://t4is3dhdc2jd4yhw.onion

I would like to know how I can improve this regex so that I get my .onion links in the correct format.

Lok Ridgmont

2 Answers


You could use this regex. It matches .onion URLs in the href attribute of any tag, working directly on the HTML source.

You won't need to pass regex flags, as they are included inline. What you want is in capture group 3.

r"(?si)<[\w:]+(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?(?<=\s)href\s*=\s*(?:(['\"])\s*(((?!mailto:)(?:(?:https?|ftp)://)?(?:(?:(?!\1)\S)+(?::(?:(?!\1)\S)*)?@)?(?:(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)*?\.onion\b)(?:(?!\1).)*?)\s*\1))\s+(?:\".*?\"|'.*?'|[^>]*?)+>"

https://regex101.com/r/oeYCxX/1

Readable version

 (?si)                         # Dot-all and case insensitive modifiers
 < [\w:]+                      # Any tag
 (?=
      (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
      (?<= \s )
      href \s* = \s*                # href attribute
      (?:
           ( ['"] )                      # (1)
           \s* 
           (                             # (2 start), Full url
                (                             # (3 start), The url up to '.onion'
                     (?! mailto: )
                     (?:
                          (?: https? | ftp )
                          ://
                     )?
                     (?:
                          (?:
                               (?! \1 )
                               \S 
                          )+
                          (?:
                               : 
                               (?:
                                    (?! \1 )
                                    \S 
                               )*
                          )?
                          @
                     )?
                     (?:
                          (?: [a-z\u00a1-\uffff0-9] -? )*
                          [a-z\u00a1-\uffff0-9]+ 
                     )
                     (?:
                          \.
                          (?: [a-z\u00a1-\uffff0-9] -? )*
                          [a-z\u00a1-\uffff0-9]+ 
                     )*?
                     \.onion                        \b 
                )                             # (3 end)
                (?:                           # Parameters
                     (?! \1 )
                     . 
                )*?
           )                             # (2 end)
           \s* \1 
      )
 )
 \s+ 
 (?: " .*? " | ' .*? ' | [^>]*? )+
 >
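Applying the pattern in Python might look like the following sketch; the sample HTML tag is made up for illustration, and per the note above the URL is read from capture group 3:

```python
import re

# The pattern above, copied verbatim.
pattern = r"(?si)<[\w:]+(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?(?<=\s)href\s*=\s*(?:(['\"])\s*(((?!mailto:)(?:(?:https?|ftp)://)?(?:(?:(?!\1)\S)+(?::(?:(?!\1)\S)*)?@)?(?:(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)*?\.onion\b)(?:(?!\1).)*?)\s*\1))\s+(?:\".*?\"|'.*?'|[^>]*?)+>"

# Made-up sample tag, similar to the question's output.
html = '<a class="external_link" href="http://jv7aqstbyhd5hqki.onion/index.html">link</a>'

for m in re.finditer(pattern, html):
    print(m.group(3))  # the URL up to and including .onion
```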

\S*? is too loose a pattern for URL matching. It will match as few non-whitespace characters as possible to satisfy the pattern, and \S accepts characters like < and > that don't belong in a URL.

For an idea of which characters are valid in a URL, see this answer: Which characters make a URL invalid?

You might be able to get away with something like [^\s<>] instead of \S. [^\s<>] will match any character that's not whitespace or an angle bracket, rather than matching anything non-whitespace.
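For example, substituting [^\s<>"] into the question's pattern (also excluding the double quote, which is an extra assumption beyond the suggestion above) keeps the match from spilling into surrounding markup; the sample string is made up to mimic the question's output:

```python
import re

# The question's pattern with \S replaced by [^\s<>"]; excluding the
# double quote as well goes slightly beyond the [^\s<>] suggestion.
pattern = r'(?:https?://)?(?:www)?([^\s<>"]*?\.onion)\b'

# Made-up fragment resembling the problematic output in the question.
sample = 'href="http://jv7aqstbyhd5hqki.onion" class="external_link">http://xdagknwjc7aaytzh.onion'

for m in re.finditer(pattern, sample, re.IGNORECASE):
    print(m.group(0))
# http://jv7aqstbyhd5hqki.onion
# http://xdagknwjc7aaytzh.onion
```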

John
  • Thank you for your suggestion! I tried replacing `\S` with `[^\s<>]` and it removed some of the unwanted parts, but most of them are still there, like `href="http://lw4ipk5choakk5ze.onion`. I also get a lot of `.onion`/`.Onion` alone, and even things like `16.http://nr6juudpp4as4gjg.onion`. I will look for a better way to solve this. Thank you for the link that you provided! – Lok Ridgmont Oct 02 '18 at 20:33