The purpose of my project is to web scrape a search engine (I chose DuckDuckGo): get all the links on the first page, then visit each of these links, take the HTML source code, and run a regular expression that extracts all the .onion websites found in the HTML code.
I will assume from here that we have already scraped the search engine and collected all the websites on the first page (my search terms on DuckDuckGo were: dark web ".onion").
From here, this is how the code goes (I explain the details in the code comments):
    import requests
    from bs4 import BeautifulSoup
    import urllib.parse
    import re

    # html_data holds the HTML source of each website I visit:
    # html_data[0] is the source of the first website,
    # html_data[1] of the second website, and so on.
    html_data = []

    # links is the list of websites I got from web scraping DuckDuckGo.
    for x in links:
        data = requests.get(str(x))
        html_data.append(data.text)

    # Now html_data contains the HTML source of every website in links.
    print("")
    print("============================ONIONS================================")
    print("")

    # Here I run a regex over each item of the list (so that I get only .onion links).
    for x in html_data:
        for m in re.finditer(r'(?:https?://)?(?:www)?(\S*?\.onion)\b', x, re.M | re.IGNORECASE):
            print(m.group(0))
So my code runs without errors, but there is one problem: the regular expression is not filtering everything correctly. Some of the surrounding HTML ends up attached to my .onion addresses, and I often get .onion alone in the output.
Here is a sample of the output:
href="http://jv7aqstbyhd5hqki.onion
class="external_link">http://jv7aqstbyhd5hqki.onion
href="http://xdagknwjc7aaytzh.onion
data-qt-tooltip="xdagknwjc7aaytzh.onion
">http://xdagknwjc7aaytzh.onion
href="http://sbforumaz7v3v6my.onion
class="external_link">http://sbforumaz7v3v6my.onion
href="http://kpmp444tubeirwan.onion
class="external_link">http://kpmp444tubeirwan.onion
href="http://r5c2ch4h5rogigqi.onion
class="external_link">http://r5c2ch4h5rogigqi.onion
href="http://hbjw7wjeoltskhol.onion
class="external_link">http://hbjw7wjeoltskhol.onion
href="http://khqtqnhwvd476kez.onion
class="external_link">http://khqtqnhwvd476kez.onion
href="http://jahfuffnfmytotlv.onion
class="external_link">http://jahfuffnfmytotlv.onion
href="http://ocu3errhpxppmwpr.onion
class="external_link">http://ocu3errhpxppmwpr.onion
href="http://germanyhusicaysx.onion
data-qt-tooltip="germanyhusicaysx.onion
">http://germanyhusicaysx.onion
href="http://qm3monarchzifkwa.onion
class="external_link">http://qm3monarchzifkwa.onion
href="http://qm3monarchzifkwa.onion
class="external_link">http://qm3monarchzifkwa.onion
href="http://spofoh4ucwlc7zr6.onion
data-qt-tooltip="spofoh4ucwlc7zr6.onion
">http://spofoh4ucwlc7zr6.onion
href="http://nifgk5szbodg7qbo.onion
class="external_link">http://nifgk5szbodg7qbo.onion
href="http://t4is3dhdc2jd4yhw.onion
class="external_link">http://t4is3dhdc2jd4yhw.onion
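As far as I can tell, the same behaviour can be reproduced without any network access. The sketch below runs my original pattern over a small, made-up HTML fragment (the `sample` string is hypothetical, not taken from a real page). My guess is that `\S*?` is the culprit: it accepts any non-whitespace character, including `=`, `"` and `>`, so the match starts at the beginning of the whitespace-delimited token, and because `*` also allows zero characters, a bare `.onion` matches too.

```python
import re

# The pattern from my code above, unchanged.
ORIGINAL = re.compile(r'(?:https?://)?(?:www)?(\S*?\.onion)\b', re.M | re.IGNORECASE)

# Hypothetical HTML fragment, made up to mimic the pages I scraped.
sample = (
    '<a href="http://jv7aqstbyhd5hqki.onion" class="external_link">'
    'http://jv7aqstbyhd5hqki.onion</a> sites ending in .onion'
)

# Collect every full match, exactly as the printing loop in my code does.
matches = [m.group(0) for m in ORIGINAL.finditer(sample)]
for match in matches:
    print(match)
```

On this fragment I get the same three kinds of lines as in my real output: one starting with `href="`, one starting with `class="external_link">`, and a lone `.onion`.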
I would like to know how I can improve this regex so that I get my .onion links in the correct format.
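One direction I am considering, as a rough sketch rather than a definitive fix: replace `\S*?` with a character class that only allows what an onion address can contain. This assumes that every address is either a 16-character (v2) or 56-character (v3) base32 name made of `a-z` and `2-7`, which holds for the sample above but may not cover everything I scrape. The names `ONION_RE` and `extract_onions` are my own, and any subdomain in front of the address would be dropped.

```python
import re

# Assumption: onion hosts are 16 (v2) or 56 (v3) base32 characters (a-z, 2-7).
# The leading \b stops the match from starting in the middle of a longer word,
# and group 1 captures only the host, without scheme or "www.".
ONION_RE = re.compile(
    r'\b(?:https?://)?(?:www\.)?((?:[a-z2-7]{56}|[a-z2-7]{16})\.onion)\b',
    re.IGNORECASE,
)


def extract_onions(html):
    """Return the unique .onion hosts found in html, in first-seen order."""
    hosts = (m.group(1).lower() for m in ONION_RE.finditer(html))
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(hosts))


if __name__ == "__main__":
    # Hypothetical fragment, made up to mimic the pages I scraped.
    sample = (
        '<a href="http://jv7aqstbyhd5hqki.onion" class="external_link">'
        'http://jv7aqstbyhd5hqki.onion</a> '
        '<a data-qt-tooltip="xdagknwjc7aaytzh.onion">x</a> ending in .onion'
    )
    for host in extract_onions(sample):
        print("http://" + host)
```

In my loop this would replace the `re.finditer` call with `extract_onions(x)`. I am not sure whether a regex is the right tool here at all, or whether I should parse the `href` attributes with BeautifulSoup first and only then test each value.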