I found a useful tool on GitHub that extracts the links from a URL you enter: https://github.com/devharsh/Links-Extractor
However, I want to extract all URLs on a page, not just clickable links. For example, given this in a site's HTML:
<a href="www.example.com">test</a>
in plaintext HTML: www.example.com
and <img src="www.example.com/picture.png">
would print out:
www.example.com
www.example.com
www.example.com/picture.png
I'm new to Python, and I haven't found any online tool that extracts URLs from multiple pages (I want to enter several URLs and have the script extract every URL from each page); the tools I found only accept a single URL and extract links from that one page at a time.
The script below only prints URLs from <a href=""></a> tags, not all of them.
Here is the Python code (edited to handle UTF-8 and percent encoding):
#!/usr/bin/python
__author__ = "Devharsh Trivedi"
__copyright__ = "Copyright 2018, Devharsh Trivedi"
__license__ = "GPL"
__version__ = "1.4"
__maintainer__ = "Devharsh Trivedi"
__email__ = "devharsh@live.in"
__status__ = "Production"
import sys
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
try:
    # each command-line argument is a page to scan
    for link in sys.argv[1:]:
        page = requests.get(link)
        soup = BeautifulSoup(page.text, "lxml")
        extlist = set()
        intlist = set()

        for a in soup.find_all("a", attrs={"href": True}):
            href = a['href'].strip()
            # skip empty links, fragments, and non-HTTP schemes
            if len(href) > 1 and href[0] != '#' and 'javascript:' not in href \
               and 'mailto:' not in href and 'tel:' not in href:
                if 'http' in href:
                    # absolute URL: internal if its host matches the page's host
                    if urlparse(link).netloc.lower() in urlparse(href).netloc.lower():
                        intlist.add(a['href'])
                    else:
                        extlist.add(a['href'])
                else:
                    # relative URL: always internal
                    intlist.add(a['href'])

        print('\n')
        print(link)
        print('---------------------')
        print('\n')
        print(str(len(intlist)) + ' internal links found:')
        print('\n')
        for il in intlist:
            print(il.encode("utf-8"))
        print('\n')
        print(str(len(extlist)) + ' external links found:')
        print('\n')
        for el in extlist:
            print(el.encode("utf-8"))
        print('\n')

except Exception as e:
    print(e)
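To show what I mean, here is a rough sketch of the behavior I'm after: instead of walking only `<a>` tags, it runs a regex over the raw HTML, so it catches `href` attributes, `src` attributes, and bare URLs in plain text alike. The pattern and the function name are just my guesses, and the regex is deliberately simple (it only looks for `http(s)://` or `www.` prefixes), not a complete URL matcher:

```python
import re

# matches absolute http(s):// URLs and bare www. URLs, whether they
# appear inside tag attributes or in plain text
URL_RE = re.compile(r'(?:https?://|www\.)[^\s"\'<>)]+')

def extract_urls(html):
    """Return every URL-looking string in the raw HTML, deduplicated and sorted."""
    return sorted(set(URL_RE.findall(html)))

# the example from above: an href, a plain-text URL, and an img src
sample = '<a href="www.example.com">test</a> www.example.com <img src="www.example.com/picture.png">'
print(extract_urls(sample))  # ['www.example.com', 'www.example.com/picture.png']
```

This could be dropped into the loop above (running on `page.text` for each command-line argument) to get all URLs from every page entered, though duplicates between the href and the plain text collapse into one entry because of the set.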