
I found a good tool on GitHub that lets you enter a URL and extracts the links from that page: https://github.com/devharsh/Links-Extractor

However, I want to extract all URLs on the page, not just the clickable links. For example, if the site's HTML contains:

<a href="www.example.com">test</a>
in plaintext HTML: www.example.com
and <img src="www.example.com/picture.png">

would print out:

www.example.com
www.example.com
www.example.com/picture.png

I'm new to Python, and I haven't found any online tools that extract URLs from multiple pages (I want to enter multiple URLs and have a single run extract all URLs from each of them); they only accept one URL at a time and extract links from that single page.

The tool above only prints the URLs from <a href=""></a> tags, not all of the URLs on the page.
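Something like the sketch below is roughly what I have in mind: grab every href and src attribute plus any bare URLs in the page text (untested on my end, and the function name is just a placeholder):

import re
import requests
from bs4 import BeautifulSoup

def all_urls(page_url):
    html = requests.get(page_url).text
    soup = BeautifulSoup(html, "lxml")
    found = set()
    # clickable links plus image/script/iframe sources
    for tag in soup.find_all(attrs={"href": True}):
        found.add(tag["href"])
    for tag in soup.find_all(attrs={"src": True}):
        found.add(tag["src"])
    # bare URLs that only appear in the visible text
    found.update(re.findall(r'https?://[^\s"<>]+', soup.get_text()))
    return found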

Here is the Python code (edited to handle UTF-8 and percent encoding):

#!/usr/bin/python

__author__ = "Devharsh Trivedi"
__copyright__ = "Copyright 2018, Devharsh Trivedi"
__license__ = "GPL"
__version__ = "1.4"
__maintainer__ = "Devharsh Trivedi"
__email__ = "devharsh@live.in"
__status__ = "Production"

import sys
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

try:

    for link in sys.argv[1:]:
        page = requests.get(link)
        soup = BeautifulSoup(page.text, "lxml")
        extlist = set()
        intlist = set()
        
        # keep only meaningful hrefs (skip in-page anchors, javascript:, mailto: and tel: links)
        for a in soup.find_all("a", attrs={"href": True}):
            href = a['href'].strip()
            if len(href) > 1 and href[0] != '#' and 'javascript:' not in href and 'mailto:' not in href and 'tel:' not in href:
                if 'http' in href:
                    # absolute URL: internal if it points at the same host, external otherwise
                    if urlparse(link).netloc.lower() in urlparse(href).netloc.lower():
                        intlist.add(a['href'])
                    else:
                        extlist.add(a['href'])
                else:
                    # relative URL, so it belongs to the same site
                    intlist.add(a['href'])
        
        print('\n')
        print(link)
        print('---------------------')
        print('\n')
        print(str(len(intlist)) + ' internal links found:')
        print('\n')
        for il in intlist:
            print(il.encode("utf-8"))
        print('\n')
        print(str(len(extlist)) + ' external links found:')
        print('\n')
        for el in extlist:
            print(el.encode("utf-8"))
        print('\n')
        
except Exception as e:
    print(e)
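
For reference, the script reads the page addresses from sys.argv[1:], so several pages can be processed in one run, e.g. python links_extractor.py https://example.com https://example.org (the filename here is just an assumption).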

1 Answer


Here's a quick regex to identify a URL:

(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?

In practice, this would look like:

import re
import requests
import sys

def find_urls(links):
  url_list = []
  for link in links:
    page = requests.get(link).text
    # raw string so the backslashes in the pattern are not treated as Python escape sequences
    parts = re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', page)
    # each match is (protocol, domain, path); stitch them back into full URLs
    true_url = [p + '://' + d + sd for p, d, sd in parts]
    url_list.extend(true_url)
  return url_list

print(find_urls(sys.argv[1:]))

The output for:

print(find_urls(['https://www.google.com']))

is:

['http://schema.org/WebPage', 'https://www.google.com/imghp?hl=en&tab=wi', 'https://maps.google.com/maps?hl=en&tab=wl', 'https://play.google.com/?hl=en&tab=w8', 'https://www.youtube.com/?gl=US&tab=w1', 'https://news.google.com/nwshp?hl=en&tab=wn', 'https://mail.google.com/mail/?tab=wm', 'https://drive.google.com/?tab=wo', 'https://www.google.com/intl/en/about/products?tab=wh', 'http://www.google.com/history/optout?hl=en', 'https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/']

Thanks to Rajeev here for the regex.

Edit: Given the author's updated use case, I did some trial and error and found this new regex:

((https?:\/\/.+)?(\/.*)+)

Here it is in practice:

def find_urls(links):
  url_list = []
  for link in links:
    page = requests.get(link).text
    # findall returns a tuple per match because of the capture groups;
    # the first element of each tuple is the full match, so keep only that
    parts = re.findall(r'((https?:\/\/.+)?(\/.*)+)', page)
    url_list.extend(match[0] for match in parts)
  return url_list

I can't guarantee this will work for every use case (I'm no regex expert), but it should work for most URLs and file paths you'll find in web pages.

  • Nice, I would like that to print text as a column and not a row. Thanks. – AAsomb113 Jul 22 '19 at 22:00
  • I've found a flaw, testing with this URL on tumblr: https://yama252527.tumblr.com/ it also prints a URL in a URL: https://www.tumblr.com/oembed/1.0?url=https://yama252527.tumblr.com/post/173227396788/%E3%82%AD%E3%83%AC%E3%82%A4%E3%83%8F%E3%83%8A-%E3%83%A6%E3%82%AB%E3%83%AA-21%E6%AD%B3-160-%E3%81%A9%E3%81%86%E3%81%97%E3%81%9F%E3%81%AE%E7%94%98%E3%81%88%E3%81%9F%E3%81%8F%E3%81%AA%E3%81%A3%E3%81%A1%E3%82%83%E3%81%A3%E3%81%9F – AAsomb113 Jul 22 '19 at 22:04
  • For your first comment, you could just do `for url in find_urls(...): print(url);` instead. Secondly, what are you expecting for that url? I think that's just how tumblr does that link. – maninthecomputer Jul 22 '19 at 23:25
  • First question: Do I replace only this last part?: print(find_urls(sys.argv[1:])) For the second question: Looks like when viewing it in Notepad++ (or any editor other than MS Notepad), it has both versions, one with and one without the “pre-pend” on the URL. That issue is solved. – AAsomb113 Jul 23 '19 at 02:40
  • First question: Yes. – maninthecomputer Jul 23 '19 at 15:00
  • Sadly Stack Overflow does not allow non-inline code formatting in comments. I replaced the very last line with [for url in find_urls(...): print(url);] and it errored out. – AAsomb113 Jul 23 '19 at 22:23
  • Replace `...` with `sys.argv[1:]`. I meant the ellipses as a placeholder. – maninthecomputer Jul 23 '19 at 22:24
  • ACK! When extracting links from this URL: https://www.uchinokomato.me/user/show/26774 it does not extract links to the individual posts (example: https://www.uchinokomato.me/chara/show/183475 ). Looking at the page's source code, those links use relative paths and thus are not read as URLs. Looks like I need to use the original extractor in combination with this edited version. – AAsomb113 Jul 24 '19 at 22:02
  • If you can, can you make it so it also corrects the URLs by prepending the domain? Like this: [href="/chara/show/183475"] would spit out [https://www.uchinokomato.me/chara/show/183475] by adding [https://www.uchinokomato.me] at the beginning (see the sketch after this thread). – AAsomb113 Jul 24 '19 at 22:14
  • I updated my regex some. It might match too much but I'm not 100% sure. – maninthecomputer Jul 24 '19 at 22:26
  • I found a very dangerous glitch: if the URL contains an exclamation mark, it outputs a version with the exclamation mark and everything after it removed. For example: https://s3-ap-northeast-1.amazonaws.com/uchinoko/charas/avatars/000/035/413/original/20150807_!_ai.png?1441471287 Becomes: https://s3-ap-northeast-1.amazonaws.com/uchinoko/charas/avatars/000/035/413/original/20150807_ Code: https://pastebin.com/J3974iWc – AAsomb113 Dec 07 '20 at 02:07
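
Regarding the comment about prepending the domain to relative links: one possible approach (not part of the answer above, just a sketch using the standard library) is to resolve each extracted path against the page it came from with urllib.parse.urljoin:

from urllib.parse import urljoin

page = 'https://www.uchinokomato.me/user/show/26774'
# resolve a relative path against the page it was found on
print(urljoin(page, '/chara/show/183475'))
# https://www.uchinokomato.me/chara/show/183475

As for the exclamation-mark glitch: if the pasted code still uses the first regex from this answer, note that its character class ([\w.,@?^=%&:/~+#-]*) does not include !, so matching stops at the exclamation mark; adding ! to that class would be one way to fix it.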