Extract all urls in a string with python3

Question

I am trying to find a clean way to extract all urls in a text string.

After an extensive search, I have found many posts suggesting regular expressions for this task, each offering a pattern that is supposed to do it. Each of these regexes has some advantages and some shortcomings, and editing them to change their behaviour is not straightforward. Anyway, at this point I am happy with any regex that detects the URLs in this text correctly:

Input:

Lorem ipsum dolor sit amet https://www.lorem.com/ipsum.php?q=suas, nusquam tincidunt ex per, ius modus integre no, quando utroque placerat qui no. Mea conclusionemque vituperatoribus et, omnes malorum est id, pri omnes atomorum expetenda ex. Elit pertinacia no eos, nonumy comprehensam id mei. Ei eum maiestatis quaerendum https://www.lorem.org. Pri posse constituam in, sit http://news.bbc.co.uk omnium assentior definitionem ei. Cu duo equidem meliore qualisque.

Output:

['https://www.lorem.com/ipsum.php?q=suas', 'https://www.lorem.org', 'http://news.bbc.co.uk']

But if there is a Python 3 class, function, or library that finds all URLs in a given text and takes parameters to:

select which protocols to detect
select which TLDs are allowed
select which domains are allowed

I would be very happy to know about it.

I think you fell asleep while writing your question title.. – Jun 20 '17 at 06:00
Maybe. So, I've edited the question title... – Ouss Jun 20 '17 at 07:32

score 6 · Accepted Answer · 2017-06-20T06:38:02.997

Apart from what others mentioned, since you've asked for something that already exists, you might want to try URLExtract.

Apparently it tries to find any occurrence of a TLD in the given text. If a TLD is found, it starts from that position and expands the boundaries to both sides, searching for a "stop character" (usually white space, a comma, or a single or double quote).
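To make the description above concrete, here is a simplified illustration of that idea in plain Python. It is not URLExtract's actual implementation: the function name `find_by_tld`, the short TLD tuple, and the stop-character set are all assumptions made for this sketch.

```python
# Hypothetical, simplified illustration of "find a TLD, then expand both ways".
STOP_CHARS = set(" \t\n,'\"")          # assumed stop characters
KNOWN_TLDS = (".com", ".org", ".uk")   # assumed, tiny TLD list


def find_by_tld(text, tlds=KNOWN_TLDS, stop_chars=STOP_CHARS):
    found = []
    pos = 0
    while pos < len(text):
        # Locate the earliest TLD occurrence at or after pos.
        hits = [(text.find(tld, pos), tld) for tld in tlds]
        hits = [(idx, tld) for idx, tld in hits if idx != -1]
        if not hits:
            break
        idx, tld = min(hits)
        # Expand to the left until a stop character (or the start of the text).
        start = idx
        while start > 0 and text[start - 1] not in stop_chars:
            start -= 1
        # Expand to the right until a stop character (or the end of the text).
        end = idx + len(tld)
        while end < len(text) and text[end] not in stop_chars:
            end += 1
        found.append(text[start:end])
        pos = end
    return found


print(find_by_tld("Let's have URL example.com as an example."))
```

A naive version like this keeps a sentence-ending period attached to the match and treats any substring such as ".com" as a TLD, which is part of why a maintained library is the safer choice.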

You have a couple of examples here.

from urlextract import URLExtract

extractor = URLExtract()
urls = extractor.find_urls("Let's have URL youfellasleepwhilewritingyourtitle.com as an example.")
print(urls)  # prints: ['youfellasleepwhilewritingyourtitle.com']

It seems that this module also has an update() method, which lets you update the TLD list cache file.

However, if that doesn't fit your specific requirements, you can manually do some checks after you've processed the URLs using the above module (or any other way of parsing the URLs). For example, say you get a list of the URLs:

result = ['https://www.lorem.com/ipsum.php?q=suas', 'https://www.lorem.org', 'http://news.bbc.co.uk']

You can then build other lists which hold the allowed protocols, TLDs, and domains:

allowed_protocols = ['protocol_1', 'protocol_2']
allowed_tlds = ['tld_1', 'tld_2', 'tld_3']
allowed_domains = ['domain_1']

for each_url in result:
    # here, check each url against your rules
    pass
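One possible way to fill in the loop above, as a minimal sketch: it parses each URL with the standard library's `urllib.parse.urlsplit` and applies whichever rules were supplied. The function name `filter_urls` and its parameters are hypothetical, and the TLD check is deliberately naive (it only looks at the last dot-separated label of the host).

```python
from urllib.parse import urlsplit


def filter_urls(urls, allowed_protocols=None, allowed_tlds=None, allowed_domains=None):
    """Keep only URLs that pass every rule that was supplied (None = no restriction)."""
    kept = []
    for url in urls:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        if allowed_protocols is not None and parts.scheme not in allowed_protocols:
            continue
        # Naive TLD check: compares only the last dot-separated label of the host.
        if allowed_tlds is not None and host.rsplit(".", 1)[-1] not in allowed_tlds:
            continue
        # A domain matches if the host equals it or is a subdomain of it.
        if allowed_domains is not None and not any(
            host == domain or host.endswith("." + domain) for domain in allowed_domains
        ):
            continue
        kept.append(url)
    return kept


result = ['https://www.lorem.com/ipsum.php?q=suas', 'https://www.lorem.org', 'http://news.bbc.co.uk']
print(filter_urls(result, allowed_protocols={"https"}))
print(filter_urls(result, allowed_tlds={"org", "uk"}))
print(filter_urls(result, allowed_domains={"bbc.co.uk"}))
```

Note that `urlsplit` only recognises a host when the URL has a scheme, so scheme-less matches such as `example.com` would need a scheme prepended before this check.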

Gahan · Answer 2 · 2017-06-20T06:44:58.393

import re
import string
text = """
Lorem ipsum dolor sit amet https://www.lore-m.com/ipsum.php?q=suas, 
nusquam tincidunt ex per, ftp://link.com ius modus integre no, quando utroque placerat qui no. 
Mea conclusionemque vituperatoribus et, omnes malorum est id, pri omnes atomorum expetenda ex. 
Elit ftp://link.work.in pertinacia no eos, nonumy comprehensam id mei. Ei eum maiestatis quaerendum https://www.lorem.org. 
Pri posse constituam in, sit http://news.bbc.co.uk omnium assentior definitionem ei. Cu duo equidem meliore 
qualisque.
"""

URL_REGEX = r"""((?:(?:https|ftp|http)?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|org|uk)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|uk|ac)\b/?(?!@)))"""

urls = re.findall(URL_REGEX, text)
print([''.join(x for x in url if x in string.printable) for url in urls])

Now, if you want to keep only URLs with valid domains, you can write it as follows:

VALID_DOMAINS = ['lorem.org', 'bbc.co.uk', 'sample.com', 'link.net']
valid_urls = []
for url in urls:
    # keep the url if it contains any of the valid domains (each url is appended once)
    if any(val_domain in url for val_domain in VALID_DOMAINS):
        valid_urls.append(url)
print(valid_urls)

Piotr Wasilewicz · Answer 3 · 2017-06-20T05:52:12.473

output = [x for x in input().split() if x.startswith('http://') or x.startswith('https://') or x.startswith('ftp://')]
print(output)

your example: http://ideone.com/wys57x

Afterwards, you can also cut the last character of each list element for as long as it is not a letter.

EDIT:

output = [x for x in input().split() if x.startswith('http://') or x.startswith('https://') or x.startswith('ftp://')]
newOutput = []
for link in output:
    copy = link
    while not copy[-1].isalpha():
        copy = copy[:-1]
    newOutput.append(copy)
print(newOutput)

Your example: http://ideone.com/gHRQ8w
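A variation on the same idea, as a hedged sketch: instead of trimming until the last character is a letter, strip only a fixed set of trailing punctuation characters, so URLs that legitimately end in a digit or a slash are left intact. The function name `extract_urls` and the punctuation set are assumptions chosen for illustration.

```python
TRAILING_PUNCTUATION = ".,;:!?"  # assumed set of sentence punctuation to trim


def extract_urls(text, schemes=("http://", "https://", "ftp://")):
    urls = []
    for word in text.split():
        # str.startswith accepts a tuple of prefixes.
        if word.startswith(schemes):
            urls.append(word.rstrip(TRAILING_PUNCTUATION))
    return urls


sample = "See https://www.lorem.org. Then http://news.bbc.co.uk/2/ and ftp://link.com, ok"
print(extract_urls(sample))
```

With the letter-based loop, `http://news.bbc.co.uk/2/` would be cut back to `http://news.bbc.co.uk`; here it is kept as written.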

edited Jun 20 '17 at 05:52

answered Jun 20 '17 at 05:33

Thanks for the reply. your method fails to exclude the smily in "https://www.lorem.org" – Ouss Jun 20 '17 at 05:40
Nope. Just "cut last character in elements of list if it is not a letter." – Piotr Wasilewicz Jun 20 '17 at 05:47
@Ouss check my answer now. – Piotr Wasilewicz Jun 20 '17 at 05:53
Can try this too: content = input.split(' ') newOutput = [] for val in content: if val.startswith('http://') or val.startswith('https://'): newOutput.append(val) – Anubhav Singh Jun 20 '17 at 05:59
@AnubhavSingh I think a list comprehension is better here. And `content = input().split(' ')` :) – Piotr Wasilewicz Jun 20 '17 at 06:09

Taku · Answer 4 · 2017-06-20T06:26:32.107

If you want a regex, you can use this:

import re


string = "Lorem ipsum dolor sit amet https://www.lorem.com/ipsum.php?q=suas, nusquam tincidunt ex per, ius modus integre no, quando utroque placerat qui no. Mea conclusionemque vituperatoribus et, omnes malorum est id, pri omnes atomorum expetenda ex. Elit pertinacia no eos, nonumy comprehensam id mei. Ei eum maiestatis quaerendum https://www.lorem.org. Pri posse constituam in, sit http://news.bbc.co.uk omnium assentior definitionem ei. Cu duo equidem meliore qualisque."

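# The trailing character class also matches a sentence-ending period, so strip it afterwards.
result = [url.rstrip(".") for url in re.findall(r"\w+://\w+\.\w+\.\w+/?[\w\.\?=#]*", string)]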
print(result)

Output:

['https://www.lorem.com/ipsum.php?q=suas', 
 'https://www.lorem.org', 
 'http://news.bbc.co.uk']
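To see what each part of that pattern does, here is the same regex written with `re.VERBOSE` so the pieces can be commented. This is only a readability sketch; the name `URL_PATTERN` and the sample string are assumptions, and the pattern itself is unchanged.

```python
import re

URL_PATTERN = re.compile(
    r"""
    \w+://          # scheme followed by ://, e.g. http:// or https://
    \w+\.\w+\.\w+   # at least three dot-separated labels, e.g. www.lorem.com
    /?              # optional slash after the host
    [\w.?=#]*       # rest of the URL: word characters, dots, '?', '=' and '#'
    """,
    re.VERBOSE,
)

sample = (
    "Ei eum https://www.lorem.com/ipsum.php?q=suas, sit "
    "http://news.bbc.co.uk omnium http://lorem.org/ end"
)
print(URL_PATTERN.findall(sample))
```

Written out this way, the limits are easier to spot: a host with only two labels (such as `http://lorem.org/`) or with a hyphen in it is not matched, and a period directly after a URL is captured as part of it.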

edited Jun 20 '17 at 06:26

answered Jun 20 '17 at 05:57

Check again: your result is not what you have written, it is `['https://www.lorem.com/', 'https://www.lorem.org.', 'http://news.bbc.co.']` – Gahan Jun 20 '17 at 06:10
Ohhh sorry, I made a typo when copying and pasting the code and wrote \d instead of \w. @Gahan thanks for catching that – Taku Jun 20 '17 at 06:15

score 0 · Answer 5 · answered Mar 09 '18 at 09:35

Using an existing library is probably the best solution.

But it was too much for my tiny script, so, inspired by @piotr-wasilewicz's answer, I came up with:

from string import ascii_letters
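# `line` is the text to scan. Join the non-letter characters into the strip set;
# str() of a set would also add braces, quotes and commas (or the letters of "set()").
links = [x for x in line.split() if x.strip(''.join(set(x) - set(ascii_letters))).startswith(('http', 'https', 'www'))]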

That is, for each word in the line:
strip, from the beginning and the end, the characters found in the word itself that are not ASCII letters,
then keep the word if what remains starts with one of http, https, or www.

A bit too dense for my taste and I have no clue how fast it is, but it should detect most "sane" urls in a string.
