I am trying to find a clean way to extract all URLs from a text string.
After an extensive search, I have found many posts that suggest using regular expressions for this task and provide patterns that are supposed to do it. Each of these regexes has some advantages and some shortcomings, and editing them to change their behaviour is not straightforward. At this point I would be happy with any regex that correctly detects the URLs in this text:
Input:
Lorem ipsum dolor sit amet https://www.lorem.com/ipsum.php?q=suas, nusquam tincidunt ex per, ius modus integre no, quando utroque placerat qui no. Mea conclusionemque vituperatoribus et, omnes malorum est id, pri omnes atomorum expetenda ex. Elit pertinacia no eos, nonumy comprehensam id mei. Ei eum maiestatis quaerendum https://www.lorem.org. Pri posse constituam in, sit http://news.bbc.co.uk omnium assentior definitionem ei. Cu duo equidem meliore qualisque.
Output:
['https://www.lorem.com/ipsum.php?q=suas', 'https://www.lorem.org', 'http://news.bbc.co.uk']
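For reference, a minimal regex baseline that handles this particular input might look like the sketch below. The name `find_urls` is only illustrative, and the sketch assumes that every URL starts with `http://` or `https://` and that trailing punctuation belongs to the sentence rather than to the URL.

```python
import re

# Assumption: a URL is "http://" or "https://" followed by non-whitespace characters.
URL_RE = re.compile(r"https?://\S+")

def find_urls(text):
    """Return all http(s) URLs in text, without trailing sentence punctuation."""
    # rstrip removes characters such as "," or "." left over from the sentence.
    return [m.rstrip(".,;:!?)\"'") for m in URL_RE.findall(text)]

text = (
    "Lorem ipsum dolor sit amet https://www.lorem.com/ipsum.php?q=suas, "
    "nusquam tincidunt ex per. Ei eum maiestatis quaerendum "
    "https://www.lorem.org. Pri posse constituam in, sit "
    "http://news.bbc.co.uk omnium assentior definitionem ei."
)

print(find_urls(text))
# ['https://www.lorem.com/ipsum.php?q=suas', 'https://www.lorem.org', 'http://news.bbc.co.uk']
```

This works for the sample, but it would also cut off a URL that legitimately ends in one of the stripped characters, which is the kind of shortcoming I keep running into.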
However, if there is a Python 3 class, function, or library that finds all URLs in a given text and takes parameters to:
- select which protocols to detect
- select which TLDs are allowed
- select which domains are allowed
I would be very happy to know about it.
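To make the request concrete, here is a rough sketch of the kind of interface I have in mind. The function name `extract_urls` and its parameters `protocols`, `tlds`, and `domains` are made up for illustration and do not come from any existing library. The sketch also simplifies things: it treats the TLD as the last dot-separated label of the host (so `co.uk` counts as `uk`), and it matches a domain if the host equals it or is a subdomain of it.

```python
import re
from urllib.parse import urlsplit

def extract_urls(text, protocols=("http", "https"), tlds=None, domains=None):
    """Hypothetical helper: find URLs in text, filtered by scheme, TLD and domain.

    tlds and domains are optional collections of lowercase strings;
    None means "allow everything".
    """
    # Build a pattern that only matches the requested schemes.
    scheme_alt = "|".join(re.escape(p) for p in protocols)
    pattern = re.compile(r"\b(?:%s)://[^\s<>\"']+" % scheme_alt, re.IGNORECASE)

    results = []
    for match in pattern.finditer(text):
        # Assumption: trailing punctuation belongs to the sentence, not the URL.
        url = match.group(0).rstrip(".,;:!?)")
        try:
            host = (urlsplit(url).hostname or "").lower()
        except ValueError:
            continue  # malformed host, e.g. an unbalanced "[" in the netloc
        if not host:
            continue
        # Simplification: the TLD is the last label of the host name.
        if tlds is not None and host.rsplit(".", 1)[-1] not in tlds:
            continue
        # Accept the domain itself or any of its subdomains.
        if domains is not None and not any(
            host == d or host.endswith("." + d) for d in domains
        ):
            continue
        results.append(url)
    return results

sample = (
    "amet https://www.lorem.com/ipsum.php?q=suas, nusquam "
    "quaerendum https://www.lorem.org. Pri posse, sit "
    "http://news.bbc.co.uk omnium assentior."
)

print(extract_urls(sample, tlds={"org"}))
print(extract_urls(sample, domains={"bbc.co.uk"}))
```

Something along these lines, but properly tested and with correct handling of multi-part suffixes such as `co.uk`, is what I am looking for.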