
I would like to find all URLs in a string. I found various solutions on StackOverflow that vary depending on the content of the string.

For example, supposing my string contained HTML, this answer recommends using either BeautifulSoup or lxml.
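For instance (a minimal sketch of the BeautifulSoup approach; the markup and variable names here are my own illustration, not from that answer):

from bs4 import BeautifulSoup

html = '<a href="http://www.example.com/abc.html">Click Me!</a>'
soup = BeautifulSoup(html, 'html.parser')
for anchor in soup.find_all('a', href=True):
    print "Found Link: ", anchor['href']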

On the other hand, if my string contained only a plain URL without HTML tags, this answer recommends using a regular expression.
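Roughly along these lines (a simplified pattern, just for illustration):

import re

text = "Check out http://www.example.com/xyz.html for details"
print re.findall(r'https?://\S+', text)
# ['http://www.example.com/xyz.html']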

I wasn't able to find a good solution for the case where my string contains both an HTML-encoded URL and a plain URL. Here is some example code:

import lxml.html

example_data = """<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>
http://www.another-random-domain.com/xyz.html"""
dom = lxml.html.fromstring(example_data)
for link in dom.xpath('//a/@href'):
    print "Found Link: ", link

As expected, this results in:

Found Link:  http://www.some-random-domain.com/abc123/def.html

I also tried the twitter-text-python library that @Yannisp mentioned, but it doesn't seem to extract both URLs:

>>> from ttp.ttp import Parser
>>> p = Parser()
>>> r = p.parse(example_data)
>>> r.urls
['http://www.another-random-domain.com/xyz.html']

What is the best approach for extracting both kinds of URLs from a string containing a mix of HTML and non-HTML-encoded data? Is there a good module that already does this? Or am I forced to combine regex with BeautifulSoup/lxml?

Dirty Penguin

3 Answers


I upvoted because this triggered my curiosity. There seems to be a library called twitter-text-python that parses Twitter posts to detect both URLs and hrefs. Otherwise, I would go with the combination of regex + lxml.
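A minimal sketch of that combination (the pattern and helper name are mine, purely illustrative):

import re
import lxml.html

def find_urls(data):
    urls = set()
    # URLs inside anchor tags
    urls.update(lxml.html.fromstring(data).xpath('//a/@href'))
    # bare URLs in the surrounding text
    urls.update(re.findall(r'https?://[^\s"<>]+', data))
    return urls

The regex will usually re-match the href URLs as well, but the set takes care of the duplicates.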

Yannis P.
  • That looks like a cool library. I took a peek at the source and it appears the author manually extracts the URL. I updated my question with the results of my testing this library. – Dirty Penguin May 19 '15 at 18:17
  • Accepting your answer, as the `twitter-text-python` helped me fill the gap that `lxml` left. I ended up using both. Thanks. – Dirty Penguin May 19 '15 at 18:58

You could use a regular expression to find all URLs:

import re
urls = re.findall(r"(https?://[\w/$.+!*'()-]+)", example_data)

The character class matches alphanumerics, '/', and the other punctuation characters allowed in a URL.
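Applied to the example data from the question, it picks up both URLs, since the quote that closes the href attribute ends the first match:

>>> re.findall(r"(https?://[\w/$.+!*'()-]+)", example_data)
['http://www.some-random-domain.com/abc123/def.html', 'http://www.another-random-domain.com/xyz.html']

Note that the character class leaves out some legal URL characters such as '?', '=', '&' and '#', so query strings and fragments would be cut short.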

chishui

Based on the answer by @YannisP, I was able to come up with this solution:

import lxml.html  
from ttp.ttp import Parser

def extract_urls(data):
    urls = set()
    # First extract HTML-encoded URLs
    dom = lxml.html.fromstring(data)
    for link in dom.xpath('//a/@href'):
        urls.add(link)
    # Next, extract URLs from plain text
    parser = Parser()
    results = parser.parse(data)
    for url in results.urls:
        urls.add(url)
    return list(urls)

This results in:

>>> example_data
'<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>\nhttp://www.another-random-domain.com/xyz.html'
>>> urls = extract_urls(example_data)
>>> print urls
['http://www.another-random-domain.com/xyz.html', 'http://www.some-random-domain.com/abc123/def.html']

I'm not sure how well this will work on other URLs, but it seems to work for what I need it to do.
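As a quick sanity check (input invented for the test), the set also deduplicates when the same URL shows up both as an href and as plain text:

>>> data = '<a href="http://www.example.com/a.html">A</a>\nhttp://www.example.com/a.html'
>>> extract_urls(data)
['http://www.example.com/a.html']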

Dirty Penguin