
I have an HTML document in which there are both link tags and bare text URLs. I'd like to wrap the text URLs in anchor tags while leaving the existing link tags unchanged. This snippet turns all URLs into anchors, but it double-wraps the existing anchors, too:

import re

def replace_url_to_link(value):
    # Wrap bare URLs in anchor tags
    urls = re.compile(r"((https?):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+\-=\\\.&]*)", re.MULTILINE|re.UNICODE)
    value = urls.sub(r'<a href="\1" target="_blank">\1</a>', value)
    # Wrap email addresses in mailto: anchors
    emails = re.compile(r"([\w\-\.]+@(\w[\w\-]+\.)+[\w\-]+)", re.MULTILINE|re.UNICODE)
    value = emails.sub(r'<a href="mailto:\1">\1</a>', value)
    return value

Here's a PHP/Regex solution: regex to turn URLs into links without messing with existing links in the text. However, I couldn't find a Python answer to this question. Also, since I'm already walking the DOM tree with lxml anyway, I'd prefer an lxml solution over regex, not least for performance.
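One way the lxml route could look, as a sketch rather than a tested answer: walk the tree recursively, leave `<a>` elements alone, and split each element's `text` and each child's `tail` around bare URLs, inserting new anchor elements for the matches. The function name `wrap_urls` and the use of `lxml.builder.E` are my own choices, not anything from the question.

```python
import re
from lxml import html
from lxml.builder import E

URL = re.compile(r'(https?://[^\s<>"]+)')

def wrap_urls(el):
    # Leave existing anchors (and everything inside them) untouched
    if el.tag == 'a':
        return
    # Bare URLs in el.text become new <a> children inserted at the front
    if el.text and URL.search(el.text):
        parts = URL.split(el.text)  # alternating text/URL chunks
        el.text = parts[0]
        for pos, i in enumerate(range(1, len(parts), 2)):
            a = E.a(parts[i], href=parts[i], target="_blank")
            a.tail = parts[i + 1]
            el.insert(pos, a)
    # Recurse, then do the same for each child's tail text
    for child in list(el):
        wrap_urls(child)
        if child.tail and URL.search(child.tail):
            parts = URL.split(child.tail)
            child.tail = parts[0]
            pos = el.index(child) + 1
            for i in range(1, len(parts), 2):
                a = E.a(parts[i], href=parts[i], target="_blank")
                a.tail = parts[i + 1]
                el.insert(pos, a)
                pos += 1
```

Recursing over a `list(el)` snapshot (instead of `el.iter()`) keeps the traversal predictable while new anchors are being inserted; the freshly created anchors are skipped by the `el.tag == 'a'` guard anyway.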

That said, it's a cleaned HTML document, meaning all existing anchors are formed identically, so regex is at least a possible choice.
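Given that the anchors are uniformly formed, the regex route from the linked PHP answer might be sketched in Python like this: split the document on the existing `<a>...</a>` elements and only linkify the text between them. The name `linkify_outside_anchors` is made up for the sketch.

```python
import re

# Capturing group in ANCHOR means re.split() keeps the anchors
# at the odd indices of the result, so they survive untouched.
ANCHOR = re.compile(r'(<a\b[^>]*>.*?</a>)', re.IGNORECASE | re.DOTALL)
URL = re.compile(r'(https?://[^\s<>"]+)')

def linkify_outside_anchors(value):
    parts = ANCHOR.split(value)
    # Even indices are the chunks between existing anchors
    for i in range(0, len(parts), 2):
        parts[i] = URL.sub(r'<a href="\1" target="_blank">\1</a>', parts[i])
    return ''.join(parts)
```

This avoids the double-wrapping because existing anchors never reach `URL.sub()` at all; the email case from the snippet above could be handled the same way with a second substitution on the even-indexed chunks.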

Simon Steinberger
  • http://stackoverflow.com/questions/2073541/search-and-replace-in-html-with-beautifulsoup or http://stackoverflow.com/questions/459981/beautifulsoup-modifying-all-links-in-a-piece-of-html beautiful soup is pretty easy to work with. – user1269942 Jun 22 '15 at 17:12
  • Both SO answers don't really tackle this problem. Also, performance-wise it doesn't make sense to parse the DOM doc with BS in addition to lxml. AFAIK, lxml is orders of magnitude faster than BS. So, that doesn't help. – Simon Steinberger Jun 22 '15 at 17:16
  • you're right! sorry about that...and thanks for pointing it out. I should wait until my morning coffee is done before coming on SO! You've got the appropriate tags...hopefully a regex guru will come help. 'luck. – user1269942 Jun 22 '15 at 17:21

0 Answers