I have an HTML document in which there are both link tags and text URLs. I'd like to wrap the text URLs into anchor tags, while leaving the existing links tags unchanged. This snippet turns all URLs into anchors, but it double wraps existing tags into anchors, too:
def replace_url_to_link(value):
# Replace url to link
urls = re.compile(r"((https?):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)", re.MULTILINE|re.UNICODE)
value = urls.sub(r'<a href="\1" target="_blank">\1</a>', value)
# Replace email to mailto
urls = re.compile(r"([\w\-\.]+@(\w[\w\-]+\.)+[\w\-]+)", re.MULTILINE|re.UNICODE)
value = urls.sub(r'<a href="mailto:\1">\1</a>', value)
return value
Here's a PHP/Regex solution: regex to turn URLs into links without messing with existing links in the text. However, I couldn't find a Python answer to this question. Also, since I'm walking over the DOM tree with lxml anyways, I'd prefer an lxml solution over regex - also performance-wise.
Yet, it's a cleaned HTML document, meaning all existing anchors are identically formed and regex is at least a possible choice.