Python String Replace: Keywords into URLs

Question

I want to replace some keywords in a string with URLs, for example:

content.replace("Google", '<a href="http://www.google.com">Google</a>')

However, I want to replace a keyword ONLY if it is not already wrapped in a URL.

The content is simple HTML:

<p><b>This is an example!</b></p><p>I love <a href="http://www.google.com">Google</a></p><p><a href="http://www.google.com"><img src="/google.jpg" /></a></p>

Mainly <a> and <img> tags.

The main question: how can I determine whether a keyword is already wrapped in an <a> or <img> tag?

There is a similar question for PHP, "find and replace keywords with urls ONLY if not already wrapped in a url", but the answer there is not efficient.

Are there better solutions in Python? Code examples would be appreciated. Thanks!

Could you give an example of the kind of text you want to run this function on? — Acorn, Jun 09 '12 at 11:30
@Acorn HTML web page. Example: `This is an example! I love Google` — Susan Mayer, Jun 09 '12 at 11:32
You can use the example I have shown below to create a regular expression that matches that with <a> or <img> tags. — tabchas, Jun 09 '12 at 12:08

score 4 · Answer 1 · edited May 23 '17 at 12:03

I use Beautiful Soup to parse HTML, since parsing HTML with regular expressions can prove tricky. With Beautiful Soup you can use previous_sibling and previous_element to figure out what you need.

You interact in this fashion:

soup.find_all('img')

answered Jun 09 '12 at 21:02

topless

score 3 · Accepted Answer · answered Jun 09 '12 at 22:02

As Chris-Top said, BeautifulSoup is the way to go:

import re

from bs4 import BeautifulSoup  # Beautiful Soup 4 (pip install beautifulsoup4)

html = """
<div>
    <p>The quick brown <a href='http://en.wikipedia.org/wiki/Dog'>fox</a> jumped over the lazy Dog</p>
    <p>The <a href='http://en.wikipedia.org/wiki/Dog'>dog</a>, who was, in reality, not so lazy, gave chase to the fox.</p>
    <p>See image for reference:</p>
    <img src='dog_chasing_fox.jpg' title='Dog chasing fox'/>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# (search term, URL) pairs
keywords = [("dog", "http://en.wikipedia.org/wiki/Dog"),
            ("fox", "http://en.wikipedia.org/wiki/Fox")]

def insert_links(keyword, href):
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    for text in soup.find_all(string=pattern):
        # Skip text that already sits inside a link, at any depth.
        if text.find_parent("a") is not None:
            continue
        pieces = pattern.split(text)
        # Rebuild the text node as: piece, <a>, piece, <a>, ..., piece
        node = soup.new_string(pieces[0])
        text.replace_with(node)
        for piece in pieces[1:]:
            link = soup.new_tag("a", href=href)
            link.string = keyword
            node.insert_after(link)
            node = soup.new_string(piece)
            link.insert_after(node)

for keyword, href in keywords:
    insert_links(keyword, href)

print(soup)

Will yield:

<div>
    <p>The quick brown <a href="http://en.wikipedia.org/wiki/Dog">fox</a> jumped over the lazy <a href="http://en.wikipedia.org/wiki/Dog">dog</a></p>
    <p>The <a href="http://en.wikipedia.org/wiki/Dog">dog</a>, who was, in reality, not so lazy, gave chase to the <a href="http://en.wikipedia.org/wiki/Fox">fox</a>.</p>
    <p>See image for reference:</p>
    <img src="dog_chasing_fox.jpg" title="Dog chasing fox"/>
</div>
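
If installing Beautiful Soup is not an option, the same idea (only touch text that is not inside an <a> element) can be sketched with the standard library's html.parser. This is a rough illustration rather than a drop-in replacement: the names KeywordLinker and link_keywords are made up for this example, and it assumes simple, well-formed HTML like the sample in the question.

```python
import re
from html import escape
from html.parser import HTMLParser


class KeywordLinker(HTMLParser):
    """Re-emit HTML, linking keywords found in text outside <a> elements."""

    def __init__(self, keywords):
        # convert_charrefs=False so entities such as &amp; are passed through as-is
        super().__init__(convert_charrefs=False)
        self.keywords = {k.lower(): url for k, url in keywords.items()}
        alternatives = "|".join(re.escape(k) for k in self.keywords)
        self.pattern = re.compile(r"\b(%s)\b" % alternatives, re.IGNORECASE)
        self.out = []
        self.link_depth = 0  # greater than 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text, unchanged

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... /> carry no text; copy them verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Attribute values never arrive here, so src/alt/title stay untouched.
        if self.link_depth == 0:
            data = self.pattern.sub(self._link, data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def _link(self, match):
        url = self.keywords[match.group(0).lower()]
        return '<a href="%s">%s</a>' % (escape(url, quote=True), match.group(0))


def link_keywords(html, keywords):
    """Return html with each keyword linked, except where it is already linked."""
    parser = KeywordLinker(keywords)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = '<p>Google and <a href="http://www.google.com">Google</a></p>'
print(link_keywords(sample, {"Google": "http://www.google.com"}))
```

Unlike the Beautiful Soup version, this keeps the matched text's original capitalization in the link and tracks nesting with a simple counter, so an unclosed <a> would suppress linking for the rest of the document.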

Wow this entire time I was trying to solve the problem using the HTMLParser library... I was working on it for like 3 hours... And then there was a library already made for it :( — tabchas, Jun 09 '12 at 22:45
@Kevin P thanks for putting the time to submit some working code :) — topless, Jun 10 '12 at 11:28

score 0 · Answer 3 · answered Jun 09 '12 at 11:34

You can try a regular expression, as mentioned in the previous post. First check your string against a regular expression to see whether it has already been wrapped in a URL. A call to the re library's search() method should do the trick.

Here is a tutorial on regular expressions and the search method specifically, if you need one: http://www.tutorialspoint.com/python/python_reg_expressions.htm

If the check shows that the string is not already wrapped in a URL, you can call the replace function.

Here is a quick example that I wrote:

    import re

    x = '<a href="http://www.google.com">Google</a>'
    y = 'Google'

    def check_url(string):
        # Crude check: does an opening <a href...> appear anywhere in the string?
        if re.search(r'<a href.+', string):
            print("URL Wrapped Already")
            print(string)
        else:
            string = string.replace('Google', '<a href="http://www.google.com">Google</a>')
            print("URL Not Wrapped:")
            print(string)

    check_url(x)
    check_url(y)
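
The check above looks at the whole string at once, so it cannot handle a page that contains both linked and unlinked occurrences of the keyword. One possible regex-only refinement is to match either an existing <a>...</a> element, any other tag, or the keyword itself, and to replace only when the keyword branch matched. The sketch below assumes simple HTML without nested links and a keyword that begins and ends with a word character; link_keyword is a hypothetical helper name, not part of any library.

```python
import re


def link_keyword(html, keyword, url):
    """Wrap keyword in a link unless it is inside an <a> element or a tag."""
    # Branch 1: an existing link (kept whole) or any other tag, which
    # protects attribute values such as src, alt and title.
    skip = r"<a\b[^>]*>.*?</a>|<[^>]+>"
    # Branch 2: the keyword as a whole word.
    pattern = re.compile(
        r"(%s)|\b(%s)\b" % (skip, re.escape(keyword)),
        re.IGNORECASE | re.DOTALL,
    )

    def replace(match):
        if match.group(1) is not None:
            return match.group(1)  # existing link or tag: leave untouched
        return '<a href="%s">%s</a>' % (url, match.group(2))

    return pattern.sub(replace, html)


page = 'I love Google and <a href="http://www.google.com">Google</a>'
print(link_keyword(page, "Google", "http://www.google.com"))
```

Because the regex engine scans left to right, an existing link or tag is consumed as one match before the keyword inside it can be seen. This is still a heuristic, and an HTML parser remains the safer choice for anything beyond simple markup.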

I hope this answers your question!

Huh? I don't follow. I am not searching for a specific string. I want to replace keywords with urls ONLY if not already wrapped in a url. — Susan Mayer, Jun 09 '12 at 11:39
