4

I'm looking for the keyword "sales" and I want to get the nearest "http://www.somewebsite.com", even if there are multiple links in the file. I want the nearest link, not the first link, which means I need to search for the link that comes just before the keyword match.

This doesn't work...

regex = (http|https)://[-A-Za-z0-9./]+.*(?!((http|https)://[-A-Za-z0-9./]+))sales

What's the best way to find a link that is closest to a keyword?

Asher
  • Please test your code with ... if it works let me know... I'm kinda curious... somestring = "hey whats up... http://www.firstlink.com some other test http://www.secondlink.com then mykeyword" keyword = "mykeyword" – Asher Nov 30 '12 at 03:40
  • It's better to search for relationships within each page. If others have ideas on this please let me know! :) – Asher May 06 '13 at 03:02

4 Answers

3

It is generally much easier and more robust to use an HTML parser rather than regex.

Using the third-party module lxml:

import lxml.html as LH

content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''

doc = LH.fromstring(content)    
for url in doc.xpath('''
    //*[contains(text(),"sales")]
    /preceding::*[starts-with(@href,"http")][1]/@href'''):
    print(url)

yields

http://www.somewebsite.com

I find lxml (and XPath) a convenient way to express what elements I'm looking for. However, if installing a third-party module is not an option, you could also accomplish this particular job with HTMLParser from the standard library:

import HTMLParser
import contextlib

class MyParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.last_link = None

    def handle_starttag(self, tag, attrs):
        # remember the href of the most recent tag that has one
        attrs = dict(attrs)
        if 'href' in attrs:
            self.last_link = attrs['href']

content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''

idx = content.find('sales')

# feed the parser only the text up to the keyword; the last href it sees
# is then the link closest before the keyword
with contextlib.closing(MyParser()) as parser:
    parser.feed(content[:idx])
    print(parser.last_link)
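(Side note, not part of the original answer: on Python 3 the same approach works, you just import the class from html.parser instead. A minimal sketch, reusing the content and idx defined above:)

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.last_link = None

    def handle_starttag(self, tag, attrs):
        # remember the href of the most recent tag that has one
        attrs = dict(attrs)
        if 'href' in attrs:
            self.last_link = attrs['href']

parser = MyParser()
parser.feed(content[:idx])
print(parser.last_link)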

Regarding the XPath used in the lxml solution, it has the following meaning:

 //*                              # Find all elements
   [contains(text(),"sales")]     # whose text content contains "sales"
   /preceding::*                  # search the elements that precede it in the document
     [starts-with(@href,"http")]  # keeping only those whose href attribute starts with "http"
       [1]                        # and take just the nearest such element ([1] counts backwards from the context node on the preceding axis)
         /@href                   # then return the value of its href attribute
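(If you want to restrict the match to <a> tags specifically, a small variant of the same XPath should work; this tweak is mine, not part of the original answer:)

for url in doc.xpath('''
    //*[contains(text(),"sales")]
    /preceding::a[starts-with(@href,"http")][1]/@href'''):
    print(url)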
unutbu
  • Ok, let me test this out and see if it works... I didn't realize python had such a library. – Asher Jan 24 '12 at 11:46
  • I'm looking for something that doesn't involve a library outside the Python standard library. I'll look into it but this isn't really what I need is it? – Asher Jan 25 '12 at 01:46
  • An HTML/XML parser and XPATH are definitely the way to go if you're going to keep working with (extracting data from) HTML/XML. – MattH Jan 25 '12 at 20:56
  • @VSH please PLEASE don't do this with regex PLEASE. It's awful, and you're going to make a mistake. See the earlier comment for a good example of why not. There are plenty of other libraries if you have something against lxml. – Matt Luongo Jan 30 '12 at 02:36
0

I don't think you can do this with regex alone (especially the "look before the keyword match" part), since a regex has no notion of comparing distances.

I think you're best off doing something like this:

  • find all occurrences of sales and record their substring indices, called salesIndex
  • find all occurrences of https?://[-A-Za-z0-9./]+ and record their substring indices, called urlIndex
  • loop through salesIndex; for each location i in salesIndex, find the closest entry in urlIndex.

Depending on how you want to judge "closest", you may need to compare both the start and end indices of the sales and http... occurrences: that is, find the URL whose end index is closest to the start index of the current occurrence of sales, find the URL whose start index is closest to its end index, and pick whichever is closer.

You can use matches = re.finditer(pattern, string, re.IGNORECASE) to get an iterator of matches, and then match.span() to get the start/end substring indices of each match in matches, as in the sketch below.
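A minimal sketch of that approach (my own illustration, using the question's URL pattern and keyword; the sample string is the one from the comments with "sales" substituted for the keyword):

import re

text = "hey whats up... http://www.firstlink.com some other test http://www.secondlink.com then sales"

url_spans = [m.span() for m in re.finditer(r'(http|https)://[-A-Za-z0-9./]+', text, re.IGNORECASE)]
key_spans = [m.span() for m in re.finditer(r'sales', text, re.IGNORECASE)]

for k_start, k_end in key_spans:
    # distance from a URL to this keyword occurrence, measured end-to-start
    def distance(span):
        u_start, u_end = span
        return k_start - u_end if u_end <= k_start else u_start - k_end
    nearest_start, nearest_end = min(url_spans, key=distance)
    print(text[nearest_start:nearest_end])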

mathematical.coffee
  • This is what I may do. I prefer something where I can use regex or do a regex look behind. I may have to store the integer position of all the links and compare those to the indexes of the keyword matches. Yuck. – Asher Jan 27 '12 at 05:12
0

Building on what mathematical.coffee suggested, you could try something along these lines:

import re
myString = "" ## the string you want to search

link_matches = re.finditer('(http|https)://[-A-Za-z0-9./]+',myString,re.IGNORECASE)
sales_matches = re.finditer('sales',myString,re.IGNORECASE)

link_locations = []

for match in link_matches:
    link_locations.append([match.span(),match.group()])

for match in sales_matches:
    match_loc = match.span()
    distances = []
    for link_loc in link_locations:
        if match_loc[0] > link_loc[0][1]: ## if the link ends before your keyword starts
            ## append the distance between the END of the link and the START of the keyword
            distances.append(match_loc[0] - link_loc[0][1])
        else:
            ## append the distance between the END of the keyword and the START of the link
            distances.append(link_loc[0][0] - match_loc[1])

    for d in range(len(distances)):
        if distances[d] == min(distances):
            print("Closest Link: " + link_locations[d][1] + "\n")
            break
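For example (my own illustration, reusing the sample string from the question's comments with the keyword changed to "sales"), setting myString as follows before running the snippet should print the second link:

myString = ("hey whats up... http://www.firstlink.com "
            "some other test http://www.secondlink.com then sales")
## expected output: Closest Link: http://www.secondlink.com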
Moritz
  • let me look over this code carefully. It looks like this may be correct. I'm just worried about doing a lot of indexing of links that may not be needed. However, the truth is I never really know how far a link is from a keyword match pair. – Asher Jan 27 '12 at 05:14
-1

I tested out this code and it seems to be working...

import re

def closesturl(keyword, website):
    keylist = []
    urllist = []
    closest = []
    urls = []
    urlregex = r"(http|https)://[-A-Za-z0-9./]+"
    urlmatches = re.finditer(urlregex, website, re.IGNORECASE)
    keymatches = re.finditer(keyword, website, re.IGNORECASE)
    # record the start/end indices of every keyword match
    for n in keymatches:
        keylist.append([n.start(), n.end()])
    # record the start/end indices of every URL match
    if len(keylist) > 0:
        for m in urlmatches:
            urllist.append([m.start(), m.end()])
    if len(keylist) > 0 and len(urllist) > 0:
        for i in range(0, len(keylist)):
            # start with the first URL as the closest candidate
            closest.append(abs(urllist[0][0] - keylist[i][0]))
            urls.append(website[urllist[0][0]:urllist[0][1]])
            for j in range(1, len(urllist)):
                distance = abs(urllist[j][0] - keylist[i][0])
                if distance < closest[i]:
                    closest[i] = distance
                    urls[i] = website[urllist[j][0]:urllist[j][1]]
                elif distance > closest[i]:
                    break  # local minimum / inflection point: break from url list
        return urls
    else:
        return ""

somestring = "hey whats up... http://www.firstlink.com some other test http://www.secondlink.com then mykeyword"
keyword = "mykeyword"
print(closesturl(keyword, somestring))

The above when run shows... http://www.secondlink.com.

If someone's got ideas on how to speed up this code that would be awesome!
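(One possible speed-up, my own suggestion rather than part of this answer: since finditer yields matches in order of position, the URL start indices are already sorted, so you can binary-search them with the standard-library bisect module instead of scanning every URL for every keyword.)

import bisect
import re

def closest_url(keyword, text):
    url_spans = [m.span() for m in re.finditer(r'(http|https)://[-A-Za-z0-9./]+', text, re.IGNORECASE)]
    starts = [s for s, e in url_spans]  # already sorted by position
    results = []
    for m in re.finditer(keyword, text, re.IGNORECASE):
        i = bisect.bisect_left(starts, m.start())
        # only the URL just before and the URL just after the keyword can be closest
        candidates = url_spans[max(i - 1, 0):i + 1]
        if candidates:
            s, e = min(candidates, key=lambda span: abs(span[0] - m.start()))
            results.append(text[s:e])
    return results

print(closest_url("mykeyword", somestring))  # should print ['http://www.secondlink.com']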

Thanks V$H.

snim2
Asher
    In general it's not a great idea to use regular expressions to parse HTML, although they may be reliable for very simple tasks. See this (famous!) answer for more details: http://stackoverflow.com/a/1732454/342327 – snim2 Jan 30 '12 at 01:04
  • -1 (after reading the "famous answer"... who can be left unchanged??) – lajarre Mar 22 '13 at 19:18