Highlighting glossary terms inside an HTML document

Question

We have a glossary with up to 2000 terms. Each glossary term may consist of one, two or three words, separated by either whitespace or a dash.

We are looking for a way to highlight all terms inside a longer HTML document (up to 100 KB of HTML markup) in order to generate a static HTML page with the terms highlighted.

The constraints for a working solution are a large number of glossary terms and long HTML documents. What would be the blueprint for an efficient solution in Python?

Right now I am thinking about parsing the HTML document using lxml, iterating over all text nodes and then matching the contents within each text node against all glossary terms.

Client-side (browser) highlighting on the fly is not an option: IE complains about long-running scripts with a script timeout, which makes it unusable in production.

Any better idea?

The fact that I commented and did not answer should point out that I might not have had enough time on my hands to actually write a proper answer. See below. – Thomas Orozco, Dec 03 '11 at 11:34
You don't have to do the client-side highlighting in one loop. Use setTimeout to simulate co-routines. – Dykam, Dec 03 '11 at 13:53

Thomas Orozco · Answer 1 · 2011-12-03T12:06:56.427

You could use a parser to walk the tree recursively and replace only the nodes that are text.
Even so, there are several things you will need to account for:
- Not all text should be replaced (e.g. inline JavaScript).
- Some elements of the document might not need processing at all (e.g. headings).

Here's a quick, non-production-ready example of how you could achieve this:

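from bs4 import BeautifulSoup, Comment, NavigableString

html = """The HTML you need to parse"""

# Tags whose text must never be touched.
IGNORE_TAGS = ('script', 'style')

def parse_content(item, replace_what, replace_with, ignore_tags=IGNORE_TAGS):
    # Iterate over a copy, because replace_with() mutates item.contents.
    for content in list(item.contents):
        if isinstance(content, Comment):
            continue  # leave HTML comments untouched
        if isinstance(content, NavigableString):
            # Plain string replacement on the text node only.
            content.replace_with(content.replace(replace_what, replace_with))
        elif content.name not in ignore_tags:
            parse_content(content, replace_what, replace_with, ignore_tags)
    return item

soup = BeautifulSoup(html, 'html.parser')
body = soup.body or soup  # fall back to the whole tree for HTML fragments
replaced_content = parse_content(body, 'a', 'b')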

This should replace every occurrence of "a" with "b", while leaving untouched anything that is:
- Inside inline JavaScript or CSS (although inline JS or CSS should not appear in a document's body).
- A reference in a tag such as img, a...
- A tag itself

Of course, depending on your glossary, you will then need to make sure that you don't replace only part of a word with something else; to do this it makes sense to use a regex instead of content.replace.
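
The regex step could look roughly like the sketch below. It is only an illustration under a few assumptions: the helper names `build_pattern` and `highlight_text`, the `glossary` CSS class and the sample terms are all made up here, and the glossary is assumed to be non-empty and to consist of ordinary words. All terms are compiled into one pattern, so each text node is scanned once rather than once per term, and longer terms are tried first so that a multi-word term is not shadowed by one of its own words.

```python
import html
import re

def build_pattern(terms):
    """Compile one regex matching any glossary term as a whole word."""
    alternatives = []
    # Longest terms first, so "text node" wins over "text".
    for term in sorted(terms, key=len, reverse=True):
        words = re.split(r"[\s-]+", term.strip())
        # The words of a term may be separated by whitespace or a dash.
        alternatives.append(r"(?:\s+|-)".join(re.escape(w) for w in words))
    return re.compile(r"\b(?:%s)\b" % "|".join(alternatives), re.IGNORECASE)

def highlight_text(text, pattern, css_class="glossary"):
    """Return `text` as escaped HTML with every match wrapped in a span."""
    parts = []
    pos = 0
    for match in pattern.finditer(text):
        # Escape the untouched text between matches.
        parts.append(html.escape(text[pos:match.start()], quote=False))
        parts.append('<span class="%s">%s</span>'
                     % (css_class, html.escape(match.group(0), quote=False)))
        pos = match.end()
    parts.append(html.escape(text[pos:], quote=False))
    return "".join(parts)

pattern = build_pattern(["text node", "text", "glossary"])
print(highlight_text("A text-node in the Glossary & more text", pattern))
```

Note that the result is an HTML fragment, not plain text, so it would have to be parsed and inserted into the tree as markup; handing it to the parser as a string would most likely get the span tags escaped.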

score 0 · Answer 2 · answered Dec 03 '11 at 12:39

I think highlighting with client-side JavaScript is the best option. It saves your server processing time and bandwidth and, more importantly, keeps the HTML clean and usable for those who don't need the extra markup, for example when printing or converting to other formats.

To avoid timeouts, just split the job into chunks and process them one at a time from a function that reschedules itself with setTimeout. Nothing runs in parallel; the timer simply hands control back to the browser between chunks. Here's an example of this approach:

function hilite(terms, chunkSize) {

    // build one case-insensitive regex, escaping regex metacharacters in the terms

    var escaped = $.map(terms, function(term) {
        return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    });
    var pattern = new RegExp("\\b(" + escaped.join("|") + ")\\b", "gi");

    // collect all text nodes in the document, skipping script and style contents

    var textNodes = [];
    $("body").find("*").not("script, style").contents().each(function() {
        if (this.nodeType == 3)
            textNodes.push(this)
    });

    // process N text nodes at a time, surround terms with text "markers"

    function step() {
        for (var i = 0; i < chunkSize; i++) {
            if (!textNodes.length)
                return done();
            var node = textNodes.shift();
            node.nodeValue = node.nodeValue.replace(pattern, "\x1e$&\x1f");
        }
        setTimeout(step, 100);
    }

    // when done, replace "markers" with html
    // (note: this rewrites the whole body, so handlers bound to its elements are lost)

    function done() {
        $("body").html($("body").html().
            replace(/\x1e/g, "<b>").
            replace(/\x1f/g, "</b>")
        );
    }

    // let's go

    step()
}

Use it like this:

$(function() {
    hilite(["highlight", "these", "words"], 100)
})

Let me know if you have questions.

score -1 · Answer 3 · answered Dec 03 '11 at 10:18

How about going through each term in the glossary and then, for each term, using regex to find all occurrences in the HTML? You could replace each of those occurrences with the term wrapped in a span with a class "highlighted" that will be styled to have a background color.
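
One reading of this suggestion, sketched in Python for illustration only (the function name `naive_highlight` and the sample page are made up here): run one substitution pass per term over the raw markup. The example also shows the weakness raised in the comments on this answer, because a term that occurs inside an attribute gets wrapped as well and the markup breaks.

```python
import re

def naive_highlight(markup, terms):
    """Wrap every whole-word occurrence of each term, one pass per term."""
    for term in terms:
        pattern = re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)
        # \g<0> re-inserts the matched text with its original casing.
        markup = pattern.sub(r'<span class="highlighted">\g<0></span>', markup)
    return markup

# Works on plain text ...
print(naive_highlight("Fill in the form.", ["form"]))

# ... but on markup the title attribute is rewritten too, which breaks the tag.
print(naive_highlight('<p title="form help">Fill in the form.</p>', ["form"]))
```

With 2000 terms this also means 2000 passes over the document, and a later term such as "class" or "span" would match inside the spans inserted by earlier passes.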

U-DON


Then what if the glossary term is in the page's title or meta tags (which is to be expected), or worse, if the document is about HTML itself and, say, "form" is a glossary term? – Thomas Orozco Dec 03 '11 at 10:21
Good point. The regex could account for elements by not considering anything enclosed in "<>". – U-DON Dec 03 '11 at 10:24
Let me apologize for insisting, but using regex to parse HTML is [a bad idea](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). What's more, unless your problem is very simple (Which may or may not be the case here), you're probably not going to get it right unless you're a regex maestro. Parsing HTML **is** a solved issue, so you should use the right tools to do so. – Thomas Orozco Dec 03 '11 at 10:32
Alright. I was just throwing a suggestion out there, not particularly championing it as the best solution or anything. I'm no regex maestro myself, so I'm not even sure if I would get it right with this method. – U-DON Dec 03 '11 at 17:18
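
For what it's worth, the parser-based route argued for in the comments above can be sketched with nothing but Python's standard library. This is a rough illustration rather than a complete solution: the class name `GlossaryHighlighter`, the helper `highlight_html` and the skip list are assumptions, the `highlighted` class is borrowed from the answer above, and processing instructions and other rare constructs are not handled. The regex only ever sees text between tags, so attributes, the title and script contents stay as they are.

```python
import re
from html.parser import HTMLParser

class GlossaryHighlighter(HTMLParser):
    """Re-emit the document, wrapping glossary terms found in text only."""

    SKIP = {"script", "style", "title"}  # assumption: never highlight in here

    def __init__(self, pattern):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.pattern = pattern
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # original tag, unchanged

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.skip_depth == 0:
            data = self.pattern.sub(
                r'<span class="highlighted">\g<0></span>', data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

def highlight_html(markup, pattern):
    """Run the highlighter over a whole document and return the new markup."""
    parser = GlossaryHighlighter(pattern)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)

page = ('<html><head><title>form basics</title></head><body>'
        '<p class="form">A form &amp; a <b>Form</b>.</p>'
        '<script>var form = 1;</script></body></html>')
print(highlight_html(page, re.compile(r"\bform\b", re.IGNORECASE)))
```

One limitation of this sketch: a multi-word term that is split by an entity or an inline tag will not be matched, because each run of text is handled on its own.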
