We have a glossary with up to 2000 terms. Each glossary term consists of one, two, or three words, separated either by whitespace or by a dash.
We are looking for a way to highlight all of these terms inside a longer HTML document (up to 100 KB of HTML markup) in order to generate a static HTML page with the terms highlighted.
The constraints for a working solution are the large number of glossary terms and the length of the HTML documents. What would be the blueprint for an efficient solution in Python?
Right now I am thinking about parsing the HTML document using lxml, iterating over all text nodes and then matching the contents within each text node against all glossary terms.
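To make the matching step concrete, here is a minimal sketch of what I have in mind for a single text node, using only the standard `re` module. It assumes that all terms can be combined into one compiled alternation (longest terms first), that matching should be case-insensitive, and that whitespace and a dash are interchangeable between the words of a term. The names `build_pattern` and `split_on_terms` are made up for illustration, and the lxml traversal itself is left out.

```python
import re


def build_pattern(terms):
    """Compile one regex that matches any glossary term as a whole word/phrase."""
    # Longest terms first, so a three-word term wins over its one-word prefix.
    ordered = sorted({t.strip() for t in terms if t.strip()}, key=len, reverse=True)
    parts = []
    for term in ordered:
        words = re.split(r"[\s-]+", term)
        # Assumption: words of a term may be joined by whitespace or by a dash.
        parts.append(r"(?:\s+|-)".join(re.escape(w) for w in words))
    # Lookarounds keep "cache" from matching inside "cacheline".
    return re.compile(r"(?<!\w)(?:" + "|".join(parts) + r")(?!\w)", re.IGNORECASE)


def split_on_terms(text, pattern):
    """Return [(segment, is_term), ...] covering the whole text of one node."""
    segments = []
    pos = 0
    for match in pattern.finditer(text):
        if match.start() > pos:
            segments.append((text[pos:match.start()], False))
        segments.append((match.group(0), True))
        pos = match.end()
    if pos < len(text):
        segments.append((text[pos:], False))
    return segments


if __name__ == "__main__":
    pattern = build_pattern(["cache", "cache line", "write-back cache"])
    for segment, is_term in split_on_terms("A write back cache evicts a cache line.", pattern):
        print(repr(segment), is_term)
```

The segments flagged as terms would then be wrapped in a highlighting element and the rest re-attached as plain text. The pattern is compiled once and reused for every text node; whether a single alternation of 2000 terms is fast enough on a 100 KB document is exactly what I am unsure about.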
Client-side (browser) highlighting on the fly is not an option: IE complains about long-running scripts and hits its script timeout, so that approach is unusable in production.
Any better ideas?