Mapping plain text back into HTML document

Question

Situation: I have a group of strings representing Named Entities (NEs) that were extracted from what used to be an HTML doc. I also have the original HTML doc, the stripped-of-all-markup plain text that was fed to the NER engine, and the offset/length of each string in the stripped file.

I need to annotate the original HTML doc with highlighted instances of the NEs. To do that I need to do the following:

1. Find the start / end points of the NE strings in the HTML doc. Something that resulted in a DOM Range Object would probably be ideal.
2. Given that Range object, apply a styling (probably using something like <span class="ne-person" data-ne="123">...</span>) to the range. This is tricky because there is no guarantee that the range won't include multiple DOM elements (<a>, <strong>, etc.) and the span needs to start/stop correctly within each containing element so I don't end up with totally bogus HTML.

Any solutions (full or partial) are welcome. The back-end is mostly Python/Django, and the front-end is using jQuery. We would rather do this on the back-end, but I'm open to anything.

(I was a bit iffy on how to tag this question, so feel free to re-tag it.)

score 2 · Accepted Answer · edited May 23 '17 at 11:49

Use a range utility method plus an annotation library such as one of the following:

answered Aug 30 '12 at 18:22 · Paul Sweatte (last edited by Community)

Thanks for the links! I'll check them out in the next couple days. – Peter Rowell Aug 30 '12 at 18:34
Sorry for the late acceptance. Your answer pointed me in some interesting directions -- not exactly where I thought they would, but interesting nevertheless. – Peter Rowell Jan 07 '13 at 17:33

score 1 · Answer 2 · answered Jan 25 '15 at 00:17

The free software Rangy JavaScript library is your friend. Regarding your two tasks:

1. "Find the start / end points of the […] strings in the HTML doc." You can use Range#findText() from the TextRange extension. It does return an object compatible with a DOM Level 2 Range [source].
2. "Given that Range object, apply a styling […] to the range." This can be handled with the Rangy Highlighter module. If necessary, it uses multiple DOM elements for the highlighting so that the DOM tree structure stays intact.

Discussion: Rangy is a cross-browser implementation of the DOM Level 2 range utility methods proposed by @Paul Sweatte. An annotation library would be a further layer on top of the range library's functionality; for example, Rangy will be the basis of Annotator 2.0 [source]. An annotation library is just not required in your case, since you only want to render highlights, not allow users to add them.
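If the work has to stay on the Python back-end mentioned in the question, the same idea (one wrapper element per text node that the range touches) can be sketched without Rangy. What follows is only a rough sketch built on the standard library's html.parser, not anything Rangy itself does. It assumes that the stripped plain text is exactly the concatenation of the document's text nodes in document order, with character references decoded and script/style content left out, and that the entities do not overlap. The names EntityAnnotator and annotate, and the (offset, length, css_class, entity_id) tuple layout, are made up for illustration.

```python
from html import escape
from html.parser import HTMLParser


class EntityAnnotator(HTMLParser):
    """Re-emit the HTML, wrapping plain-text offset ranges in <span> tags."""

    def __init__(self, entities):
        super().__init__(convert_charrefs=True)
        # Hypothetical layout: (offset, length, css_class, entity_id),
        # offsets counted in the stripped plain text. Assumed non-overlapping.
        self.entities = sorted(entities)
        self.pos = 0        # running offset into the stripped plain text
        self.out = []       # output fragments, joined at the end
        self.raw_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag, verbatim
        if tag in ("script", "style"):
            self.raw_depth += 1

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag in ("script", "style") and self.raw_depth:
            self.raw_depth -= 1

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_data(self, data):
        if self.raw_depth:
            # Assumption: script/style text is not part of the stripped text.
            self.out.append(data)
            return
        start, end = self.pos, self.pos + len(data)
        cursor = start
        for off, length, css, ent_id in self.entities:
            # Part of this entity that falls inside the current text node.
            lo, hi = max(off, cursor), min(off + length, end)
            if lo >= hi:
                continue
            self.out.append(escape(data[cursor - start:lo - start], quote=False))
            self.out.append(
                f'<span class="{escape(str(css))}" data-ne="{escape(str(ent_id))}">'
            )
            self.out.append(escape(data[lo - start:hi - start], quote=False))
            self.out.append("</span>")
            cursor = hi
        self.out.append(escape(data[cursor - start:], quote=False))
        self.pos = end


def annotate(html_doc, entities):
    """Return html_doc with every entity range wrapped in spans."""
    parser = EntityAnnotator(entities)
    parser.feed(html_doc)
    parser.close()
    return "".join(parser.out)


# "John Smith" starts at offset 5 of the stripped text "Meet John Smith today."
doc = '<p>Meet <a href="/x">John <strong>Smith</strong></a> today.</p>'
print(annotate(doc, [(5, 10, "ne-person", 123)]))
```

Because a span is never allowed to cross a tag boundary, an entity that runs through several elements ends up as several spans sharing the same data-ne value: in the example, "John " is wrapped inside the <a> and "Smith" separately inside the <strong>. If the NER input was produced by a stripper that normalizes whitespace or drops other content, the offsets will drift and an extra alignment step would be needed first.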

This sounds like an excellent possibility. Once I recover my health and get going on my project again it will be near the top of my Check This Out list. Thanks! – Peter Rowell Jan 26 '15 at 20:42
