Situation: I have a group of strings that represent Named Entities (NEs) extracted from what used to be an HTML doc. I also have the original HTML doc, the stripped-of-all-markup plain text that was fed to the NER engine, and the offset/length of each string in the stripped file.

I need to annotate the original HTML doc with highlighted instances of the NEs. To do that I need to do the following:

  1. Find the start / end points of the NE strings in the HTML doc. Something that resulted in a DOM Range Object would probably be ideal.

  2. Given that Range object, apply a styling (probably using something like <span class="ne-person" data-ne="123">...</span>) to the range. This is tricky because there is no guarantee that the range won't include multiple DOM elements (<a>, <strong>, etc.) and the span needs to start/stop correctly within each containing element so I don't end up with totally bogus HTML.

Any solutions (full or partial) are welcome. The back-end is mostly Python/Django, and the front-end is using jQuery. We would rather do this on the back-end, but I'm open to anything.

(I was a bit iffy on how to tag this question, so feel free to re-tag it.)

Peter Rowell

2 Answers

Use a range utility method plus an annotation library such as one of the following:

Community
Paul Sweatte

The free software Rangy JavaScript library is your friend. Regarding your two tasks:

  1. Find the start / end points of the […] strings in the HTML doc. You can use Range#findText() from the TextRange extension. It indeed results in a DOM Level 2 Range compatible object [source].

  2. Given that Range object, apply a styling […] to the range. This can be handled with the Rangy Highlighter module. If necessary, it will use multiple DOM elements for the highlighting so that the DOM tree structure stays valid.

Discussion: Rangy is a cross-browser implementation of the DOM Level 2 range utility methods proposed by @Paul Sweatte. Using an annotation library would be a further extension on range library functionality; for example, Rangy will be the basis of Annotator 2.0 [source]. It's just not required in your case, since you only want to render highlights, not allow users to add them.
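If the back-end route from the question is preferred, the same two steps can be roughed out in Python with only the standard library. This is a minimal sketch, not a tested production approach. It assumes the stripped plain text is exactly the concatenation of the document's text nodes with character references decoded (and with `<script>`/`<style>` content left out), that the entity ranges do not overlap, and that the class and id values are safe to put in attributes. The names `EntityHighlighter` and `highlight`, and the `(offset, length, css_class, entity_id)` tuple shape, are made up for illustration. Because a span is only ever opened and closed inside a single text node, a range that crosses `<a>` or `<strong>` boundaries ends up as several spans rather than one broken one.

```python
from html import escape
from html.parser import HTMLParser


class EntityHighlighter(HTMLParser):
    """Re-emit HTML, wrapping plain-text offset ranges in <span> elements."""

    def __init__(self, entities):
        # entities: (offset, length, css_class, entity_id) tuples (hypothetical shape)
        super().__init__(convert_charrefs=True)
        self.entities = sorted(entities)
        self.pos = 0        # current offset into the stripped plain text
        self.out = []       # pieces of the rewritten document
        self.raw_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text, unchanged
        if tag in ("script", "style"):
            self.raw_depth += 1

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag in ("script", "style") and self.raw_depth:
            self.raw_depth -= 1

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_data(self, data):
        if self.raw_depth:
            # Assumption: script/style content is not part of the stripped text.
            self.out.append(data)
            return
        start, end = self.pos, self.pos + len(data)
        cursor = start
        for off, length, css, ne_id in self.entities:
            # Clip the entity range to the part that lies in this text node.
            lo, hi = max(off, cursor), min(off + length, end)
            if lo >= hi:
                continue
            self.out.append(escape(data[cursor - start:lo - start], quote=False))
            self.out.append(f'<span class="{css}" data-ne="{ne_id}">')
            self.out.append(escape(data[lo - start:hi - start], quote=False))
            self.out.append("</span>")
            cursor = hi
        self.out.append(escape(data[cursor - start:], quote=False))
        self.pos = end


def highlight(html_text, entities):
    parser = EntityHighlighter(entities)
    parser.feed(html_text)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.out)


doc = '<p>Meet <a href="/x">Peter <strong>Rowell</strong></a> &amp; co.</p>'
# "Peter Rowell" starts at offset 5 of "Meet Peter Rowell & co." and is 12 long.
print(highlight(doc, [(5, 12, "ne-person", 123)]))
```

In this example the entity spans the boundary of the `<strong>` element, so the sketch emits one span around "Peter " and a second one inside `<strong>` around "Rowell". If the real stripper collapsed whitespace or dropped other elements, the offsets would drift and the position counter would need to mirror whatever the stripper did.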

tanius
  • This sounds like an excellent possibility. Once I recover my health and get going on my project again it will be near the top of my Check This Out list. Thanks! – Peter Rowell Jan 26 '15 at 20:42