I have a field that may contain HTML code entered by users. If I use the simple highlighter, it does not escape the input before adding the <em> tags. E.g. if the input is

"This is a <caption>"

and I search for "caption", I get:

"This is a <<em>caption</em>>"

But I want to get:

"This is a &lt;<em>caption</em>&gt;"

When rendered as HTML, this looks the same as the input, with the matched word highlighted.

Oliv
  • I think the best way would be to just index those characters escaped. How can Solr know that they are tags and escape them? I mean, you're indexing `<` not `&lt;`, right? – javanna Aug 14 '12 at 15:35
  • And what if I then search for `"lt"`? This is just a workaround, not a solution. A better workaround is to let Solr surround matches with something non-HTML (say "2*@(4)m@"), then escape and replace it with `<em>`. But I want a real solution. I think one exists and is just not working for me somehow; otherwise I would consider it a bug: you should generally never add HTML tags to unescaped input... – Oliv Aug 15 '12 at 06:05
  • There's a difference between what you index and what you store. You should have a look at the [HTMLStripCharFilterFactory](http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/#solr.HTMLStripCharFilterFactory) for the indexing part. If you send escaped characters to Solr, that's how they will be stored, and then shown and highlighted. But if you use that factory, the named entities should be replaced within the index with the related chars, not sure about `<` and `>`. Just give it a try and let me know how it went! – javanna Aug 15 '12 at 08:02

3 Answers


One technique is to use some other sentinel string to indicate highlighting. See hl.simple.pre and hl.simple.post. That way you can perform escaping first, without losing your highlighting, and then replace the sentinels with highlighting markup as a final step.

For example, the Sunspot Solr client for Ruby uses @@@hl@@@ for the hl.simple.pre param, and @@@endhl@@@ for the hl.simple.post param. Using these values…

  • Solr returns: This is a <@@@hl@@@caption@@@endhl@@@>
  • HTML escaping: This is a &lt;@@@hl@@@caption@@@endhl@@@&gt;
  • Replace the sentinels: This is a &lt;<em>caption</em>&gt;
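
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not Sunspot's actual code: the sentinel values come from the answer, while the function name `render_snippet` is made up for the example, and the snippet is assumed to arrive as a plain, unescaped string.

```python
import html

# Sentinel values from the answer; they must match what is passed to Solr
# as hl.simple.pre and hl.simple.post.
HL_PRE = "@@@hl@@@"
HL_POST = "@@@endhl@@@"


def render_snippet(raw_snippet: str) -> str:
    """Escape a highlighted snippet, then swap the sentinels for <em> tags.

    Hypothetical helper: assumes raw_snippet is the unescaped text that
    Solr returned, with the sentinels around each match.
    """
    # Step 1: escape &, < and > in the user-supplied text.
    escaped = html.escape(raw_snippet, quote=False)
    # Step 2: the sentinels contain no HTML-special characters, so they
    # survive escaping and can now be replaced with real markup.
    return escaped.replace(HL_PRE, "<em>").replace(HL_POST, "</em>")


if __name__ == "__main__":
    print(render_snippet("This is a <@@@hl@@@caption@@@endhl@@@>"))
```

Note that this does not address the objection in the comment below: if the indexed text itself contains a sentinel string, the replacement step will still turn it into markup.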
Nick Zadrozny
  • That's what I did, but I thought it was just a misconfiguration on my part. In the `searchComponent` in the example server there is ``, so I thought it just had to be enabled somehow... If the user enters `@@@hl@@@` in the input, they will break XML well-formedness, as it will generate an unclosed tag. I think it is a bug if it cannot be done better. – Oliv Aug 20 '12 at 08:13

Solr 4.3.1 has an option to enable a specific encoder for the highlighting, so that it produces XML/HTML-escaped snippets. Put

<str name="hl.encoder">html</str> 

below /config/requestHandler[@name="/select"]/lst[@name="defaults"] in solrconfig.xml. The parameter can also be set in the URL with &hl.encoder=html. The standard solrconfig.xml contains a definition for this encoder:

<!-- Configure the standard encoder -->
<encoder name="html" class="solr.highlight.HtmlEncoder" />

Example: "X < Y < Z" will be highlighted as

X &lt; <em>Y</em> &lt; Z

when searching for "Y". The Solr XML-response contains

X &amp;lt; &lt;em&gt;Y&lt;/em&gt; &amp;lt; Z

in the str-element, of course.
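
To make the two layers of escaping concrete, here is a small Python sketch. It only assumes what the answer states: the snippet sits in a `str` element of the XML response and `hl.encoder=html` is passed as a request parameter. The `hl=true` parameter and the helper names are additions for illustration, and a real response wraps the element in more structure than shown here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def highlight_params(query: str) -> str:
    """Build a query string that asks for HTML-encoded highlighting.

    hl.encoder=html is taken from the answer; hl=true is an assumed
    companion parameter that switches highlighting on.
    """
    return urlencode({"q": query, "hl": "true", "hl.encoder": "html"})


def snippet_from_str_element(xml_fragment: str) -> str:
    """Parse a single <str> element and return its text content.

    XML parsing undoes the outer (XML) layer of escaping, leaving the
    HTML-escaped snippet ready to be placed into a page.
    """
    return ET.fromstring(xml_fragment).text


if __name__ == "__main__":
    print(highlight_params("Y"))
    fragment = "<str>X &amp;lt; &lt;em&gt;Y&lt;/em&gt; &amp;lt; Z</str>"
    print(snippet_from_str_element(fragment))
```

After parsing, the text is the HTML snippet from the example above, with the original `<` characters still escaped and only the `<em>` tags left as markup.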

user2043553

You can use String.replace to replace "<<" with "&lt;<" and ">>" with ">&gt;". If you need more specific replacements, you can add them as well.
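
A rough sketch of that replacement, written in Python rather than with Java's `String.replace` (the function name is hypothetical). It also shows the limit of the approach: only brackets that directly touch a highlight tag are rewritten.

```python
def patch_adjacent_brackets(highlighted: str) -> str:
    """Escape angle brackets that directly adjoin a highlight tag.

    Hypothetical helper: it rewrites only the "<<" and ">>" pairs, so any
    other markup in the text passes through unescaped.
    """
    return highlighted.replace("<<", "&lt;<").replace(">>", ">&gt;")


if __name__ == "__main__":
    # The case from the question: brackets touch the <em> tags.
    print(patch_adjacent_brackets("This is a <<em>caption</em>>"))
    # A tag elsewhere in the text is left untouched, so it stays unescaped.
    print(patch_adjacent_brackets("See <b> and <em>caption</em>"))
```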

Elliott Hill
  • This is working only in this specific case. What if input is `"This is a caption "`? – Oliv Aug 15 '12 at 05:56