SOLR - how to highlight exact phrases for wildcard searching results

Question

This is my field type declared in schema:

<fieldType name="c_string" class="solr.TextField">
 <analyzer type="index">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.ReversedWildcardFilterFactory" />
 </analyzer>
 <analyzer type="query">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory" />
 </analyzer>
</fieldType>

I can search using wildcards without any problems, but I have a problem with the highlighting feature: Solr highlights the entire field value, not only the matched phrase. For example, my search query is title:Keyword*, so Solr returns only documents matching the wildcard. But the highlight is:

"title": [
        "<em>Keyword and the rest of title</em>"

but I want:

"title": [
        "<em>Keyword</em> and the rest of title"

This works as I want if I use solr.EdgeNGramFilterFactory like this:

<fieldType name="text_general_edge_ngram" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
   </analyzer>
</fieldType>

With this field type the highlighting is correct, but wildcards are ignored. Solr always behaves as if a trailing wildcard were present: title:Keyword and title:Keyword* return the same results, whereas title:Keyword should obviously not match anything here.

Do you have any tips?

[added] Example query:

select?q=text_dsc%3A*dobry*&rows=200&wt=json&indent=true&hl=true&hl.fl=text_dsc&hl.simple.pre=<em>&hl.simple.post=<%2Fem>

Example highlight result:

  "highlighting":{
    "25352":{
      "text_dsc":["<em>14276|\nDzień dobry -  dokument testowy. \n\n \n\nTEST. \n\n\n</em>"]},
    "25353":{
      "text_dsc":["<em>14276|\nDzień dobry -  dokument testowy. \n\n \n\nTEST. \n\n\n</em>"]},
    "26693":{
      "text_dsc":["<em>14276|\nDzień dobry -  dokument testowy. \n\n \n\nTEST. \n\n\n</em>"]}}}

As you can see, the query string is dobry, but the entire field is highlighted. Why? If I use solr.EdgeNGramFilterFactory as mentioned above, the highlight for the same query is correct, but searching is incorrect (it always behaves like a wildcard search).

Can you please post an example query, especially the highlighting parameters? — lxg, Sep 15 '14 at 13:07
Question updated. Query is generated by solr webadmin interface. — user1209216, Sep 15 '14 at 13:24

score 3 · Answer 1 · edited May 23 '17 at 12:07

Use StandardTokenizerFactory and you will get the desired output:

<fieldType name="c_string" class="solr.TextField">
 <analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.ReversedWildcardFilterFactory" />
 </analyzer>
 <analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory" />
 </analyzer>
</fieldType>

The difference between StandardTokenizerFactory and KeywordTokenizerFactory is explained very well in this question: difference between StandardTokenizerFactory and KeywordTokenizerFactory in SoLR

UPDATE

Index text_dsc in two different fields, using field types like these:

<fieldType name="text_dsc" class="solr.TextField">
 <analyzer type="index">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.ReversedWildcardFilterFactory" />
 </analyzer>
 <analyzer type="query">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory" />
 </analyzer>
</fieldType>



<fieldType name="text_dsc_standard" class="solr.TextField">
 <analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.ReversedWildcardFilterFactory" />
 </analyzer>
 <analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory" />
 </analyzer>
</fieldType>

And in your search query set hl.fl=text_dsc_standard.
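
The update above defines two field types but not the fields that use them. A minimal sketch of the missing declarations might look like the following. It assumes that the field names simply reuse the type names and that copyField is used to feed the second field; neither detail is stated in the answer, so adjust both to your schema.

```xml
<!-- Sketch only: field names are assumed to mirror the fieldType names above. -->

<!-- Search field: a single keyword token, so wildcards behave like SQL LIKE. -->
<field name="text_dsc" type="text_dsc" indexed="true" stored="true"/>

<!-- Highlight field: word tokens; it has to be stored so the highlighter can read its text. -->
<field name="text_dsc_standard" type="text_dsc_standard" indexed="true" stored="true"/>

<!-- Copy the raw input of text_dsc into the highlight field at index time. -->
<copyField source="text_dsc" dest="text_dsc_standard"/>
```

With this in place the query would keep searching text_dsc (for example q=text_dsc:*dobry*) while requesting highlights with hl.fl=text_dsc_standard. This relies on hl.requireFieldMatch being left at its default of false, so that terms from a query on one field can be highlighted in another. Existing documents need to be reindexed after the schema change.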

Sorry, it doesn't work as it should. It always returns results as for a sub-word match. For example, title:Keyword* and title:Keyword return the same results. That's unacceptable: title:Keyword should not return anything, because there is no wildcard — user1209216, Sep 16 '14 at 10:50
Look, keyword* means: search for keyword in the title, and the rest can be anything. On the other hand, if you search for keyword and the title contains keyword, then it is part of the search result. Logically both will return the same result. For example: if you search keywd it will not show any result, and if you search keyw**d it will match results containing anything in between. Correct me if I am wrong. — Alok Chaudhary, Sep 16 '14 at 11:28
That's not the purpose of wildcards for me. Example field content: "One, Two". The query "One" should not return anything because it doesn't match the phrase "One, Two". But the query "One*" should return a result, because it means "One" followed by any text. With your solution, both return the same results, and that's wrong. — user1209216, Sep 16 '14 at 11:38
To be clear: I need a full equivalent of SQL LIKE syntax with wildcards. Substring matching, regardless of the string content. — user1209216, Sep 16 '14 at 11:53
