In Solr, when merging tokens with solr.ShingleFilterFactory, the filter may generate multiple shingles depending on minShingleSize, maxShingleSize, and the tokens being merged. Because of this, my search fails. How can I merge multiple tokens into a single token so that my search works? Here are my settings:

<fieldType name="text_ngram" class="solr.TextField">
    <analyzer type="index">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\b \b" replacement=""/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
        <filter class="solr.ShingleFilterFactory" tokenSeparator="" minShingleSize="2" maxShingleSize="7" outputUnigrams="false"/>
        <filter class="solr.LengthFilterFactory" min="6" max="7"/>
    </analyzer>
</fieldType>

Here is the debug output for the query name_ngram:"our G20 9NS":

"debug": {
    "rawquerystring": "name_ngram:\"our G20 9NS\"",
    "querystring": "name_ngram:\"our G20 9NS\"",
    "parsedquery": "PhraseQuery(name_ngram:\"rg209ns g209ns\")",
    "parsedquery_toString": "name_ngram:\"rg209ns g209ns\"",
    "explain": {},

Thanks in advance.

2 Answers

I was able to resolve this problem by moving the synonym mapping out of the Solr config. I wrote some custom code that takes care of it. Here is the final schema:

<!-- Added for NGram field-->
<fieldType name="text_ngram" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\b \b" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\b \b" replacement=""/>
  </analyzer>
</fieldType>
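
Both chains in the schema above end up doing the same job: drop the spaces, keep the whole input as one token, and lowercase it. If that holds, a single analyzer shared by index and query time may be enough. The following is an untested sketch under that assumption, not the schema actually used:

```xml
<fieldType name="text_ngram" class="solr.TextField">
  <!-- An analyzer without a type attribute is used for both indexing and querying -->
  <analyzer>
    <!-- Remove single spaces that sit between two word characters, before tokenizing -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\b \b" replacement=""/>
    <!-- Keep the whole space-free input as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Either way, the query text has to reach the analyzer as one string, for example quoted as in name_ngram:"our G20 9NS". Depending on the query parser and its sow setting, an unquoted query may be split on whitespace before the analyzer sees it.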

I faced the same challenge and solved it like this without any custom code:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.FingerprintFilterFactory" separator="_" />
<filter class="solr.PatternReplaceFilterFactory" pattern="(_)" replacement="" replace="all"/>

The key point is to fingerprint the tokens with _ as the separator and then replace every _ with an empty string.
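
For context, here is how that filter chain might sit inside a complete field type. This is a sketch only: the name text_concat is hypothetical, the stop filter is left out for brevity, and the LowerCaseFilterFactory is carried over from the question rather than from the lines above.

```xml
<!-- Hypothetical field type name; adjust to your schema -->
<fieldType name="text_concat" class="solr.TextField">
  <!-- No type attribute: the same chain runs at index time and at query time -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits one token: the sorted, de-duplicated input tokens joined by "_" -->
    <filter class="solr.FingerprintFilterFactory" separator="_"/>
    <!-- Strip the separator so the pieces run together -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="_" replacement="" replace="all"/>
  </analyzer>
</fieldType>
```

One thing to watch: the fingerprint filter sorts and de-duplicates the tokens before joining them, so the merged token does not preserve the original word order. It is therefore unlikely to match a field indexed with the KeywordTokenizerFactory chain from the question, and it seems safest to apply the same chain on both the index and the query side, as sketched here.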

Hope it helps

ericminio