I am trying to use Solr to find exact matches on categories in a user search (e.g. "skinny jeans" in "blue skinny jeans"). I am using the following type definition:

<fieldType name="subphrase" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" 
                pattern="\ " 
                replacement="_"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" 
            outputUnigrams="true"
            outputUnigramsIfNoShingles="true"
            tokenSeparator="_"
            minShingleSize="2"
            maxShingleSize="99"/>
  </analyzer>
</fieldType>

At index time, the type stores each category as a single untokenized value, only replacing whitespace with underscores. At query time, it tokenizes the query on whitespace and shingles the tokens, joining them with underscores.

What I am trying to do is match the query shingles against the indexed categories. In the Solr Analysis page I can see that the whitespace/underscore replacement works on both index and query, and I can see that the query is being shingled correctly (screenshot below):

[Screenshot: successful whitespace modification on index, and shingle generation on query]

My problem is that in the Solr Query page, I cannot see shingles being generated, and I presume that as a result the category "skinny jeans" is not matched, but the category "jeans" is matched :(

This is the debug output:

{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "q": "name:(skinny jeans)",
      "indent": "true",
      "wt": "json",
      "debugQuery": "true",
      "_": "1464170217438"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "id": 33,
        "name": "jeans",
      }
    ]
  },
  "debug": {
    "rawquerystring": "name:(skinny jeans)",
    "querystring": "name:(skinny jeans)",
    "parsedquery": "name:skinny name:jeans",
    "parsedquery_toString": "name:skinny name:jeans",
    "explain": {
      "33": "\n2.2143755 = product of:\n  4.428751 = sum of:\n    4.428751 = weight(name:jeans in 54) [DefaultSimilarity], result of:\n      4.428751 = score(doc=54,freq=1.0), product of:\n        0.6709952 = queryWeight, product of:\n          6.600272 = idf(docFreq=1, maxDocs=541)\n          0.10166174 = queryNorm\n        6.600272 = fieldWeight in 54, product of:\n          1.0 = tf(freq=1.0), with freq of:\n            1.0 = termFreq=1.0\n          6.600272 = idf(docFreq=1, maxDocs=541)\n          1.0 = fieldNorm(doc=54)\n  0.5 = coord(1/2)\n"
    },
    "QParser": "LuceneQParser"
  }
}

It's clear that the parsedquery value does not contain the shingled query. What do I need to do to complete the process of matching query shingles against indexed values? I feel like I am very close to cracking this problem. Any advice is appreciated!

mils
  • Have you tried name:"skinny jeans"? – MatsLindh May 25 '16 at 12:08
  • Yes, nothing is returned, not even "jeans". This may be related to another question I raised at [link](https://stackoverflow.com/questions/37425263/solr-keywordtokenizerfactory-exact-match-for-multiple-words-not-working). As @Abhijit Bashetti mentioned, tokens do not work that way; they are unsequenced. In addition, I don't actually want it to work that way: I don't want to use quotes because I'm looking for a substring, and quoting would defeat the purpose. – mils May 25 '16 at 23:51

1 Answer

This is an incomplete answer, but it might be enough to get you moving.

1: You probably want outputUnigrams="false", so you don't match category "jeans" on query "skinny jeans".

2: You actually do want to search with quotes (as a phrase), or the field's analyzer won't ever see more than one token at a time to shingle.

3: It seems like you're trying to do the same thing as this person was: http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746

That thread looks like it led to the inclusion of the PositionFilterFactory: https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory

If you're using Solr < 5.0, try putting that at the end of your query time analysis and see if it works.
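As a rough, untested sketch of what that could look like, here is the query-time analyzer from the question with outputUnigrams switched to "false" (point 1) and the position filter appended. The other attribute values are carried over unchanged from the question. Whether this actually avoids the MultiPhraseQuery on your version is an assumption you should check with debugQuery.

```xml
<!-- Sketch for Solr 4.x only: PositionFilterFactory was removed in 5.0. Untested. -->
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- outputUnigrams="false": emit only shingles, so "skinny jeans" no longer yields "jeans" -->
  <!-- outputUnigramsIfNoShingles="true": a single-word query still produces a token -->
  <filter class="solr.ShingleFilterFactory"
          outputUnigrams="false"
          outputUnigramsIfNoShingles="true"
          tokenSeparator="_"
          minShingleSize="2"
          maxShingleSize="99"/>
  <!-- Last in the chain: sets the position increment of every token after the first to 0,
       so the shingles are treated as alternatives at one position -->
  <filter class="solr.PositionFilterFactory"/>
</analyzer>
```

Per point 2, the query would still need to be sent as a phrase, e.g. name:"blue skinny jeans", so that the whole string reaches this analyzer.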

Unfortunately, that filter factory was removed in 5.0. This is the only comment I've found about what to do instead: http://lucene.apache.org/core/4_10_0/analyzers-common/org/apache/lucene/analysis/position/PositionFilter.html

I played with autoGeneratePhraseQueries a little, but I have yet to find another way to prevent Solr from generating a MultiPhraseQuery.

randomstatistic