I am trying to use Solr to find exact matches on categories in a user search (e.g. "skinny jeans" in "blue skinny jeans")
. I am using the following type definition:
<fieldType name="subphrase" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="\ "
replacement="_"/>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory"
outputUnigrams="true"
outputUnigramsIfNoShingles="true"
tokenSeparator="_"
minShingleSize="2"
maxShingleSize="99"/>
</analyzer>
</fieldType>
The type will index categories without tokenizing, only replacing whitespace with underscores. But it will tokenize queries and shingle them (with underscores).
What I am trying to do is match the query shingles against the indexed categories. In the Solr Analysis page I can see that the whitespace/underscore replacement works on both index and query, and I can see that the query is being shingled correctly (screenshot below):
My problem is that in the Solr Query page, I cannot see shingles being generated, and I presume that as a result the category "skinny jeans" is not matched, but the category "jeans" is matched :(
This is the debug output:
{
"responseHeader": {
"status": 0,
"QTime": 1,
"params": {
"q": "name:(skinny jeans)",
"indent": "true",
"wt": "json",
"debugQuery": "true",
"_": "1464170217438"
}
},
"response": {
"numFound": 1,
"start": 0,
"docs": [
{
"id": 33,
"name": "jeans",
}
]
},
"debug": {
"rawquerystring": "name:(skinny jeans)",
"querystring": "name:(skinny jeans)",
"parsedquery": "name:skinny name:jeans",
"parsedquery_toString": "name:skinny name:jeans",
"explain": {
"33": "\n2.2143755 = product of:\n 4.428751 = sum of:\n 4.428751 = weight(name:jeans in 54) [DefaultSimilarity], result of:\n 4.428751 = score(doc=54,freq=1.0), product of:\n 0.6709952 = queryWeight, product of:\n 6.600272 = idf(docFreq=1, maxDocs=541)\n 0.10166174 = queryNorm\n 6.600272 = fieldWeight in 54, product of:\n 1.0 = tf(freq=1.0), with freq of:\n 1.0 = termFreq=1.0\n 6.600272 = idf(docFreq=1, maxDocs=541)\n 1.0 = fieldNorm(doc=54)\n 0.5 = coord(1/2)\n"
},
"QParser": "LuceneQParser"
}
}
It's clear that the parsedquery parameter does not display the shingled query. What do I need to do to complete the process of matching query shingles against indexed values? I feel like I am very close to cracking this problem. Any advice is appreciated!