Can we use WhitespaceTokenizerFactory & StandardTokenizerFactory together to accept only a few specific symbols?

Question

In my scenario I need to use WhitespaceTokenizerFactory and StandardTokenizerFactory together. Is there any way to do that? My scenario looks like this:
1. I use WhitespaceTokenizerFactory so that searches for terms like C# or C++ work.
2. With this setup, however, if I search for SQL, (with the trailing comma), only documents containing the exact token SQL, are returned.
Expected result: the search query should be treated as SQL.

My schema.xml looks like this:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

score 0 · Answer 1 · answered Dec 15 '15 at 17:42


If you want to use two different tokenization schemes, you should copy the content into multiple fields, each with the analysis setup you want. Solr makes this easy with its copyField directive.

So you could have fieldTypes defined:

<fieldType name="text_general" class="solr.TextField"  positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" preserveOriginal="1" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
</fieldType>
<fieldType name="text_standard" class="solr.TextField"  positionIncrementGap="100">
    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
</fieldType>

And then define a copyField such as:

<copyField source="myTextField_whitespace" dest="myTextField_standard" />
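
The copyField above only works if both fields are declared in schema.xml. A minimal sketch of those declarations, assuming the field names from the copyField line and the two fieldTypes defined earlier; the `indexed` and `stored` settings are illustrative assumptions, not something taken from the question:

```xml
<!-- Hypothetical field declarations; the names match the copyField above. -->
<!-- Source field: whitespace tokenization, so tokens such as C++ and C# survive. -->
<field name="myTextField_whitespace" type="text_general" indexed="true" stored="true"/>
<!-- Copy target: standard analysis, so punctuation such as a trailing comma is dropped. -->
<!-- It only needs to be searchable, so it is not stored here (an assumption). -->
<field name="myTextField_standard" type="text_standard" indexed="true" stored="false"/>
```

copyField copies the raw input text before analysis, so each field runs its own analyzer on the same content. Existing documents need to be re-indexed after the schema change.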

answered Dec 15 '15 at 17:42

femtoRgon


I have created the copy field and re-indexed, but it did not help: I am still not able to search for special symbols in the copy field.
Query params:
"params": { "indent": "true", "q": "*:*", "_": "1450672539308", "wt": "json", "fq": "FEEDBACK2:\"C++\"" } – Sravanthi Dec 21 '15 at 04:32
Looks like feedback2 is using standard analysis, so of course it won't match. You need to use the field appropriate to the use case, or you could always query both fields. – femtoRgon Dec 21 '15 at 05:02
I need a field that works for both **C++/C#** and **SQL,** (note the comma). If I use only WhitespaceTokenizerFactory, the text is tokenized on whitespace, so a search for **SQL,** matches only **SQL,** and not **SQL**. – Sravanthi Dec 21 '15 at 06:26
That's why I recommended querying both fields. – femtoRgon Dec 21 '15 at 06:28
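
One way to query both fields without naming them in every request is to list them in `qf` for the edismax parser. This is a rough sketch, assuming the hypothetical field names myTextField_whitespace and myTextField_standard from the answer and a handler named `/select`; adjust both to the actual schema and solrconfig.xml:

```xml
<!-- solrconfig.xml fragment; the handler name and field names are assumptions. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- edismax searches every field listed in qf, each with its own query analyzer. -->
    <str name="defType">edismax</str>
    <!-- Whitespace field keeps C++ and C# intact; standard field strips the comma from "SQL,". -->
    <str name="qf">myTextField_whitespace myTextField_standard</str>
  </lst>
</requestHandler>
```

The same idea can be written as an explicit query, for example `myTextField_whitespace:"SQL," OR myTextField_standard:"SQL,"`. Keep in mind that standard analysis reduces C++ and C# to the single token c, so matches coming from the standard field are looser for those terms.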
