1

I am having trouble indexing item names that contain numbers and symbols. A sample of my data is shown below:

ANGLE BARS   ORANGE - 4.0MM 2 - 1/2"
B.I SQUARE TUBING     2" X 3"
B.I. PIPE S-40   10MM 3/8"
B.I SQUARE TUBING     1" X 2"
PLYWOOD   MARINE 3/4X4X8
PLYWOOD   STA. CLARA 1/8X4X8
PLYWOOD   STA. CLARA 3/16X4X8

I want to tokenize my data on whitespace or trailing spaces without dropping the symbols, because the symbols are essential. Searching for "plywood sta. clara", "b.i square 2" X 3"", or "angle orange 2 - 1/2" should then return a result. I tried WhitespaceAnalyzer, but the symbols are dropped. I also tried StandardAnalyzer, but stop words and symbols are dropped as well. What is the best analyzer to use instead?

maccramers

2 Answers

3

You can use PatternAnalyzer with a regular expression of your own, or create a custom Analyzer.

Parvin Gasimzade
0

Try using org.apache.lucene.analysis.miscellaneous.PatternAnalyzer. You can supply a regular expression that defines the token delimiters.
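
To illustrate what a delimiter regular expression does to the sample data, here is a minimal sketch that uses only java.util.regex, not Lucene. The class and method names (DelimiterTokenizerSketch, tokenize) are hypothetical, and the lowercasing step is an assumption added so that a query like "plywood sta. clara" would match the upper-case item names. It splits on runs of whitespace and keeps every symbol inside the tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: shows which tokens a
// whitespace delimiter pattern produces for the sample item names.
class DelimiterTokenizerSketch {
    // One or more whitespace characters act as the token delimiter.
    static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // trim() removes leading and trailing spaces before splitting.
        for (String piece : WHITESPACE.split(text.trim())) {
            if (!piece.isEmpty()) {
                // Lowercasing is an assumption, for case-insensitive search.
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("PLYWOOD   STA. CLARA 3/16X4X8"));
        System.out.println(tokenize("B.I SQUARE TUBING     2\" X 3\""));
    }
}
```

Symbols such as ".", "/" and the inch mark stay attached to their tokens, so "sta." and "3/16x4x8" remain searchable terms. A pattern like `\\s+` is the kind of delimiter expression that could be handed to PatternAnalyzer; the constructor arguments depend on the Lucene version and are not shown here.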

z12345