Lucene 4 - How to discard numeric terms in index?

Asked Feb 10 '15 at 12:09

Active May 01 '17 at 14:59

Viewed 83 times

I'm using Apache Tika to parse an XML document before indexing it with Apache Lucene.

This is the Tika part:

  // Cap the extracted text at 10*1024*1024 characters
  BodyContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
  Metadata metadata = new Metadata();
  ParseContext pcontext = new ParseContext();

  // XML parser; try-with-resources closes the stream once parsing is done
  XMLParser xmlparser = new XMLParser();
  try (InputStream inputstream = new FileInputStream(f)) {
      xmlparser.parse(inputstream, handler, metadata, pcontext);
  }

  return handler.toString(); // return the extracted plain text

I use StandardAnalyzer with a stop word list to tokenize my document:

 analyzer = new StandardAnalyzer(StandardAnalyzer.STOP_WORDS_SET);  // using stop words

Can I discard numeric terms so they are not indexed? I don't need them.

Thanks for your help.

edited May 01 '17 at 14:59

Nicomedes E.

asked Feb 10 '15 at 12:09

tommy

A similar [question](http://stackoverflow.com/questions/25714455/standardanalyzer-with-stemming) has been answered; hopefully it covers your scenario? – mindas Feb 10 '15 at 12:41
`TokenStream ts = components.getTokenStream(); Set<String> filteredTypes = new HashSet<>(); filteredTypes.add("<NUM>"); TypeTokenFilter numberFilter = new TypeTokenFilter(Version.LUCENE_46, ts, filteredTypes);` He uses a TokenFilter to ignore numeric terms. Thanks for your help. – tommy Feb 10 '15 at 12:57
It's not really useful: I don't need the Porter stemming analyzer, I just need to filter out numeric terms. – tommy Feb 10 '15 at 13:39
Just ignore the stemmer part and only use the code you pasted just above. – mindas Feb 10 '15 at 18:22
OK, I will try this, thanks. – tommy Feb 11 '15 at 10:34
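
For readers following the comments, here is a minimal plain-Java sketch of the idea behind filtering by token type. It does not use Lucene: `TypeFilterSketch`, `Token`, `tokenize` and `filterByType` are hypothetical names made up for illustration, and the tokenizer is far simpler than Lucene's `StandardTokenizer` (for example, it splits "3.14" into two tokens). It only assumes that each token carries a type label and that digit-only tokens are labelled `<NUM>`, which is the type string `StandardTokenizer` uses for numeric tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TypeFilterSketch {

    /** Hypothetical token: a term plus a type label (not a Lucene class). */
    public static final class Token {
        public final String term;
        public final String type;

        public Token(String term, String type) {
            this.term = term;
            this.type = type;
        }
    }

    /**
     * Splits on anything that is not a letter or decimal digit, lowercases
     * each piece, and labels digit-only pieces "<NUM>", everything else
     * "<ALPHANUM>". This is a simplification, not StandardTokenizer's grammar.
     */
    public static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (part.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            boolean numeric = part.chars().allMatch(Character::isDigit);
            tokens.add(new Token(part.toLowerCase(Locale.ROOT),
                                 numeric ? "<NUM>" : "<ALPHANUM>"));
        }
        return tokens;
    }

    /** Keeps only the terms whose type is NOT in stopTypes. */
    public static List<String> filterByType(List<Token> tokens, Set<String> stopTypes) {
        List<String> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (!stopTypes.contains(t.type)) {
                kept.add(t.term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopTypes = new HashSet<>(Arrays.asList("<NUM>"));
        List<Token> tokens = tokenize("Invoice 2015 for item abc123, qty 42");
        // prints [invoice, for, item, abc123, qty]
        System.out.println(filterByType(tokens, stopTypes));
    }
}
```

Note that a mixed token such as "abc123" is kept, because only tokens made up entirely of digits get the `<NUM>` label here. If such terms should go as well, a type-based filter alone would probably not be enough.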
