
I'm using Apache Tika to parse an XML document before indexing it with Apache Lucene.

This is the Tika part:

  // Write limit of 10 * 1024 * 1024 characters for the extracted text
  BodyContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
  Metadata metadata = new Metadata();
  ParseContext pcontext = new ParseContext();

  // XML parser; try-with-resources closes the stream once parsing is done
  try (FileInputStream inputstream = new FileInputStream(f)) {
    XMLParser xmlparser = new XMLParser();
    xmlparser.parse(inputstream, handler, metadata, pcontext);
  }

  return handler.toString(); // return the extracted plain text

I use StandardAnalyzer with a stop word list to tokenize my document:

  analyzer = new StandardAnalyzer(StandardAnalyzer.STOP_WORDS_SET); // using stop words

Can I discard numeric terms? I don't need them.

Thanks for your help.

Nicomedes E.
tommy
  • Similar [question](http://stackoverflow.com/questions/25714455/standardanalyzer-with-stemming) answered, hopefully covers your scenario? – mindas Feb 10 '15 at 12:41
  • `TokenStream ts = components.getTokenStream(); Set<String> filteredTypes = new HashSet<>(); filteredTypes.add("<NUM>"); TypeTokenFilter numberFilter = new TypeTokenFilter(Version.LUCENE_46, ts, filteredTypes);` He uses a TokenFilter to ignore numeric terms. Thanks for your help. – tommy Feb 10 '15 at 12:57
  • It's not really useful: I don't need the Porter stemming analyzer, I just need to filter out numeric terms. – tommy Feb 10 '15 at 13:39
  • Just ignore the stemmer part and only use code you pasted just above. – mindas Feb 10 '15 at 18:22
  • OK, I will try this, thanks. – tommy Feb 11 '15 at 10:34
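The comments above suggest filtering inside the analyzer with a `TypeTokenFilter`, whose constructor differs between Lucene versions. As a version-independent alternative (a different technique, not the one from the linked answer), the text returned by Tika could be pre-filtered before it reaches the analyzer. The sketch below assumes a hypothetical helper class `NumericTermFilter`, splits on whitespace only, and treats a term as numeric if it consists of digits with optional `.` or `,` separated groups. That is narrower than what `StandardTokenizer` classifies as a number, so a term with trailing punctuation such as `2015.` would be kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or Lucene.
public class NumericTermFilter {

    // Assumption: a "numeric term" is digits, optionally followed by
    // '.' or ',' separated digit groups (e.g. 46, 3.14, 1,326).
    private static final Pattern NUMERIC = Pattern.compile("\\d+([.,]\\d+)*");

    // Returns the text with purely numeric whitespace-separated terms removed.
    // Runs of whitespace are collapsed to single spaces.
    public static String dropNumericTerms(String text) {
        List<String> kept = new ArrayList<>();
        for (String term : text.trim().split("\\s+")) {
            if (!term.isEmpty() && !NUMERIC.matcher(term).matches()) {
                kept.add(term);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Mixed terms such as "v2" are kept because they are not purely numeric.
        System.out.println(dropNumericTerms("Lucene 46 indexes 1,326 xml files"));
    }
}
```

In the question's code this would wrap the Tika result, for example `return NumericTermFilter.dropNumericTerms(handler.toString());`. Filtering by token type inside the analyzer is likely the cleaner option when the Lucene version in use supports it, since it works on the tokens the analyzer actually produces.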
