I'm using Apache Tika to parse XML documents before indexing them with Apache Lucene.
This is the Tika part:
// Limit the extracted text to about 10M characters
BodyContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
Metadata metadata = new Metadata();
ParseContext pcontext = new ParseContext();
// XML parser
XMLParser xmlparser = new XMLParser();
// try-with-resources closes the stream even if parsing fails
try (FileInputStream inputstream = new FileInputStream(f)) {
    xmlparser.parse(inputstream, handler, metadata, pcontext);
}
return handler.toString(); // return the extracted plain text
I use StandardAnalyzer with a stop word list to tokenize my documents:
analyzer = new StandardAnalyzer(StandardAnalyzer.STOP_WORDS_SET); // using stop words
Can I discard numeric terms during analysis? I don't need them in the index.
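To make clear what I mean by numeric terms, here is a rough plain-Java sketch of the filtering I'm after. It does not use Lucene at all; the class and method names are made up for illustration, and the pattern (digits, optionally grouped by `.` or `,`) is only my assumption of what should count as numeric.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene or Tika: illustrates the desired behaviour only.
public class NumericTermFilter {

    // Assumption: a term is "numeric" if it is digits, optionally separated by '.' or ','
    // (for example 42, 3.14 or 1,000). Mixed terms such as "tika2" are kept.
    static boolean isNumeric(String term) {
        return term.matches("[0-9]+([.,][0-9]+)*");
    }

    // Returns a new list containing only the non-numeric terms, in their original order.
    static List<String> dropNumericTerms(List<String> terms) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            if (!isNumeric(term)) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apache", "42", "lucene", "3.14", "tika2");
        System.out.println(dropNumericTerms(terms));
    }
}
```

What I'm looking for is a way to get the same effect inside the analyzer, so that these terms never reach the index.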
Thanks for your help.