
I want to run some text through the Lucene query parser to carry out basic text preprocessing. I used the following lines of code:

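import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;

Analyzer analyzer = new EnglishAnalyzer();
QueryParser parser = new QueryParser("", analyzer);
String text = "...";
String ret = parser.parse(QueryParser.escape(text)).toString();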

But I am getting an error:

Exception in thread "main" org.apache.lucene.queryparser.classic.ParseException: Cannot parse '': Encountered "<EOF>" at line 1, column 0.
Krishnendu Ghosh
  • What version of lucene are you using? What is the empty parameter in the QueryParser? – Federico Piazza Sep 01 '16 at 16:58
  • I am using Lucene 6.1.0. The empty parameter is a String. If I pass a value such as "val", then the text "how to get the `` your battery is broken '' message to go away" comes out of preprocessing as "val:how val:get val:your val:batteri val:broken val:messag val:go val:awai". I don't want the "val:" prefix in the preprocessed line, so I left it blank (""). – Krishnendu Ghosh Sep 01 '16 at 17:08
  • @FedericoPiazza: Is it because of the length of the string? The string I am getting the error for is a very long one. – Krishnendu Ghosh Sep 01 '16 at 17:20
  • Your code doesn't throw an exception for me, but does generate an empty string (expectedly). Is this where the exception is being thrown? Your exception is caused by attempting to parse an empty string, so perhaps this result is being reparsed somewhere? – femtoRgon Sep 01 '16 at 17:45
  • Try to re-index your data. – Mehdi Dehghani Sep 01 '16 at 20:34

2 Answers


QueryParser.escape() escapes the special characters. However, it doesn't escape

AND, NOT, OR

which are keywords in the Lucene query syntax.

There are two ways to deal with it:

  1. Replace AND, NOT, OR in the query string.
  2. Convert the query string to lower case.

Converting to lower case resolves the issue because only the capitalized AND, NOT, OR are keywords. In lower case they are treated as regular words.
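
A minimal sketch that combines the two options: it lowercases only the whole-word keywords AND, OR and NOT and leaves the rest of the text as it is. The class name `OperatorNeutralizer` and the method `neutralize` are hypothetical helpers written for illustration, not part of Lucene. The sketch uses only the Java standard library (Java 9 or later) and assumes the result is then passed to QueryParser.escape() and parser.parse() as in the question.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not a Lucene class.
final class OperatorNeutralizer {
    // Case-sensitive, whole-word match for the capitalized boolean keywords.
    private static final Pattern OPERATORS = Pattern.compile("\\b(AND|OR|NOT)\\b");

    // Lowercases each matched keyword so it reads as an ordinary word.
    static String neutralize(String text) {
        return OPERATORS.matcher(text)
                .replaceAll(m -> m.group().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Only the standalone keyword changes; other words are untouched.
        System.out.println(neutralize("batteries NOT included"));
    }
}
```

Because the pattern matches whole words only, a word such as "ANDROID" is left alone. Lowercasing the entire string (option 2) is simpler and is probably enough when the analyzer lowercases tokens anyway.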

Joyson

For those who face this problem: I realized that my parser throws an exception for the word "NOT", even after escaping. I had to manually replace it with another word.

Digao