
I am working on a classification problem: using the Lucene API, I want to classify product reviews as positive, negative or neutral based on training data.

I am using an ArrayList of Review objects - "reviewList" that stores the attributes for each review while crawling the web pages.

The review attributes, which include "polarity" and "review content", are then indexed using the indexer. Thereafter, based on the indexed objects, I need to classify the remaining review objects. But while doing so, the query parser encounters an EOF in the "review content" of one review object and terminates.

The line causing the error is marked with a comment:

    IndexReader reader = IndexReader.open(FSDirectory.open(new File("index")));
    IndexSearcher searcher = new IndexSearcher(reader);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
    QueryParser parser = new QueryParser(Version.LUCENE_31, "Review", analyzer);

    int length = Crawler.reviewList.size();
    for (int i = 200; i < length; i++) {
        String true_class;
        double r_stars = Crawler.reviewList.get(i).getStars();

        if (r_stars < 2.0) {
            true_class = "-1";
        } else if (r_stars > 3.0) {
            true_class = "1";
        } else {
            true_class = "0";
        }

        String[] reviewTokens = Crawler.reviewList.get(i).getReview().split(" ");
        String parsedReview = "";

        int j;

        for (j = 0; j < reviewTokens.length; j++) {
            if (reviewTokens[j] != null) {
                if (!((reviewTokens[j].contains("-")) || (reviewTokens[j].contains("!")))) {
                    parsedReview += reviewTokens[j] + " ";
                }
            } else {
                break;
            }
        }

        Query query = parser.parse(parsedReview); // CAUSING ERROR!!

        TopScoreDocCollector results = TopScoreDocCollector.create(5, true);
        searcher.search(query, results);
        ScoreDoc[] hits = results.topDocs().scoreDocs;

I've filtered the text manually to remove the characters that seemed to be causing the error, and I also check whether the next string is null, but the error persists.

This is the error stack trace:

Exception in thread "main" org.apache.lucene.queryParser.ParseException: Cannot parse 'I made the choice ... be all "thumbs ': Lexical error at line 1, column 938.  Encountered: <EOF> after : "\"thumbs "
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:216)
at Sentiment_Analysis.Classification.classify(Classification.java:58)
at Sentiment_Analysis.Main.main(Main.java:17)
Caused by: org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1, column 938.  Encountered: <EOF> after : "\"thumbs "
at org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1229)
at org.apache.lucene.queryParser.QueryParser.jj_scan_token(QueryParser.java:1709)
at org.apache.lucene.queryParser.QueryParser.jj_3R_2(QueryParser.java:1598)
at org.apache.lucene.queryParser.QueryParser.jj_3_1(QueryParser.java:1605)
at org.apache.lucene.queryParser.QueryParser.jj_2_1(QueryParser.java:1585)
at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1280)
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1266)
at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1313)
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1266)
at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1226)
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:206)
... 2 more
Java Result: 1

Please help me solve this problem. I have been banging my head against this for hours now!

Reema

2 Answers

You should escape the double quote and other special characters with QueryParser.escape before parsing:

    Query query = parser.parse(QueryParser.escape(parsedReview));

As the QueryParser.escape Javadoc states:

Returns a String where those characters that QueryParser expects to be escaped are escaped by a preceding '\'.
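To make the effect of escaping concrete, here is a minimal, dependency-free sketch of what such an escape step does. It is not Lucene's source code: the class name `EscapeSketch` and the exact character set in `SPECIAL` are assumptions for illustration (the set is modelled on the special characters listed for the classic query syntax, and newer Lucene versions may treat additional characters as special). In real code, call `QueryParser.escape` as shown above.

```java
// Illustrative stand-in for an escape step; NOT Lucene's implementation.
final class EscapeSketch {
    // Assumed set of query-syntax characters for this sketch.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Returns a copy of s with a backslash placed before each special character.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Tail of the review from the stack trace: an unbalanced double quote.
        String review = "be all \"thumbs ";
        System.out.println(escape(review));
    }
}
```

With the quote escaped, the parser no longer sees the start of a phrase that never closes, which is what the `<EOF>` lexical error in the stack trace points to. Escaping also keeps tokens containing `-` or `!` in the query, whereas the manual filter in the question drops them.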

John Topley
Pau Kiat Wee
  • Thanks a ton! It was spot on.. :D – Reema Apr 21 '12 at 16:01
  • For those who use a more recent release (Lucene 4.6 for me), the `escape` function has been moved to the `QueryParserUtil` class. – Chunliang Lyu Jan 24 '14 at 11:32
  • I want to do this using the Solr library instead of the Lucene library. Any ideas? – Divyang Shah Apr 02 '15 at 06:00
  • @ChunliangLyu in Lucene 4.10.4 escape() is still in QueryParser (inherited from QueryParserBase), but there is also one in QueryParserUtil as you mention. I wonder what the difference is? – Superole Dec 04 '15 at 16:11
  • @Superole Yes you are right, the QueryParser inherits the method from QueryParserBase. I have checked the implementations [QueryParserBase](https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/queryparser/src/java/org/apache/lucene/queryparser/classic/QueryParserBase.java) and [QueryParserUtil](https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/QueryParserUtil.java) in the current revision, turns out they are exactly the same. So no functionality difference, perhaps some tiny little performance difference. – Chunliang Lyu Dec 05 '15 at 02:53
  • Is it considered a vulnerability if users can put in & parse arbitrary values that aren't escaped? – Aaron Esau Oct 22 '17 at 06:31

I recognise this problem.

Declaring the GROUP BY before the WHERE declaration works fine in Teradata, but throws an error while parsing.

To fix, move the GROUP BY declaration after the WHERE declaration.

WonderWorker