31

I'm using Lucene.net, but I am tagging this question for both .NET and Java versions because the API is the same and I'm hoping there are solutions on both platforms.

I'm sure other people have addressed this issue, but I haven't been able to find any good discussions or examples.

By default, Lucene is very picky about query syntax. For example, I just got the following error:

[ParseException: Cannot parse 'hi there!': Encountered "<EOF>" at line 1, column 9.
Was expecting one of:
    "(" ...
    "*" ...
    <QUOTED> ...
    <TERM> ...
    <PREFIXTERM> ...
    <WILDTERM> ...
    "[" ...
    "{" ...
    <NUMBER> ...
    ]
   Lucene.Net.QueryParsers.QueryParser.Parse(String query) +239

What is the best way to prevent ParseExceptions when processing queries from users? It seems to me that the most usable search interface is one that always executes a query, even if it might be the wrong query.

It seems that there are a few possible, and complementary, strategies:

  • "Clean" the query prior to sending it to the QueryParser
  • Handle exceptions gracefully
    • Show an intelligent error message to the user
    • Perhaps execute a simpler query, leaving off the erroneous bit

I don't really have any great ideas about how to implement any of those strategies. Has anyone else addressed this issue? Are there any "simple" or "graceful" parsers that I don't know about?

Winston Fassett

6 Answers

44

You can make Lucene ignore the special characters by sanitizing the query with something like:

query = QueryParser.Escape(query)

If you never want your users to use advanced syntax in their queries, you can always do this.

If you want your users to be able to use advanced syntax but also want to be more forgiving of mistakes, you should only sanitize after a ParseException has occurred.
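
To make the second option concrete, here is a minimal sketch of "parse raw first, escape only on failure". To keep it runnable without Lucene on the classpath, `Parser` is a hypothetical stand-in for `QueryParser`, and `escape` is a hand-written approximation of what `QueryParser.Escape` does. The list of special characters is an assumption based on the classic query syntax and may differ between Lucene versions, so prefer the library's own method in real code.

```java
/** Sketch: try the raw query first; escape special characters only if parsing fails. */
class ForgivingParse {

    /** Hypothetical stand-in for a parser such as Lucene's QueryParser. */
    interface Parser<Q> {
        Q parse(String text) throws Exception;
    }

    // Assumed set of characters the classic query syntax treats as special.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Approximation of QueryParser.Escape: put a backslash before each special character. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Advanced syntax works when it is valid; otherwise the input is treated as literal text. */
    static <Q> Q parseForgiving(Parser<Q> parser, String raw) throws Exception {
        try {
            return parser.parse(raw);
        } catch (Exception e) {
            // With Lucene you would catch ParseException here instead of Exception.
            return parser.parse(escape(raw));
        }
    }
}
```

With a real `QueryParser` you would pass something like `qp::parse` as the parser. Note that escaping does not neutralize operator keywords such as AND or OR, so the second parse can in principle still fail.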

ljorquera
  • I had the ParseException problem and I used this solution because my users won't use advanced syntax. Thanks ! – Costo Feb 27 '09 at 23:38
8

Well, the easiest thing to do would be to give the raw form of the query a shot, and if that fails, fall back to cleaning it up.

Query safe_query_parser(QueryParser qp, String raw_query)
  throws ParseException
{
  Query q;
  try {
    // First attempt: parse the query exactly as the user typed it.
    q = qp.parse(raw_query);
  } catch(ParseException e) {
    q = null;
  }
  if(q==null)
    {
      // Fallback: strip everything except word characters and whitespace.
      // consider changing this "" to " "
      String cooked = raw_query.replaceAll("[^\\w\\s]","");
      q = qp.parse(cooked);
    }
  return q;
}

This gives the raw form of the user's query a chance to run, but if parsing fails, we strip everything except letters, numbers, spaces and underscores; then we try again. We still risk throwing ParseException, but we've drastically reduced the odds.

You could also consider tokenizing the user's query yourself, turning each token into a term query, and glomming them together with a BooleanQuery. If you're not really expecting your users to take advantage of the features of the QueryParser, that would be the best bet. You'd be completely(?) robust, and users could search for whatever funny characters make it through your analyzer.
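
A rough sketch of the tokenizing step, using only the standard library so it runs on its own. The class and method names are made up for illustration, and splitting on whitespace plus lower-casing is only a stand-in for whatever your analyzer does at index time. The Lucene side is left as comments because the way a BooleanQuery is assembled differs between versions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch: turn raw user input into plain terms without going through QueryParser. */
class SimpleTermSplitter {

    /** Splits on whitespace and lower-cases each token; odd characters never cause an error. */
    static List<String> toTerms(String raw) {
        List<String> terms = new ArrayList<>();
        if (raw == null) {
            return terms;
        }
        for (String token : raw.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // With Lucene available, each term would then become something like
    //   new TermQuery(new Term(fieldName, term))
    // and be added to a BooleanQuery as an optional (SHOULD) clause.
}
```

Since no query syntax is involved, input like `hi there!` simply yields the terms `hi` and `there!`; whether the `!` ever matches anything depends on how the field was analyzed.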

Jay Kominek
3

FYI... Here is the code I am using for .NET

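// Requires: using System.Text.RegularExpressions;
//           using Lucene.Net.QueryParsers; using Lucene.Net.Search;
private Query GetSafeQuery(QueryParser qp, String query)
{
    Query q;
    try
    {
        // First attempt: parse the query exactly as entered.
        q = qp.Parse(query);
    }
    catch (ParseException)
    {
        q = null;
    }

    if (q == null)
    {
        // Fallback: replace everything except word characters, '.', '@' and '-' with a space.
        string cooked = Regex.Replace(query, @"[^\w\.@-]", " ");
        q = qp.Parse(cooked);
    }

    return q;
}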
Rey
josefresno
  • This answer basically copied a previous answer. – james.garriss Jun 26 '15 at 11:24
  • @james.garriss I know this thread is long dead but I had to say it. Although you might be right, but it helped me make sure that it will work as expected in C# too. Also, the Regex in this answer is more complete. :) – Rojan Gh. Nov 22 '17 at 10:45
1

I'm in the same situation as you.

Here's what I do. I do catch the exception, but only so that I can make the error look prettier. I don't change the text.

I also provide a link to an explanation of the Lucene syntax which I have simplified a little bit:
http://ifdefined.com/btnet/lucene_syntax.html
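
For what it's worth, "making the error look prettier" could be as simple as the sketch below. It assumes the exception message contains a "line N, column M" fragment like the one in the question's stack trace; the class name, method name and wording of the message are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: turn a raw ParseException message into something a user can read. */
class FriendlyParseError {

    // Matches the "line 1, column 9" fragment seen in the ParseException message.
    private static final Pattern POSITION = Pattern.compile("line (\\d+), column (\\d+)");

    static String describe(String userQuery, String exceptionMessage) {
        StringBuilder out = new StringBuilder("Sorry, we could not understand your search: ")
                .append(userQuery);
        Matcher m = POSITION.matcher(exceptionMessage == null ? "" : exceptionMessage);
        if (m.find()) {
            // Report only the column; queries typed into a search box are a single line.
            out.append(" (problem near character ").append(m.group(2)).append(")");
        }
        return out.toString();
    }
}
```

You would call it from the catch block with the original query text and the exception's message, and show the result next to the link to the syntax help.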

Corey Trager
1

I do not know much about Lucene.net. For general Lucene, I highly recommend the book Lucene in Action. For the question at hand, it depends on your users. There are strong reasons, such as ease of use, security and performance, to limit your users' queries. The book shows ways to parse the queries using a custom parser instead of QueryParser. I second Jay's idea about the BooleanQuery, although you can build stronger queries using a custom parser.

Yuval F
1

If you don't need all of Lucene's features, you might be better off writing your own query parser. It's not as complicated as it might seem at first.
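
To give an idea of the scale, here is a minimal sketch of a parser that supports only plain words and "quoted phrases" and never throws on malformed input. All names are hypothetical, and mapping the resulting clauses onto Lucene query objects (for example TermQuery and PhraseQuery) is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: a tiny, forgiving query parser that understands words and quoted phrases only. */
class TinyQueryParser {

    /** One parsed clause: its text and whether it came from a quoted phrase. */
    static final class Clause {
        final String text;
        final boolean phrase;

        Clause(String text, boolean phrase) {
            this.text = text;
            this.phrase = phrase;
        }
    }

    static List<Clause> parse(String input) {
        List<Clause> clauses = new ArrayList<>();
        if (input == null) {
            return clauses;
        }
        int i = 0;
        int n = input.length();
        while (i < n) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            if (c == '"') {
                int close = input.indexOf('"', i + 1);
                if (close > i + 1) {
                    // Balanced, non-empty phrase.
                    clauses.add(new Clause(input.substring(i + 1, close), true));
                    i = close + 1;
                } else if (close == i + 1) {
                    // Empty pair of quotes: skip both.
                    i = close + 1;
                } else {
                    // Unbalanced quote: ignore it and keep reading words.
                    i++;
                }
                continue;
            }
            int start = i;
            while (i < n && !Character.isWhitespace(input.charAt(i)) && input.charAt(i) != '"') {
                i++;
            }
            clauses.add(new Clause(input.substring(start, i), false));
        }
        return clauses;
    }
}
```

The design choice here is that bad input degrades instead of failing: a stray quote is dropped and the words after it are still searched.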

Stefan Schultze