Regular expressions allow the pattern-matching syntax shown below. I'm trying to build a powerful search tool that supports as many of these as possible. I'm told that edismax is the most flexible tool for the job. Which of the pattern-matching expressions below can be accomplished with edismax? Can I do better than edismax? Can you suggest which filters and parser patches I might use to work towards this functionality? Am I dreaming if I think Solr can achieve acceptable performance (i.e. server-side processing time) for these kinds of searches?

Regular expression syntax and examples from MySQL:

  1. ^ match beginning of string. 'fofo' REGEXP '^fo' => true
  2. $ match end of string. 'fo\no' REGEXP '^fo\no$' => true
  3. * 0-unlimited wildcard. 'Baaaan' REGEXP 'Ba*n' => true
  4. ? 0-1 wildcard. 'Baan' REGEXP '^Ba?n' => false
  5. + 1-unlimited wildcard. 'Bn' REGEXP 'Ba+n' => false
  6. | or. 'pi' REGEXP 'pi|apa' => true
  7. ()* sequence match. 'pipi' REGEXP '^(pi)*$' => true
  8. [a-dX], [^a-dX] character range/set. 'aXbc' REGEXP '[a-dXYZ]' => true
  9. {n} or {m,n} cardinality notation. 'abcde' REGEXP 'a[bcd]{3}e' => true
  10. [:character_class:] character class. 'justalnums' REGEXP '[[:alnum:]]+' => true
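
For comparison against whatever a search backend ends up supporting, the ten examples above can be reproduced with Python's `re` module as a rough baseline. This is only a sketch under assumptions: `re.search` stands in for MySQL's `REGEXP` (a match anywhere in the string), the names `CASES`, `mysql_like_match` and `check_cases` are made up for illustration, and item 10 is rewritten as `[A-Za-z0-9]+` because Python's `re` has no POSIX `[[:alnum:]]` class.

```python
import re

# (subject, pattern, expected) triples taken from the list above.
# Item 10 uses [A-Za-z0-9]+ in place of [[:alnum:]]+ (not supported by re).
CASES = [
    ("fofo", r"^fo", True),
    ("fo\no", "^fo\no$", True),
    ("Baaaan", r"Ba*n", True),
    ("Baan", r"^Ba?n", False),
    ("Bn", r"Ba+n", False),
    ("pi", r"pi|apa", True),
    ("pipi", r"^(pi)*$", True),
    ("aXbc", r"[a-dXYZ]", True),
    ("abcde", r"a[bcd]{3}e", True),
    ("justalnums", r"[A-Za-z0-9]+", True),
]


def mysql_like_match(subject, pattern):
    """Return True if the pattern matches anywhere in the subject."""
    return re.search(pattern, subject) is not None


def check_cases(cases=CASES):
    """Return (subject, pattern, agrees) for every case in the table."""
    return [
        (subject, pattern, mysql_like_match(subject, pattern) == expected)
        for subject, pattern, expected in cases
    ]
```

The same table of cases could be replayed against a Solr index to see which operators survive the move from a regex engine to a search query.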
ted.strauss

2 Answers

Version 4.0 of Lucene will support regex queries directly in the standard query parser using special syntax. I verified that it works on an instance of Solr I am running, built from the subversion trunk in February.

Jira ticket 2604 describes an extension of the standard query parser with special regex syntax, using forward slashes to delimit the regex, similar to regex literals in JavaScript. It seems to use the underlying RegexpQuery class.

So a brief example:

body:/[0-9]{5}/

will match on a five-digit zip code in the textual corpus I have indexed. But, oddly, body:/\d{5}/ did not work for me, and ^ failed as well.
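
A query like this can be sent to Solr as an ordinary `q` parameter once it is URL-encoded. The snippet below is a minimal sketch that only builds the request URL. The base URL `http://localhost:8983/solr/select`, the `wt=json` parameter and the helper name `regex_query_url` are assumptions for illustration, not something taken from the setup described here, and the escaping of a literal `/` inside the pattern as `\/` is likewise an assumption about the parser.

```python
from urllib.parse import urlencode


def regex_query_url(field, pattern, base="http://localhost:8983/solr/select"):
    """Build a select URL for a slash-delimited regex query on one field."""
    # Assumption: a literal "/" inside the pattern must be backslash-escaped,
    # because the slashes delimit the regex in the query syntax.
    escaped = pattern.replace("/", "\\/")
    params = {"q": f"{field}:/{escaped}/", "wt": "json"}
    # urlencode percent-encodes the characters that are unsafe in a URL.
    return f"{base}?{urlencode(params)}"
```

Calling `regex_query_url("body", "[0-9]{5}")` yields a URL whose `q` parameter decodes back to `body:/[0-9]{5}/`.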

The regex dialect would have to be Java's, but I'm not sure if everything in it works, since I have only done a cursory examination. One would probably have to look carefully at the RegexpQuery code to understand what works and what doesn't.
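
One plausible reading of the `\d` and `^` failures, suggested in the comments on this answer, is that the shorthand `\d` is not supported and that the pattern has to match the whole indexed term, so anchors are implied. Assuming that reading is right, a search-style pattern can be rewritten before it is sent. The helper names `to_term_regex` and `matches_whole_term` are hypothetical, the rewrite is purely textual, and `re.fullmatch` only imitates whole-term matching in Python's dialect, not Lucene's.

```python
import re


def to_term_regex(pattern):
    """Rewrite a search-style pattern into one meant to match a whole term."""
    # Naive textual rewrite: it does not handle an escaped backslash
    # before "d", nor \d used inside a [...] character set.
    body = pattern.replace(r"\d", "[0-9]")
    anchored_start = body.startswith("^")
    anchored_end = body.endswith("$") and not body.endswith(r"\$")
    if anchored_start:
        body = body[1:]
    if anchored_end:
        body = body[:-1]
    # Where the original had no anchor, allow arbitrary surrounding text.
    prefix = "" if anchored_start else ".*"
    suffix = "" if anchored_end else ".*"
    return f"{prefix}{body}{suffix}"


def matches_whole_term(term_regex, term):
    """Stand-in for whole-term matching, using Python's re.fullmatch."""
    return re.fullmatch(term_regex, term) is not None
```

Under these assumptions `^\d{5}$` becomes `[0-9]{5}`, while an unanchored `\d{5}` becomes `.*[0-9]{5}.*`.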

d-_-b
Ronald Wood
  • I dug a little further. There is a [page that describes the supported syntax](https://builds.apache.org/job/Lucene-trunk/javadoc/core/org/apache/lucene/util/automaton/RegExp.html). The regex engine is not Java's after all, but one implemented in Lucene in the org.apache.lucene.util.automaton package. See also the documentation for [RegexpQuery](https://builds.apache.org/job/Lucene-trunk/javadoc/core/org/apache/lucene/search/RegexpQuery.html). – Ronald Wood Mar 06 '12 at 01:19
  • Just tried `\d{4}` in Solr 4.0 on a string field. It does not work. Looks like we can only use `[0-9]{4}`. However I guess ^ is not needed, since any query like `/[0-9]{5}/` is actually equivalent to the Perl-Compatible RegEx `/^[0-9]{5}$/` i.e. not using `.*` as prefix means you are forcing the match from the first char. – arun Feb 17 '13 at 20:03
  • @RonaldWood Both links you posted are now dead. – BlackVegetable Jul 17 '13 at 23:51
  • The Lucene project moved its javadocs. Some of their own links are broken too. Try these updated links: [RegExp](http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/util/automaton/RegExp.html) and [RegexpQuery](http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/RegexpQuery.html) – Ronald Wood Aug 29 '13 at 19:38
  • ElasticSearch has a [good overview of the query syntax](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html#regexp-syntax) – Renaud Aug 27 '14 at 08:41

Regular expressions and (e)dismax are not really comparable. Dismax is meant to work directly with common end-user input, while regular expressions are not typical end-user input.

Also, matching regular-expression-like things with dismax depends largely on text analysis settings and schema design, not on dismax itself. With Solr you typically tailor the schema and text analysis to the concrete search need, possibly doing much of the work at index-time. Regular expressions are at odds with this and even with the basic structure of Lucene inverted indices.

Still, Lucene provides RegexQuery and the newer RegexpQuery. As far as I know, these are not integrated with Solr, but they could be. Start a new item in the Solr issue tracker and happy coding! :)

Keep in mind that regex queries will probably always be slow... but they could have acceptable performance in your case.
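
To make the tension with inverted indices concrete, here is a toy sketch in Python. It is a deliberate simplification and not how Lucene is implemented: as far as I understand, Lucene compiles the regex to an automaton and intersects it with a sorted term dictionary rather than scanning linearly. The names `build_index`, `term_query` and `regex_query` are made up for illustration, and whitespace splitting stands in for real text analysis.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercased, whitespace-separated term to the ids of docs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def term_query(index, term):
    """Exact term lookup: a single dictionary access."""
    return set(index.get(term, ()))


def regex_query(index, pattern):
    """Regex lookup: every distinct term has to be tested against the pattern."""
    compiled = re.compile(pattern)
    hits = set()
    for term, doc_ids in index.items():
        # The pattern must match the whole term, not a substring of it.
        if compiled.fullmatch(term):
            hits |= doc_ids
    return hits
```

In this toy model the cost of a regex query grows with the number of distinct terms in the field rather than with the number of documents, which is one way to think about when the performance might still be acceptable.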

Mauricio Scheffer