Regex for Gmail-like search

Question

I've been trying to figure out a regex for Gmail-like search, i.e.:

name:Joe surname:(Foo Bar)

...like in this topic. But with a slight difference: if there is a text without a key:, it's also split, so:

foo:(hello world) bar:(-{bad things}) some text to search

would return:

foo:(hello world)
bar:(-{bad things})
some text to search

Are you really limited to regex, here? Regex is great because it offers a simple solution for a lot of string problems, and it often seems the solution for any such problem. But [appearances can be deceiving](http://ex-parrot.com/~pdw/Mail-RFC822-Address.html). — cheeken, May 25 '12 at 14:21
@Eitan - Good edit, but you shouldn't have removed the `.net` tag - not all regular expression engines are the same, and people may offer a non-regex solution in the same language (like Chris's answer). In general, when people use the `regex` tag, we encourage them to also add the language. — Kobi, May 26 '12 at 08:18
@Kobi, I didn't see any reference to .NET in the question so I decided to remove the tag. I appreciate the heads up though, thanks. — Eitan T, May 26 '12 at 10:25

Chris S · Answer 1 · 2012-05-27T02:35:27.283

The problem you face with going the Regex route is you run into issues with spaces. There is probably a really complicated Regex to do this, but for a simple regex you'll find your searches can't contain spaces for keywords, e.g.:

Works: site:mysite user:john
Fails: site:"my awesome site" user:john

This will fail because it is tokenizing based on spaces. So if space support is a requirement read on...

I would recommend either using the Lucene .NET engine's inbuilt parser to give you the tokens, or using a grammar and a parser such as GoldParser, Irony or Antlr.

It might sound too long and complicated for what you want, but having written a grammar for GoldParser to do exactly what you're doing, it is actually quite easy once the grammar is done. Here's an example of the grammar:

"Name"     = 'Spruce Search Grammar'
"Version"  = '1.1'
"About"    = 'The search grammar for Spruce TFS MVC frontend'

"Start Symbol" = <Query>

! -------------------------------------------------
! Character Sets
! -------------------------------------------------
{Valid} = {All Valid} - ['-'] - ['OR'] - {Whitespace} - [':'] - ["] - ['']
{Quoted} = {All Valid} - ["] - ['']

! -------------------------------------------------
! Terminals
! -------------------------------------------------
AnyChar    = {Valid}+
Or = 'OR'
Negate = ['-']
StringLiteral   = '' {Quoted}+ '' | '"' {Quoted}+ '"'

! -- Field-specific terms
Project     = 'project' ':'
...
CreatedOn   = 'created-on' ':'
ResolvedOn  = 'resolved-on' ':'
! -------------------------------------------------
! Rules
! -------------------------------------------------

! The grammar starts below
<Query> ::= <Query> <Keywords> | <Keywords>
<SingleWord> ::= AnyChar

<Keywords> ::= <SingleWord>
              | <QuotedString> 
              | <Or> 
              | <Negate> 
              | <FieldTerms>

<Or> ::= <Or> <SingleWord> 
        | Or Negate
        | Or <SingleWord>
        | Or <QuotedString>

<Negate> ::= <Negate> Negate <SingleWord>
            | <Negate> Negate <QuotedString>
            | Negate <SingleWord>
            | Negate <QuotedString>

<QuotedString> ::= StringLiteral

<FieldTerms> ::= <FieldTerms> Project | <FieldTerms> Description | <FieldTerms> State 
                | <FieldTerms> Type | <FieldTerms> Area | <FieldTerms> Iteration 
                | <FieldTerms> AssignedTo | <FieldTerms> ResolvedBy 
                | <FieldTerms> ResolvedOn | <FieldTerms> CreatedOn
                | Project 
                | <Description>
                | State 
                | Type 
                | Area 
                | Iteration 
                | CreatedBy
                | AssignedTo 
                | ResolvedBy
                | CreatedOn
                | ResolvedOn

<Description> ::= <Description> Description | <Description> Description StringLiteral
                | Description | Description StringLiteral

This gives you search support for something like:

resolved-by:john project:"amazing tfs project"

If you look at the Keywords token, you can see it is expecting a singleword, an OR, a quoted string, or a negative (a NOT). The hard part comes when this definition becomes recursive, which you see in the <Description> part.

The syntax is called EBNF which describes the format of your language. You can write something as simple as a search query parser in it, or an entire computer language. The way Goldparser parses the tokens will restrict you, as it looks ahead for tokens (LALR), so languages such as HTML and Wiki syntax will break any grammar you attempt to write, as these formats don't force you to close tags/tokens. Antlr gives you LL(*) which is more forgiving of missing start tags/tokens but isn't something you don't need to worry about for a search query parser.

The code folder for my grammar and C# code can be found in this project.

QueryParser is the class that parses the search string, the grammar file is the .grm file, the 2mb file is how Goldparser optimises your grammar to basically create its own table of possibilities. Calitha is the C# library for GoldParser, and is easy enough to implement. Without writing an even larger answer it's hard to describe exactly how it's done, but it's fairly straightforward once you have compiled the grammar, and Goldparser has a very intuitive IDE for writing grammars with and a huge set of existing ones such as SQL, C#, Java and even a Perl regex I believe.

It's not a 1 hour quick fix as you'd get from a regex though, closer to 2-3 days, however you do learn the 'proper' way of parsing.

although is't not the answer I was looking for I will give it a try and see if it works for me — Jacob, May 25 '12 at 12:58
Beautiful answer. It is very easy to parrot "use a parser", but you've showed how it's done. — Kobi, May 25 '12 at 20:24

ohaal · Accepted Answer · 2012-05-29T00:44:28.873

There's no way to grab everything you need using a single regex. The problem is that there is not a reliable way of grabbing the non-keyworded text.

However, if we first grab and store all the keyworded text, then do a regex replace (using the same regular expression) with an empty string, we suddenly get the search string by itself!

Grab the keywords and related text using the following regex (see it on RegExr):
```
([a-zA-Z]+:(?:\([^)]+?\)|[^( ]+))
```
Then do a regex replace, with the same regex, on the full search string using an empty string. The resulting string will be the non-keyworded search text. Something along the lines of:
```
Regex.Replace(searchtext, @"[a-zA-Z]+:(?:\([^)]+?\)|[^( ]+)", "");
```
Perform whitespace trimming at start and end of the search text
Remove double (or more spaces) from search text (can be done with regex replace, replacing with a single space):
```
Regex.Replace(searchtext, @" {2,}", " ");
                            ^-- notice the space :)
```
????
PROFIT!!!

It is entirely possible to perform whitespace removal in the regex in #2, but when dealing with regexes, I tend to prefer keeping it as clean as possible.

Winner, beacouse it contains the exact regex I was looking for. In addition to this answer I will add that simple Regex Split with "([a-zA-Z]+:(?:\([^)]+?\)|[^( ]+))" did the job. Thanks @ohaal, profit indeed ! — Jacob, May 29 '12 at 11:23

score 0 · Answer 3 · edited May 23 '17 at 12:00

You may like to take a look at this question.

It contains the following Regex sample:

^((?!hede).)*$

As the author of the answers states "The regex above will match any string, or line without a line break, not containing the (sub) string 'hede'. "

Therefore you should be able to combine this with the information from the topic you posted and the above piece of Regex to solve your problem.

Hope this helps!!!

score 0 · Answer 4 · answered May 25 '12 at 12:29

This might work for you

In Java:

p = Pattern.compile("(\\w+:(\\(.*?\\))|.+)\\s*");
m = p.matcher("foo:(hello world) bar:(-{bad things}) some text to search");
while(m.find()){
    Log.v("REGEX", m.group(1));
}

Produces:

05-25 15:21:06.242: V/REGEX(18203): foo:(hello world)
05-25 15:21:08.061: V/REGEX(18203): bar:(-{bad things})
05-25 15:21:09.761: V/REGEX(18203): some text to search

The regex works as long as the tags are first and the free text is last.
Even for tags you can get the content with m.group(2)

Kobi · Answer 5 · 2012-05-25T22:04:33.407

A simple approach here is to match the string with this pattern:

\w+:(?:\([^)]*\)|\S+)|\S+

That would match:

\w+: - a key.
(?:) - followed by...
- \([^)]*\) - parentheses
- | - or
- \S+ - some characters that are not space.
|\S+ - or just match a single word.

Note that this pattern breaks words into different matches. If you really can't handle that, you may use something like |(?:\S+(\s+(?!\w*:)[^\s:]+)*) instead of the last |\S+.

Working example: http://ideone.com/bExFd

score 0 · Answer 6 · answered May 25 '12 at 21:57

Another option, a little more robust:
Here we can use a somewhat advanced feature of .Net patterns - they keep all captures of all groups. This is a useful feature to build a full parser. Here, I've included some other search features like quoted string and operators (OR or range .., for example):

\A
(?>
    \s                      # skip over spaces.
    |
    (?<Key>\w+):            # Key:
    (?:                     # followed by:
        \(                     
        (?<KeyValue>[^)]*)      # Parentheses
        \)
        |                       # or
        (?<KeyValue>\S+)        # a single word
    )
    |
    (?<Operator>OR|AND|-|\+|\.\.)
    |
    ""(?<Term>[^""]*)""     # quoted term
    |
    (?<Term>\w+)            # just a word
    |
    (?<Invalid>.)           # Any other character isn't valid
)*
\z

You can now easily get all tokens and their positions (you can also zip the Key and KeyValue captures to pair them):

Regex queryParser = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);
Match m = queryParser.Match(query); // single match!
// ...
var terms = m.Groups["Term"].Captures;

Working example: http://ideone.com/B7tln

I've decided to split these into two answers - they are two different solutions. My other answer should work on every language, but this one is unique to .Net, as far as I know. — Kobi, May 25 '12 at 21:58

score 0 · Answer 7 · edited May 23 '17 at 12:06

You don't need to solve this problem using only one regular expression. You can re-use the answer that you linked to that you indicated would partially work.

The last array element is the only one that needs to be corrected.

Using your example you'd initially get:

[
    "foo:(hello world)",
    "bar:(-{bad things}) some text to search"
]

The last item needs to be split into text up to and including the first closing bracket and text following it. You'd then replace the last item with the text up to and including the bracket and then you'd append the text following it to the array.

[
    "foo:(hello world)",
    "bar:(-{bad things})",
    "some text to search"
]

The following pseudo code should explain how this can be done:

array; // Array returned when string was split using /\s+(?=\w+:)/
lastPosition = array.length-1;

lastElem = array[lastPosition]; // May contain text without a key

// Key is followed by an opening bracket
//  (check for opening bracket after semi-colon following key)
if ( lastElem.match( /^[^:]*:(/ ) ) {
    // Need to replace array entry with key and all text up to and including
    // closing bracket.
    // Additional text needs to be added to array.

    maxSplitsAllowed = 1;
    results = lastElem.split( /)\w*/ , maxSplitsAllowed );
    // White space following the bracket was included in the match so it
    //  wouldn't be at the front of the text without a key

    lastKeyAndText = results[0] + ')'; // Re-append closing bracket
    endingTextWithoutKey = results[1];

    array[lastPosition] = lastKeyAndText; // Correct array entry for last key
    array.append( endingTextWithoutKey ); // Append text without key

// Key is not followed by a closing bracket but has text without a key
//  (check for white space following characters that aren't white space
//   characters)
} else if (lastElem.match( /^[^:]*:[^\w]*\w/ )) {
    // Need to change array entry so that all text before first space
    // becomes the key.
    // Additional text needs to be added to array.

    maxSplitsAllowed = 1;
    results = lastElem.split( /\w+/ , maxSplitsAllowed );

    lastKeyAndText = results[0];
    endingTextWithoutKey = results[1];

    array[lastPosition] = lastKeyAndText; // Correct array entry for last key
    array.append( endingTextWithoutKey ); // Append text without key
}

I assumed that brackets are required when white space characters are to be included within text that follows a key.

Regex for Gmail-like search

7 Answers7