32

I have a list of about 120 thousand english words (basically every word in the language).

I need a regular expression that would allow searching through these words using wildcards characters, a.k.a. * and ?.

A few examples:

  • if the user searches for m?st*, it would match for example master or mister or mistery.
  • if the user searches for *ind (any word ending in ind), it would match wind or bind or blind or grind.

Now, most users (especially the ones who are not familiar with regular expressions) know that ? is a replacement for exactly 1 character, while * is a replacement for 0, 1 or more characters. I absolutely want to build my search feature based on this.

My questions is: How do I convert what the user types (m?st* for example) to a regular expression ?

I searched the web (obviously including this website) and all I could find were tutorials that tried to teach me too much or questions that were somewhat similar, but not enough as to provide an answer to my own problem.

All I could figure out was that I have to replace ? with .. So m?st* becomes m.st*. However, I have no idea what to replace * with.

Any help would be greatly appreciated. Thank you.

PS: I'm totally new to regular expressions. I know how powerful they can be, but I also know they can be very hard to learn. So I just never took the time do to it...

Radu Murzea
  • 10,724
  • 10
  • 47
  • 69
  • 1
    possible duplicate of [Is there an equivalent of java.util.regex for "glob" type patterns?](http://stackoverflow.com/questions/1247772/is-there-an-equivalent-of-java-util-regex-for-glob-type-patterns) – NPE May 09 '12 at 16:55
  • 2
    Bear in mind that any *other* regex characters which might appear in your query will have to be escaped too. If someone types in `^\w..` you probably don't want to pass that through to your regular expression engine in its raw form – Gareth May 09 '12 at 16:57
  • @SoboLAN : can you please share the collection of words , i kind of need it to develop a dictionary for my requirement – Hussain Akhtar Wahid 'Ghouri' Feb 07 '13 at 09:15
  • @HussainAkhtarWahid I got them from the database of another program, I don't remember the link and I don't have it in my browser history anymore. However, I uploaded them here: http://www.2shared.com/file/elLSFPDx/dictionarywords.html . Each row in the file represents 1 word. This is the format: `word|definition1;definition2;definition3`. So the separators are `|` and `;`. Note: there can be any number of definitions (1, 2, 3 etc.). Hope this helps. Good luck. – Radu Murzea Feb 07 '13 at 09:24
  • seems like enough , thanx for the quick and sufficient response – Hussain Akhtar Wahid 'Ghouri' Feb 07 '13 at 09:28

9 Answers9

27

Unless you want some funny behaviour, I would recommend you use \w instead of .

. matches whitespace and other non-word symbols, which you might not want it to do.

So I would replace ? with \w and replace * with \w*

Also if you want * to match at least one character, replace it with \w+ instead. This would mean that ben* would match bend and bending but not ben - it's up to you, just depends what your requirements are.

gnomed
  • 5,483
  • 2
  • 26
  • 28
  • Question says "while `*` is a replacement for 0, 1 or more characters" – Gareth May 09 '12 at 17:00
  • 2
    @Gareth ya, i saw that. Just thought I would offer the extra info. – gnomed May 09 '12 at 17:01
  • @gnomed Why is `\w` better than `.` ? – Radu Murzea May 09 '12 at 17:06
  • 1
    @SoboLAN was literally editing my answer as you commented that :). See the edit above, but basically a `.` matches whitespace which I dont think you would want. – gnomed May 09 '12 at 17:08
  • Without a `?` symbol after the `\w*` to make `\w*?`, the expression becomes greedy, which may not be what he wants, and at the least could slow the search down tremendously. – Dan W Feb 22 '19 at 06:20
  • That comment might need some elaboration. It sounds like the OP wants greedy behaviour based on the description and the fact that the users are non-technical they won't even understand greedy vs non greedy. It sounds like they would expect `te*g` to match whole words starting with `te` and ending in `g` while `te\w*?g` matches it also matches any word starting with `te` that has a `g` *anywhere* in it which is probably unexpected behaviour in the use case described by the OP (if that is what the user wanted they would have said `te*g*`). Could use an anchor but then its basically greedy. – gnomed Feb 22 '19 at 16:22
9

Take a look at this library: https://github.com/alenon/JWildcard

It wraps all not wildcard specific parts by regex quotes, so no special chars processing needed: This wildcard:

"mywil?card*"

will be converted to this regex string:

"\Qmywil\E.\Qcard\E.*"

If you wish to convert wildcard to regex string use:

JWildcard.wildcardToRegex("mywil?card*");

If you wish to check the matching directly you can use this:

JWildcard.matches("mywild*", "mywildcard");

Default wildcard rules are "?" -> ".", "" -> ".", but you can change the default behaviour if you wish, by simply defining the new rules.

JWildcard.wildcardToRegex(wildcard, rules, strict);

You can use sources or download it directly using maven or gradle from Bintray JCenter: https://bintray.com/yevdo/jwildcard/jwildcard

Gradle way:

compile 'com.yevdo:jwildcard:1.4'

Maven way:

<dependency>
  <groupId>com.yevdo</groupId>
  <artifactId>jwildcard</artifactId>
  <version>1.4</version>
</dependency>
Ilya Serbis
  • 21,149
  • 6
  • 87
  • 74
Yevdo Abramov
  • 609
  • 4
  • 10
8

Replace ? with . and * with .*.

NPE
  • 486,780
  • 108
  • 951
  • 1,012
6

Here is a way to transform wildcard into regex:

  1. Prepend all special characters ([{\^-=$!|]}).+ with \ - so they are matched as characters and don't make user experience unexpected. Also you could enclose it within \Q (which starts the quote) and \E (which ends it). Also see paragraph about security.
  2. Replace * wildcard with \S*
  3. Replace ? wildcard with \S?
  4. Optionally: prepend pattern with ^ - this will enforce exact match with the beginning.
  5. Optionally: append $ to pattern - this will enforce exact match with the end.

    \S - stand for non-space character, which happens zero or more times.

Consider using reluctant (non-greedy) quantifiers if you have characters to match after * or +. This can be done by adding ? after * or + like this: \S*? and \S*+?

Consider security: user will send you code to run (because regex is kind of a code too, and user string is used as the regex). You should avoid passing unescaped regex to any other parts of application and only use to filter data retrieved by other means. Because if you do user can affect speed of your code by supplying different regex withing wildcard string - this could be used in DoS attacks.

Example to show execution speeds of similar patterns:

seq 1 50000000 > ~/1
du -sh ~/1
563M
time grep -P '.*' ~/1 &>/dev/null
6.65s
time grep -P '.*.*.*.*.*.*.*.*' ~/1 &>/dev/null
12.55s
time grep -P '.*..*..*..*..*.*' ~/1 &>/dev/null
31.14s
time grep -P '\S*.\S*.\S*.\S*.\S*\S*' ~/1 &>/dev/null
31.27s

I'd suggest against using .* simply because it can match anything, and usually things are separated with spaces.

Bohdan
  • 16,531
  • 16
  • 74
  • 68
3
  1. Replace all '?' characters with '\w'
  2. Replace all '*' characters with '\w*'

The '*' operator repeats the previous item '.' (any character) 0 or more times.

This assumes that none of the words contain '.', '*', and '?'.

This is a good reference

http://www.regular-expressions.info/reference.html

Mark Sholund
  • 1,273
  • 2
  • 18
  • 32
2

. is an expression that matches any one character, as you've discovered. In your hours of searching, you undoubtedly also stumbled across *, which is a repetition operator that when used after an expression matches the preceding expression zero or more times in a row.

So the equivalent to your meaning of * is putting these two together: .*. This then means "any character zero or more times".

See the Regex Tutorial on repetition operators.

Mark Peters
  • 80,126
  • 17
  • 159
  • 190
2

Replace * with .* (the regex equivalent of "0 or more of any character").

Amber
  • 507,862
  • 82
  • 626
  • 550
0
function matchWild(wild,name)
{
    if (wild == '*') return true;

    wild = wild.replace(/\./g,'\\.');
    wild = wild.replace(/\?/g,'.');
    wild = wild.replace(/\\/g,'\\\\');  
    wild = wild.replace(/\//g,'\\/');
    wild = wild.replace(/\*/g,'(.+?)');

    var re = new RegExp(wild,'i');
    return re.test(name);
}
Shell
  • 6,818
  • 11
  • 39
  • 70
Clif
  • 1
0

This is what I use:

String wildcardToRegex(String wildcardString) {
    // The 12 is arbitrary, you may adjust it to fit your needs depending
    // on how many special characters you expect in a single pattern.
    StringBuilder sb = new StringBuilder(wildcardString.length() + 12);
    sb.append('^');
    for (int i = 0; i < wildcardString.length(); ++i) {
        char c = wildcardString.charAt(i);
        if (c == '*') {
            sb.append(".*");
        } else if (c == '?') {
            sb.append('.');
        } else if ("\\.[]{}()+-^$|".indexOf(c) >= 0) {
            sb.append('\\');
            sb.append(c);
        } else {
            sb.append(c);
        }
    }
    sb.append('$');
    return sb.toString();
}

Special character list from https://stackoverflow.com/a/26228852/1808989.

Community
  • 1
  • 1
Andrew Sun
  • 4,101
  • 6
  • 36
  • 53