
I wish to generate a regular expression from a string containing numbers, and then use this as a Pattern to search for similar strings. Example:

    String s = "Page 3 of 23";

If I replace every digit with \d:

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (Character.isDigit(c)) {
            sb.append("\\d"); // backslash d
        } else {
            sb.append(c);
        }
    }

    Pattern numberPattern = Pattern.compile(sb.toString());

    // equivalent to:
    // Pattern numberPattern = Pattern.compile("Page \\d of \\d\\d");

I can use this to match similar strings (e.g. "Page 7 of 47"). My problem is that if I do this naively, metacharacters such as ( ) { } - will not be escaped. Is there an exhaustive list of the characters that I must and must not escape in a regular expression? (I can try to extract them from the Javadocs, but I am worried about missing something.)

Alternatively, is there a library which already does this? (I don't, at this stage, want to use a full Natural Language Processing solution.)

NOTE: @dasblinkenlight's edited answer now works for me!

peter.murray.rust
  • Here's an answer to the "which characters" question; I'm not aware of any libraries that generate regexes, though: http://stackoverflow.com/questions/399078/what-special-characters-must-be-escaped-in-regular-expressions – Evan Knowles Apr 16 '13 at 10:16
  • @Evan thanks. I am only interested in Java so that looks like a useful resource. – peter.murray.rust Apr 16 '13 at 10:18

1 Answer


Java's regexp library provides this functionality:

    String s = Pattern.quote(orig);

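The "quoted" string has all of its metacharacters neutralized: Pattern.quote wraps the text in \Q and \E, and everything between those two markers is treated as a literal. First quote your string, then go through it and replace the digits with \d to make a regular expression. Because the quoting relies on \Q and \E, each \d you insert must be wrapped in the inverse pair, \E before it and \Q after it, so that it sits outside the quoted region and is interpreted as a regex token rather than as literal text.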
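
To see what quoting does on its own, before any digits are replaced, here is a minimal sketch. The class name `QuoteDemo`, the helper `matchesLiterally`, and the sample text `Sum (a+b)` are made up for illustration; only `Pattern.quote`, `Pattern.compile`, and `Pattern.matches` come from `java.util.regex`.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Hypothetical helper: true only if `candidate` is exactly `literal`,
    // with every metacharacter in `literal` treated as plain text.
    static boolean matchesLiterally(String literal, String candidate) {
        return Pattern.compile(Pattern.quote(literal)).matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String orig = "Sum (a+b)";

        // Pattern.quote wraps the text in \Q ... \E
        System.out.println(Pattern.quote(orig));             // \QSum (a+b)\E

        // Unquoted, the parentheses form a group and + is a quantifier,
        // so the pattern does not match its own source text.
        System.out.println(Pattern.matches(orig, orig));      // false
        System.out.println(Pattern.matches(orig, "Sum aab")); // true

        // Quoted, the text matches itself and nothing else.
        System.out.println(matchesLiterally(orig, orig));      // true
        System.out.println(matchesLiterally(orig, "Sum aab")); // false
    }
}
```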

One thing I would change in your implementation is the replacement algorithm: rather than replacing character-by-character, I would replace digits in groups. This would let an expression produced from Page 3 of 23 match strings like Page 13 of 23 and Page 6 of 8.

    String p = Pattern.quote(orig).replaceAll("\\d+", "\\\\E\\\\d+\\\\Q");

This produces "\QPage \E\d+\Q of \E\d+\Q\E" regardless of the page numbers and counts in the original string. The resulting pattern needs only one backslash in \d, not two, because it is passed to the regex engine at run time rather than written as a Java string literal that the compiler has to unescape.
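
Putting the two steps together, a small self-contained sketch might look like the following. The class name `PageRegexDemo`, the method name `toNumberPattern`, and the sample string with the `(draft)` suffix are illustrative assumptions; the quoting and replacement line is the one shown above.

```java
import java.util.regex.Pattern;

class PageRegexDemo {
    // Hypothetical helper: quote the whole string, then swap each run of
    // digits for \d+, stepping out of the \Q...\E region around it.
    static Pattern toNumberPattern(String orig) {
        String regex = Pattern.quote(orig).replaceAll("\\d+", "\\\\E\\\\d+\\\\Q");
        return Pattern.compile(regex);
    }

    public static void main(String[] args) {
        // The parentheses stay literal because they remain inside \Q...\E.
        Pattern p = toNumberPattern("Page 3 of 23 (draft)");

        System.out.println(p.pattern());
        // \QPage \E\d+\Q of \E\d+\Q (draft)\E

        System.out.println(p.matcher("Page 13 of 230 (draft)").matches()); // true
        System.out.println(p.matcher("Page 7 of 47 (draft)").matches());   // true
        System.out.println(p.matcher("Page x of 47 (draft)").matches());   // false
        System.out.println(p.matcher("Page 7 of 47 draft").matches());     // false
    }
}
```

Note that `\d+` inside `replaceAll` matches ASCII digits by default, whereas `Character.isDigit` in the question also accepts other Unicode digits, so the two approaches can differ on non-ASCII input.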

Sergey Kalinichenko