15

I have this:

import java.util.regex.*;

String regex = "(?<m1>(hello|universe))|(?<m2>(hello world))";
String s = "hello world";

Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(s);
while(matcher.find()) {
  MatchResult matchResult = m.toMatchResult();
  String substring = s.substring(matchResult.start(), matchResult.end());
  System.out.println(substring);
}

The above only prints hello whereas I want it to print hello world.

One way to fix this is to re-order the groups in String regex = "(?<m2>(hello world))|(?<m1>(hello|universe))" but I don't have control over the regex I get in my case...

So what is the best way to find the longest match? An obvious way would be to check all possible substrings of s as mentioned here (Efficiently finding all overlapping matches for a regular expression) by length and pick the first but that is O(n^2). Can we do better?

Community
  • 1
  • 1
pathikrit
  • 32,469
  • 37
  • 142
  • 221
  • 4
    Use `"(?hello world)|(?hello|universe)"`, put the longest alternative branch before the shortest. – Wiktor Stribiżew Feb 28 '17 at 00:35
  • 2
    @WiktorStribiżew: You did not read my question. I clearly said that is an alternative that I cannot pursue. – pathikrit Feb 28 '17 at 00:36
  • 1
    @WiktorStribiżew: I suggested 2 different solutions in my question (one by reordering groups which I don't have control over since I get the regex from a source I do not control and by checking all possible substrings which is too slow). Suggesting 2 different solutions obviously means I admit that there are multiple ways. – pathikrit Feb 28 '17 at 00:38
  • 3
    @pathikrit If you're inserting the alternation dynamically, you have enough control to re-sort (I prefer reverse-alphabetically for a dynamic scenario like this). I just answered a similar question the other day http://stackoverflow.com/questions/42432356/regex-order-of-ord-values-in-capture-group-changes-results/42447293#42447293 – Regular Jo Feb 28 '17 at 02:25
  • 1
    `((?:universe|hello(?: world)?)` The only way you wouldn't have control over the regex is when it is being input externally and run into an engine. If that's the case, you should not be trying to optimize that as it has nothing to do with you. But, I guess you could parse the string your self and do it that way. And, I hope you wrap the expression in try / catch at instantiation because it will blow up. –  Feb 28 '17 at 03:36
  • 1
    Thanks for the suggestions but I have absolutely no control on the pattern I am getting. Given, this requirement (you have an instance of `Pattern` and a `String`), how can I find the longest substring of the String which matches given pattern? – pathikrit Feb 28 '17 at 17:14
  • 2
    You are a funny guy, pathikrit. You want to cure a disease by looking at the symptoms instead of the root cause. This is like taking Aspirin against a headache caused by an operable brain tumor. If the regex is outside of your control and does the wrong thing, either do not use it at all but your own regex or talk to the guy maintaining the regex generator in order to make him fix it for you upstream. – kriegaex Mar 05 '17 at 13:30
  • 1
    @kriegaex: I do not have control of that - its a library that I do not maintain (its not even written in Java) which produces named groups `(?(t1|t2))|(?(t3|t4))` given a multimap `g1 -> Set(t1, t2); g2 -> Set(t3, t4)`. – pathikrit Mar 06 '17 at 15:18
  • 1
    As I said: You do not need to be in control in order to talk to upstream developers and the implementation language of your library is also irrelevant for that matter. But be it as it might, good luck for your workaround. – kriegaex Mar 06 '17 at 16:26
  • 1
    Do you want to ensure that the entire string matches or any part of it? If the latter - there is no generic solution for your case. The regex might have lookarounds, hence if you slice it and try to match the substrings, the lookaround might not match, even if it did in the origin string. – ndnenkov Mar 08 '17 at 14:16
  • 1
    @ndn: Any part - and in my case, the regex is pretty vanilla ORs – pathikrit Mar 08 '17 at 15:04
  • 3
    @pathikrit, if the regex is always very simple and consistent, just edit it and use the edited version to match. – ndnenkov Mar 08 '17 at 15:11
  • 1
    By edit the regex I don't mean edit the code that generates it, but get the generated regex and construct a new one based on it. – ndnenkov Mar 09 '17 at 15:17

4 Answers4

4

Here is a way of doing it using matcher regions, but with a single loop over the string index:

public static String findLongestMatch(String regex, String s) {
    Pattern pattern = Pattern.compile("(" + regex + ")$");
    Matcher matcher = pattern.matcher(s);
    String longest = null;
    int longestLength = -1;
    for (int i = s.length(); i > longestLength; i--) {
        matcher.region(0, i);
        if (matcher.find() && longestLength < matcher.end() - matcher.start()) {
            longest = matcher.group();
            longestLength = longest.length();
        }
    }
    return longest;
}

I'm forcing the pattern to match until the region's end, and then I move the region's end from the rightmost string index towards the left. For each region's end tried, Java will match the leftmost starting substring that finishes at that region's end, i.e. the longest substring that ends at that place. Finally, it's just a matter of keeping track of the longest match found so far.

As a matter of optimization, and since I start from the longer regions towards the shorter ones, I stop the loop as soon as all regions that would come after are already shorter than the length of longest substring already found.


An advantage of this approach is that it can deal with arbitrary regular expressions and no specific pattern structure is required:

findLongestMatch("(?<m1>(hello|universe))|(?<m2>(hello world))", "hello world")
==> "hello world"

findLongestMatch("hello( universe)?", "hello world")
==> "hello"

findLongestMatch("hello( world)?", "hello world")
==> "hello world"

findLongestMatch("\\w+|\\d+", "12345 abc")
==> "12345"
Helder Pereira
  • 5,522
  • 2
  • 35
  • 52
  • 1
    Although you did some clever optimizations to break out early, this is still O(n^2) since matched.find() is O(n) and it is in a loop of length n – pathikrit Aug 08 '17 at 14:24
  • 1
    @pathikrit I agree with you. However, the solutions in the link you posted in the question are `O(n^3)` and not `O(n^2)`. – Helder Pereira Aug 08 '17 at 15:19
  • Alternatively, you could [find all matches of the regular expression](https://stackoverflow.com/questions/6020384/create-array-of-regex-matches) and select the longest match(es) from the list, but this might not be efficient. – Anderson Green May 15 '19 at 14:47
2

If you are dealing with just this specific pattern:

  1. There is one or more named group on the highest level connected by |.
  2. The regex for the group is put in superfluous braces.
  3. Inside those braces is one or more literal connected by |.
  4. Literals never contain |, ( or ).

Then it is possible to write a solution by extracting the literals, sorting them by their length and then returning the first match:

private static final Pattern g = Pattern.compile("\\(\\?\\<[^>]+\\>\\(([^)]+)\\)\\)");

public static final String findLongestMatch(String s, Pattern p) {
    Matcher m = g.matcher(p.pattern());
    List<String> literals = new ArrayList<>();
    while (m.find())
        Collections.addAll(literals, m.group(1).split("\\|"));
    Collections.sort(literals, new Comparator<String>() {
        public int compare(String a, String b) {
            return Integer.compare(b.length(), a.length());
        }
    });
    for (Iterator<String> itr = literals.iterator(); itr.hasNext();) {
         String literal = itr.next();
         if (s.indexOf(literal) >= 0)
              return literal;
    }
    return null;
}

Test:

System.out.println(findLongestMatch(
    "hello world",
    Pattern.compile("(?<m1>(hello|universe))|(?<m2>(hello world))")
));
// output: hello world
System.out.println(findLongestMatch(
    "hello universe",
    Pattern.compile("(?<m1>(hello|universe))|(?<m2>(hello world))")
));
// output: universe
maraca
  • 8,468
  • 3
  • 23
  • 45
1

just add the $ (End of string) before the Or separator |.
Then it check whether the string is ended of not. If ended, it will return the string. Otherwise skip that part of regex.

The below code gives what you want

import java.util.regex.*;
public class RegTest{
  public static void main(String[] arg){
        String regex = "(?<m1>(hello|universe))$|(?<m2>(hello world))";
        String s = "hello world";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(s);
        while(matcher.find()) {
            MatchResult matchResult = matcher.toMatchResult();
            String substring = s.substring(matchResult.start(), matchResult.end());
            System.out.println(substring);
        }
    }
}

Likewise, the below code will skip hello , hello world and match hello world there
See the usage of $ there

import java.util.regex.*;
public class RegTest{
  public static void main(String[] arg){
        String regex = "(?<m1>(hello|universe))$|(?<m2>(hello world))$|(?<m3>(hello world there))";
        String s = "hello world there";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(s);
        while(matcher.find()) {
            MatchResult matchResult = matcher.toMatchResult();
            String substring = s.substring(matchResult.start(), matchResult.end());
            System.out.println(substring);
        }
    }
}
Sagar V
  • 12,158
  • 7
  • 41
  • 68
  • I like your idea :) – Vitali Pom Mar 11 '17 at 13:27
  • 4
    I don't think this is correct, e.g. "hello, world!" you won't match anything, but the expression should match "hello" or "hello universe!" should match "universe" but you won't match anything again. – maraca Mar 11 '17 at 15:55
  • 4
    Or "hello world hello" you will find "hello" but you should find "hello world". What you are doing is basically a `endsWith()` which doesn't ensure you find the longest match. – maraca Mar 11 '17 at 16:09
1

If the structure of the regex is always the same, this should work:

String regex = "(?<m1>(hello|universe))|(?<m2>(hello world))";
String s = "hello world";

//split the regex into the different groups
String[] allParts = regex.split("\\|\\(\\?\\<");
for (int i=1; i<allParts.length; i++) {
    allParts[i] = "(?<" + allParts[i];
}

//find the longest string
int longestSize = -1;
String longestString = null;
for (int i=0; i<allParts.length; i++) {
    Pattern pattern = Pattern.compile(allParts[i]);
    Matcher matcher = pattern.matcher(s);
    while(matcher.find()) {
        MatchResult matchResult = matcher.toMatchResult();
        String substring = s.substring(matchResult.start(), matchResult.end());
        if (substring.length() > longestSize) {
            longestSize = substring.length();
            longestString = substring;
        }
    }
}
System.out.println("Longest: " + longestString);
user7291698
  • 1,972
  • 2
  • 15
  • 30