5

In NFA it is easy to make all previously non-final states accepting to make it match language of all substrings of a given language.

In Java regex engine, is there a way to find out if a string is a starting substring of a string that matches given regex?

expression regexX ~ "any start of", regexA = any normal regex

resulting expression "regexXregexA" matches all starting substrings of all matches of "regexA":

example:

regexA = a*b, matches "ab" and not "a"
  
"regexXa*b", matches "a" because it is a start of "ab" (and "aab")  

edit:

Since some people still fail to understand, here is a program test for this question:

import java.util.regex.*;
public class Test1 {
    public static void main(String args[]){
       String regex = "a*b";
       System.out.println(
       partialMatch(regex, "aaa");
       );
     }
public boolean partialMatch(String regex, String begining){
//return true if there is a string which matches the regex and    
//startsWith(but not equal) begining, false otherwise 
}
}

must result in true.

iantonuk
  • 1,178
  • 8
  • 28
  • Could you please explain what is your question about? – cn007b Feb 06 '17 at 17:15
  • Are you looking for a [word boundary](http://stackoverflow.com/questions/1324676/what-is-a-word-boundary-in-regexes)? – Eric Duminil Feb 06 '17 at 17:17
  • No, it's very clearly explains in the text of the question. What is regexX? – iantonuk Feb 06 '17 at 17:18
  • you have asked about **java** with no** java** tag. – Shakiba Moshiri Feb 06 '17 at 17:18
  • I will be ok if you post an answer in any other language. – iantonuk Feb 06 '17 at 17:19
  • 3
    "very clearly" is subjective IMHO. At least one other user didn't understand what your question is about. – Eric Duminil Feb 06 '17 at 17:21
  • You comment is illogical. I'm pretty sure there are more then 1 person who don't understand even calculus even though its questions are surely clearly explained. – iantonuk Feb 06 '17 at 17:24
  • Here is example to make it completely "unsubjective": import java.util.regex.*; public class Test1 { public static void main(String args[]){ String regex = "a*b"; System.out.println( Pattern.matches(regexX+regex, "aaa") ); } } – iantonuk Feb 06 '17 at 17:29
  • Find a regexX that will result in "true" – iantonuk Feb 06 '17 at 17:31
  • What do you mean by *"`"a"` matches `"regexXa*b"`"*? – axiac Feb 06 '17 at 17:44
  • I won't explain what a regular expression is here. Please read the question: I already posted a clear programming example. Just find some string x to substitute for regexX to make program output "true"; – iantonuk Feb 06 '17 at 17:50
  • 1
    What is **regexX**? a condition to test if there is the second part? and find your pattern if would be there? – Shakiba Moshiri Feb 06 '17 at 17:55
  • k-five, I don't understand your comment completely. What is regexX? is the question that I ask in the post. Why do you reask it me in the comments.. – iantonuk Feb 06 '17 at 18:04
  • This 'regexX+regex' looks like a string cat. Even with partial matching, you'd have to know which part matched. There is no magic bullet I don't think, to force it. –  Feb 09 '17 at 22:57

2 Answers2

12

What you're looking for is called partial matching, and it's natively supported by the Java regex API (for the record, other engines which offer this feature include PCRE and boost::regex).

You can tell if an input string matched partially by inspecting the result of the Matcher.hitEnd function, which tells if the match failed because the end of the input string was reached.

Pattern pattern = Pattern.compile("a*b");
Matcher matcher = pattern.matcher("aaa");
System.out.println("Matches: " + matcher.matches());
System.out.println("Partial match: " + matcher.hitEnd());

This outputs:

Matches: false
Partial match: true
Lucas Trzesniewski
  • 50,214
  • 11
  • 107
  • 158
  • Thank you, that's actually 1-to-1 mapping of the NFA idea. Could you elaborate about how the matches() implemented? Is the just an underlying NFA and if the input end is hit in a nonfinal state(both not accepting and not declining) the "hitEnd" flag gets up? – iantonuk Feb 09 '17 at 14:47
  • @bedbad I don't know the implementation details, but the idea behind it is like you described: if you end up in a nonfinal state with no more possible transitions (because there's no more input), then `hitEnd` will return `true`. – Lucas Trzesniewski Feb 09 '17 at 15:01
  • @bedbad BTW I recently answered a very similar question for JavaScript [here](http://stackoverflow.com/a/41580048/3764814), you may be interested as JS doesn't provide that feature out of the box and the workaround is (roughly) to rewrite the pattern to include `$` at every "point" in the regex, which is equivalent to marking all possible NFA states as final if the end of the input was reached. – Lucas Trzesniewski Feb 09 '17 at 15:09
  • Here's your bounty, sir. II couldn't find that, the question is fairly underrated in my opinion. Is it possible to do in java and C/C++ as well? can you link the source boost:regex where it is implemented? – iantonuk Feb 09 '17 at 15:43
  • Yes, you can use PCRE for C and C++, and boost::regex for C++. Here's the [boost partial match doc](http://www.boost.org/doc/libs/1_63_0/libs/regex/doc/html/boost_regex/partial_matches.html), and here's the [PCRE2 doc about partial matching](http://www.pcre.org/current/doc/html/pcre2partial.html) – Lucas Trzesniewski Feb 09 '17 at 15:56
3

In NFA it is easy to make all previously non-final states accepting to make it match language of all substrings of a given language.

Indeed, it can be accomplished by adding a new final state and an ε-move from each state (final or non-final) to the new final state.

Afaik there is no regex equivalent for this operation.

It is possible that some regex libraries provides a way to verify if a string is a partial match of a regex, I don't know. I don't know Java, I work mainly in PHP and it doesn't provide such a feature. Maybe there are libraries that does it but I never needed one.

For a small, specific regex you can try to build a new regex that matches strings that would partially match the original regex by combining this simple rules:

  • a -> a?
  • ab -> ab?
  • a* -> a*
  • a+ -> a*
  • a|b -> (a|b)?
  • etc

a and b above are sub-regexps of the original regex. Use parentheses as needed.

axiac
  • 68,258
  • 9
  • 99
  • 134
  • Thanks for the attempt axiac! Unfortunately I can't use small and/or specific regex: I'm building Parser Generator(this part is for Lexer Generator) (so I have no idea what grammar input will be). I need to know if the string so far matches any regular expression.(because the string that finally matches it fully can be "infinitely" long(or very long)). – iantonuk Feb 06 '17 at 18:08
  • Well, if you are building a lexer you should know better than me how to convert NFAs to regexps and vice-versa. – axiac Feb 06 '17 at 18:14
  • Yes, it is a solution: convert all runtime regexs to NFAs, make that operation, convert back and do matching. I stumbled across this problem completely unexpectedly and felt it should take no more than 3 lines of code. I don't know if there is full implementation for NFAs in java standard packages. I just didn't feel right to do it just for this occasion so I asked here. Bottom line I thought that java regexes should do anything for FAs, that's why I asked. – iantonuk Feb 06 '17 at 18:22
  • I didn't mark up your answer because it's actually incorrect. If you try to infer it ("a and b above are sub-regexps of the original regex"). You can see that in the second rule we divide our regex in 2 those subregexs: a and b. Then applying (in order) first rule to a -> a? will lead to a?b? which incorrect because it matches "b", which it should not. So a and b can not be other regexs.(i.e the set of doesn't work). What rules should do is finding all possible suffixes and marking them '?' – iantonuk Feb 06 '17 at 20:46
  • I agree with you regarding my answer. You can notice there is an "etc" as the last item; I didn't work very hard writing it and I know it doesn't help you very much. Especially since you don't have a small, known regex that can be easily converted. The second conversion rule should read: `Ab -> Ab?` where `b` is a character and `A` is a sub-expression. The others should also be reformulated to be usable. Initially I didn't intend to write an answer but it was too long for a comment. – axiac Feb 06 '17 at 21:04
  • My counterexample still holds: Ab -> Ab? ->(A)b?->(a)b?->a?b? not fit. You need to give a set of rules which will do exactly all suffixes optional and nothing more. – iantonuk Feb 07 '17 at 16:57