Regex - check if input still has chances to become matching

Question

We've got such regexp:

var regexp = /^one (two)+ three/;

So only string like "one two three" or "one two three four" or "one twotwo three" etc. will match it.

However, if we've got string like

"one " - is still 'promising' that maybe soon it will match

but this string: "one three" will never match no matter what we'll do.

Is there some way to check if given string have chances to become matching or not?

I need it for some tips during writing when I want to recommend all options that begins with given input (regexp's I'm using are pretty long and I dont want really to mess with them).

In other words - I want to check if string has ended during checking and nothing 'not matching' was faced.

In even more other words - Answer would be inside reason of not matching. If reason is end of string - then it would be promissing. However I dont know any way to check why some string didnt match

Are you looking for a function that will indicate completeness of partial match numerically or indicate no match if the input won't match? So for `"one "` it would return `.3076923` (4/13) but for `"one three"` it would return `-1`? — Jason Aller, Mar 18 '14 at 15:09
Seems like the best way forward would be to write a whole library to do just that and cater for all the different scenarios that JS' regex can offer... — Jerry, Mar 18 '14 at 16:28
Very interesting question. You can rewrite your expression in the following form: `^o(?:n(?:e(?: (?:(?:two)*(?:t(?:w(?:o(?: (?:t(?:h(?:r(?:e(?:e.*?)?)?)?)?)?)?)?)?)?)?)?)?)?$` to check if the text is promising. If your expressions are not complex, I believe it is possible to automate derivation of promising regex from main one. — Ulugbek Umirov, Mar 18 '14 at 18:25
Are you married to JavaScript? This question comes up a lot, and it's just not possible unless the capability is built in to the regex implementation. It's usually called *partial matching*, and as far as I know, Java and [Boost](http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/partial_matches.html) are the only flavors that support it. — Alan Moore, Mar 19 '14 at 02:08
@AlanMoore FYI PCRE [also supports this](http://www.pcre.org/current/doc/html/pcre2partial.html). — Lucas Trzesniewski, Jan 10 '17 at 20:48

Lucas Trzesniewski · Accepted Answer · 2020-07-18T16:07:32.630

This is a regex feature known as partial matching, it's available in several regex engines such as PCRE, Boost, Java but not in JavaScript.

Andacious's answer shows a very nice way to overcome this limitation, we just need to automate this.

Well... challenge accepted :)

Fortunately, JavaScript has a very limited regex feature set with a simple syntax, so I wrote a simple parser and on-the-fly transformation for this task, based on the features listed on MDN. This code has been updated to handle ES2018 features.

A couple points of interest:

This produces a regex which will almost always match the empty string. Therefore a failed partial match occurs when the result of exec is null or an array whose first element is the empty string
Negative lookaheads are kept as-is. I believe that's the right thing to do. The only ways to fail a match is through them (ie put a (?!) in the regex) and anchors (^ and $). Lookbehinds (both positive and negative) are also kept as-is.
The parser assumes a valid input pattern: you can't create a RegExp object from an invalid pattern in the first place. This may break in the future if new regex features are introduced.
This code won't handle backreferences properly: ^(\w+)\s+\1$ won't yield a partial match against hello hel for instance.

RegExp.prototype.toPartialMatchRegex = function() {
    "use strict";
    
    var re = this,
        source = this.source,
        i = 0;
    
    function process () {
        var result = "",
            tmp;

        function appendRaw(nbChars) {
            result += source.substr(i, nbChars);
            i += nbChars;
        };
        
        function appendOptional(nbChars) {
            result += "(?:" + source.substr(i, nbChars) + "|$)";
            i += nbChars;
        };

        while (i < source.length) {
            switch (source[i])
            {
                case "\\":
                    switch (source[i + 1])
                    {
                        case "c":
                            appendOptional(3);
                            break;
                            
                        case "x":
                            appendOptional(4);
                            break;
                            
                        case "u":
                            if (re.unicode) {
                                if (source[i + 2] === "{") {
                                    appendOptional(source.indexOf("}", i) - i + 1);
                                } else {
                                    appendOptional(6);
                                }
                            } else {
                                appendOptional(2);
                            }
                            break;

                        case "p":
                        case "P":
                            if (re.unicode) {
                                appendOptional(source.indexOf("}", i) - i + 1);
                            } else {
                                appendOptional(2);
                            }
                            break;

                        case "k":
                            appendOptional(source.indexOf(">", i) - i + 1);
                            break;
                            
                        default:
                            appendOptional(2);
                            break;
                    }
                    break;
                    
                case "[":
                    tmp = /\[(?:\\.|.)*?\]/g;
                    tmp.lastIndex = i;
                    tmp = tmp.exec(source);
                    appendOptional(tmp[0].length);
                    break;
                    
                case "|":
                case "^":
                case "$":
                case "*":
                case "+":
                case "?":
                    appendRaw(1);
                    break;
                    
                case "{":
                    tmp = /\{\d+,?\d*\}/g;
                    tmp.lastIndex = i;
                    tmp = tmp.exec(source);
                    if (tmp) {
                        appendRaw(tmp[0].length);
                    } else {
                        appendOptional(1);
                    }
                    break;
                    
                case "(":
                    if (source[i + 1] == "?") {
                        switch (source[i + 2])
                        {
                            case ":":
                                result += "(?:";
                                i += 3;
                                result += process() + "|$)";
                                break;
                                
                            case "=":
                                result += "(?=";
                                i += 3;
                                result += process() + ")";
                                break;
                                
                            case "!":
                                tmp = i;
                                i += 3;
                                process();
                                result += source.substr(tmp, i - tmp);
                                break;

                            case "<":
                                switch (source[i + 3])
                                {
                                    case "=":
                                    case "!":
                                        tmp = i;
                                        i += 4;
                                        process();
                                        result += source.substr(tmp, i - tmp);
                                        break;

                                    default:
                                        appendRaw(source.indexOf(">", i) - i + 1);
                                        result += process() + "|$)";
                                        break;        
                                }
                                break;
                        }
                    } else {
                        appendRaw(1);
                        result += process() + "|$)";
                    }
                    break;
                    
                case ")":
                    ++i;
                    return result;
                    
                default:
                    appendOptional(1);
                    break;
            }
        }
        
        return result;
    }
    
    return new RegExp(process(), this.flags);
};






// Test code
(function() {
    document.write('<span style="display: inline-block; width: 60px;">Regex: </span><input id="re" value="^one (two)+ three"/><br><span style="display: inline-block; width: 60px;">Input: </span><input id="txt" value="one twotw"/><br><pre id="result"></pre>');
    document.close();

    var run = function() {
        var output = document.getElementById("result");
        try
        {
            var regex = new RegExp(document.getElementById("re").value);
            var input = document.getElementById("txt").value;
            var partialMatchRegex = regex.toPartialMatchRegex();
            var result = partialMatchRegex.exec(input);
            var matchType = regex.exec(input) ? "Full match" : result && result[0] ? "Partial match" : "No match";
            output.innerText = partialMatchRegex + "\n\n" + matchType + "\n" + JSON.stringify(result);
        }
        catch (e)
        {
            output.innerText = e;
        }
    };

    document.getElementById("re").addEventListener("input", run);
    document.getElementById("txt").addEventListener("input", run);
    run();
}());

I tested it a little bit and it seems to work fine, let me know if you find any bugs.

Nice solution to the OP. I just wonder if it would make sense that also the empty string is considered a partial match, the OP states that anything that might still match should be considered promising. To me, it would make sense to determine that it does not match only when there is an attempt at input that clearly can never match the regex. — Tomas Langkaas, Jan 12 '17 at 15:12
@TomasLangkaas sure, the empty string can *still* match if you add content to it. The transformed regex won't try to do anything about it and *will* return a result on an empty input string. I just filtered that out in the demo code by using `result && result[0]` as a condition, but you may choose to remove the `&& result[0]` part to consider the empty string as a partial match. — Lucas Trzesniewski, Jan 12 '17 at 16:06
@LucasTrzesniewski would you mind if I'd create github repo with your solution for RegExp? — Adam Pietrasiak, May 21 '17 at 06:50
@AdamPietrasiak not at all, go on. Make sure to mention the backreferences handling limitation though. — Lucas Trzesniewski, May 21 '17 at 10:37
@DavidCulbreth no, unless someone else made it (I don't manage any npm packages) — Lucas Trzesniewski, Feb 22 '19 at 10:04
There is this, a Incremental Regular Expression Matcher with an NPM package: https://github.com/nurulc/incr-regex-package — Nathan, Jul 17 '19 at 23:21
This is great, however it didn't work correctly for an email regex taken from https://emailregex.com/ If I input: `[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,64}` and put in a partial input of: `a@a.com.` It considers that a full match when it's in fact incomplete. — Patrick, Aug 06 '19 at 06:38
@Patrick `a@a.com` is actually a full match against your regex. See [here](https://regex101.com/r/Xhti2C/1). — Lucas Trzesniewski, Aug 06 '19 at 12:02
@LucasTrzesniewski Was referring to a@a.com. (Additional dot at the end) — Patrick, Aug 06 '19 at 23:08
@Patrick oh ok sorry. It's *still* a full match if you add the dot at the end, since your pattern is not anchored. Wrap your pattern between `^` and `$` to anchor it. It's probably what you always want anyway when looking for a partial match. — Lucas Trzesniewski, Aug 07 '19 at 07:59
@LucasTrzesniewski you are a genius, worked perfectly thank you! — Patrick, Aug 08 '19 at 01:54
FYI this code enters an infinite loop while stepping onto named capturing group. E.g. this is a part of my regex using named backreference: (?KAT|BYT)-[A-Z]{2,}/00\d{4}/%y/\k — Tomasz Pala, Jul 16 '20 at 05:54
@TomaszPala thanks for the info! Named capturing groups did not exist in JS back when I wrote that code, which explains why it doesn't work. I'm not sure if I should update this answer to handle the new features (and continue doing it in the future) or just leave it as-is. Also, backreferences are not handled properly (unfortunately). — Lucas Trzesniewski, Jul 16 '20 at 12:26
@Lucas - it would be really nice if your code get some update (obviously:]) to handle such features, but I'm not in position to ask for it. In fact, the example I've given comes from my backend engine that validates input after being submitted - which I exposed into UI for some soft pre-validation. Thus I'm not even sure if JS itself handles this properly. But what I must suggest is some improvement, which should be made - either skip over such not-fully/unhandled features (which would most probably make the validation fail, unless you lazy-"handle" this by replacing with ".*"). — Tomasz Pala, Jul 18 '20 at 11:01
...or at least add some "default: return false", as the infinite loop is the worst any code can get into. — Tomasz Pala, Jul 18 '20 at 11:01
@TomaszPala I updated the code to handle the new ES2018 features. Your pattern should now be handled, except for the backreference. — Lucas Trzesniewski, Jul 18 '20 at 16:10
I must say I'm very impressed with the results! Named backref must be entered in the full captured form, but this is fully understandable for such short drop-in solution:) Kudos! — Tomasz Pala, Jul 21 '20 at 09:38

score 16 · Answer 2 · answered Mar 18 '14 at 20:09

16

Another interesting option that I have used before is to OR every character expected with the $ symbol. This may not work well for every case but for cases where you are looking at specific characters and need a partial match on each character, this works.

For example (in Javascript):

var reg = /^(o|$)(n|$)(e|$)(\s|$)$/;

reg.test('')      -> true;
reg.test('o')     -> true;
reg.test('on')    -> true;
reg.test('one')   -> true;
reg.test('one ')  -> true;
reg.test('one t') -> false;
reg.test('x')     -> false;
reg.test('n')     -> false;
reg.test('e')     -> false;
reg.test(' ')     -> false;

While this isn't the prettiest regex, it is repeatable so if you need to generate it dynamically for some reason, you know the general pattern.

The same pattern can be applied to whole words as well, which probably isn't as helpful because they couldn't type one-by-one to get to these points.

var reg = /^(one|$)(\stwo|$)$/;

reg.test('')        -> true;
reg.test('one')     -> true;
reg.test('one ')    -> false;
reg.test('one two') -> true;

answered Mar 18 '14 at 20:09

Andacious

1,162
10
17

2

+1. It doesn't solve the OP's problem, but it's a neat technique. In situations where it does apply, it could be a life saver. – Alan Moore Mar 19 '14 at 01:50
3

I based [my answer](http://stackoverflow.com/a/41580048/3764814) upon your idea, hope you don't mind :) – Lucas Trzesniewski Jan 12 '17 at 10:05
1

This is a great answer. What a neat technique – Lars Holdaas Dec 17 '20 at 05:11
1

This is just astonishinly clever. I was about to roll up my sleeves and go code up my answer here: https://stackoverflow.com/questions/12015612/convert-a-regular-expression-a-that-accepts-a-set-of-strings-to-one-that-accepts/67570793#answer-67570793 (which I was really proud of BTW) and then I found @LucasTrzesniewski 's implementation of this above, and I thought "why is this so short??" It's absurdly simple. Bravo. – Don Hatch May 19 '21 at 01:54
@AlanMoore In what way does it not solve the OP's problem? – Don Hatch May 19 '21 at 11:56
@DonHatch, I was assuming the OP's real regexes would be more complex, but now that I see [his own answer](https://stackoverflow.com/a/22493289/20938), it seems I was wrong. Never mind. – Alan Moore Jun 17 '21 at 05:30

score 2 · Answer 3 · answered Jul 17 '19 at 23:31

I have found a npm package with a JavaScript implementation of RegEx with incremental regular expression matching: https://www.npmjs.com/package/incr-regex-package. It looks like it is worth checking out. It can report DONE, MORE, MAYBE and FAILED results for a given input.

There is also an example implementation of an input component for React here: https://www.npmjs.com/package/react-maskedinput. It uses {RXInputMask} from incr-regex-package to provide a more user input focused view of interacting with the RegEx library.

Robin · Answer 4 · 2014-03-18T16:27:12.683

EDIT: after you edited your post the clarify, this answer might not be relevant to your specific question. Leaving it here for reference.

Inside the regex itself, on the subject of failing quickly once you know you won't find anything:

The anchor in your regex means that once the regex engine reaches the h of three, it will stop looking for matches and not try to start a match on the second character. On this case, I don't think you can do better than that (but it's already linear complexity, so not so bad).

On other cases I believe you have some usual best practices to learn in order to fail as quickly as possible when you know a match can't be found anymore.

You can look at possessive quantifiers if you don't already know them, but there are many other tricks...

rgthree · Answer 5 · 2014-03-18T15:40:48.667

0

Your original question was simply testing a string's placement within another, specifically, the start. The fastest way to do that would be to use substr on the match string, followed by indexOf. I've updated my original answer to reflect that:

function check(str){
  return 'one two three'.substr(0, str.length) === str;
};
console.log(check('one'));           // true
console.log(check('one two'));       // true
console.log(check('one two three')); // true
console.log(check('one three'));     // false

If you need case-insensitivity, it's still fastest to simply toLowerCase the match & input strings. (If interested, here's a jsperf of substr, indexOf and RegExp testing for the start of a string, case-insensitively: http://jsperf.com/substr-vs-indexof-vs-regex )

edited Mar 18 '14 at 15:40

answered Mar 18 '14 at 15:10

rgthree

7,217
17
21

Will it work with more complex ones (special chars, closures etc)? – Adam Pietrasiak Mar 18 '14 at 15:12
1

@Kluska000 No, it won't. This handles a very specific situation where your regex is literally `^something literal here`, in which case you might as well do `return rgxStr.substr(0,str.length) == str`. For that reason, I'm downvoting this answer. – Niet the Dark Absol Mar 18 '14 at 15:13
@NiettheDarkAbsol Ha, yeah. Answered too quickly before thinking it through. I've updated my answer to reflect the true nature of the answer. Thanks. – rgthree Mar 18 '14 at 15:44

Adam Pietrasiak · Answer 6 · 2014-03-19T11:59:21.853

Basing on @Andacious answer, I've made my own, a little more advanced and it seems to work.

I've made method that modifies regex making it accepting promising answers.

It replaces any literal char with "(?:OLD_LETTER|$)" so k becomes (?:k|$) looking for matching letter or end of input.

It also looks for parts not to replace like {1,2} and leave them as they are.

I'm sure it's not complete, but its very easy to add new rules of checking and main trick with any_sign or end of input seems to work in any case as end of string match and dont continue so basically we need to make such modification to main regex, that any literal char or group of chars will have alternative |$ and every syntax (that sometimes seems to have literal chars too) can't be destroyed.

RegExp.prototype.promising = function(){
    var source = this.source;
    var regexps = {
        replaceCandidates : /(\{[\d,]\})|(\[[\w-]+\])|((?:\\\w))|([\w\s-])/g,   //parts of regexp that may be replaced
        dontReplace : /\{[\d,]\}/g,     //dont replace those
    }

    source =  source.replace(regexps.replaceCandidates, function(n){ 
        if ( regexps.dontReplace.test(n) ) return n;
        return "(?:" + n + "|$)";
    });


    source = source.replace(/\s/g, function(s){
        return "(?:" + s + "|$)";
    });

    return new RegExp(source);

}

Test on jsFiddle

score 0 · Answer 7 · answered Jan 11 '17 at 00:54

0

Still not 100% sure what you are asking for, but you can also nest them, like this:

var regexp = /^(one)+((\s+two)+((\s+three)+((\s+four)+)?)?)?$/;

Matches:

one
one two two
one two two three
one two two three three four

Does not match:

one three

answered Jan 11 '17 at 00:54

James Wilkins

6,836
3
48
73

score -1 · Answer 8 · answered Mar 18 '14 at 15:10

-1

That's actually quite an interesting question.

Personally, I'd do it with the RegExp constructor, so the query can be broken down:

var input = "one ";
var query = "^one two three";
var q_len = query.length;
var match;
for( var i=1; i<=q_len; i++) {
    match = input.match(new RegExp(query.substr(0,i));
    if( !match) {alert("Pattern does NOT match"); break;}
    if( match[0].length == input.length) {alert("Input is promising!"); break;}
    // else continue to the next iteration
}

Obviously you can handle the "exact match" case up front to avoid the whole loop thing. If the whole pattern matches the input, then you're all good.

EDIT: I've just realised that this will not work for groups and stuff. It will fail on account of malformed regexes, but I hope it can serve as a basis for your query.

answered Mar 18 '14 at 15:10

Niet the Dark Absol

320,036
81
464
592

I've editet regexp in my question so its using special chars too. Will it still work? – Adam Pietrasiak Mar 18 '14 at 15:15
@Kluska000 I've edited my answer ;) Unfortunately, it will not work if you have any kind of grouping, because the resulting regex will be malformed and cause an error. You could wrap the loop's contents in a `try..catch` block to simply skip over iterations where the regex is invalid. – Niet the Dark Absol Mar 18 '14 at 15:16
Assuming all query or big part of it is grouped, this kind of 'try catch' would be catched for most of the time, so there would be 'big steps' of next working queries. – Adam Pietrasiak Mar 18 '14 at 15:21
Yup. I don't think there's any general-case solution to this problem, without the heavy use of parsers and stuff. – Niet the Dark Absol Mar 18 '14 at 15:24

elixenide · Answer 9 · 2014-03-18T19:23:23.317

Here's a simple proof of concept in a jsFiddle. You basically just loop through the proposed regex in reverse and look for the longest sub-match.

Note: this has a known issue in that it doesn't always handle groups well. For example, it will say foo bar b doesn't match foo( bar)+ at all, when it ought to say there's still hope. This could be fixed by getting much more creative with the following line:

temp_re = new RegExp(regex_part+'$'); // make sure this partial match ends in a piece that could still match the entire regex

Basically, you would need to parse the partial regex to see if it ends in a group, then check recursively for partial matches against that ending group. That's rather complicated, so I'm not getting into it for purposes of my demo.

JavaScript (using jQuery only for demo purposes):

var strings = {
    matches: "Matches!",
    stillhope: "There's still hope...",
    mismatch: "Does not match",
    empty: "No text yet"
};

// Object to handle watching for partial matches
function PartialRegexMonitor(regex, input_id) {
    var self = this;
    this.relen = regex.length;
    $('body').on('keyup', '#'+input_id, function() {
         self.update_test_results(regex, input_id);
    });
}

PartialRegexMonitor.prototype.update_test_results = function(regex, input_id) {
    var input = $('#'+input_id).val(),
        matches = find_partial_matches(regex, input),
        span = $('#'+input_id).siblings('.results');
    span.removeClass('match');
    span.removeClass('stillhope');
    span.removeClass('mismatch');
    span.removeClass('empty');
    span.addClass(matches.type)
        .html(strings[matches.type]);
}

// Test a partial regex against a string
function partial_match_tester(regex_part, str) {
    var matched = false;
    try {
        var re = new RegExp(regex_part, 'g'),
            matches = str.match(re),
            match_count = matches.length,
            temp_re;
        for(var i = 0; i < match_count; i++) {
            temp_re = new RegExp(regex_part+'$'); // make sure this partial match ends in a piece that could still match the entire regex
            matched = temp_re.test(str);
            if(matched) break;
        }
    }
    catch(e) {
    }
    return matched;
}

// Find the longest matching partial regex
function find_partial_matches(regex, str) {
    var relen = regex.length,
        matched = false,
        matches = {type: 'mismatch',
                   len: 0},
        regex_part = '';
    if(str.length == 0) {
        matches.type = 'empty';
        return matches;
    }

    for(var i=relen; i>=1; i--) {
        if(i==1 && str[0] == '^') { // don't allow matching against only '^'
            continue;
        }
        regex_part = regex.substr(0,i);
        // replace {\d}$ with {0,\d} for testing
        regex_part = regex_part.replace(/\{(\d)\}$/g, '{0,$1}');

        matched = partial_match_tester(regex_part, str);
        if(matched) {
            matches.type = (i==relen ? 'matches' : 'stillhope');
            console.log(matches.type + ": "+regex.substr(0,i)+" "+str);
            matches.len = i;
            break;
        }
    }
    return matches;
}

// Demo
$(function() {
    new PartialRegexMonitor('foo bar', 'input_0');
    new PartialRegexMonitor('^foo bar$', 'input_1');
    new PartialRegexMonitor('^fo+(\\s*b\\S[rz])+$', 'input_2');
    new PartialRegexMonitor('^\\d{3}-\\d{3}-\\d{4}$', 'input_3');
});

HTML for the demo:

<p>
Test against <span class="regex">foo bar</span>:<br/>
    <input type="text" id="input_0" />
    <span class="results empty">No text yet</span>
</p>
<p>
    Test against <span class="regex">^foo bar$</span>:<br/>
    <input type="text" id="input_1" />
    <span class="results empty">No text yet</span>
</p>
<p>
    Test against <span class="regex">^fo+(\s*b\S[rz])+$</span> (e.g., "foo bar", "foo baz", "foo bar baz"):<br/>
    <input type="text" id="input_2" />
    <span class="results empty">No text yet</span>
</p>
<p>
    Test against <span class="regex">^\d{3}-\d{3}-\d{4}$</span>:<br/>
    <input type="text" id="input_3" />
    <span class="results empty">No text yet</span>
</p>

CSS for the demo

.empty {
    background-color: #eeeeee;
}
.matches {
    background-color: #ccffcc;
    font-weight: bold;
}
.stillhope {
    background-color: #ccffff;
}
.mismatch {
    background-color: #ffcccc;
    font-weight: bold;
}
.regex {
    border-top:1px solid #999;
    border-bottom:1px solid #999;
    font-family: Courier New, monospace;
    background-color: #eee;
    color: #666;
}

Sample screenshot from the demo

enter image description here

But that also means that `foo bad` will pass and return true when I don't think it should. — Jerry, Mar 18 '14 at 16:51
Yeah, I think I misread the question. I'm going to update this answer. — elixenide, Mar 18 '14 at 16:52
http://jsfiddle.net/He5tp/9/ I was playing a little with your solution. Last one is not working. I'm thinking about modifing regex so foo is replaced with foo|fo|f fingind first longest posible matching result. — Adam Pietrasiak, Mar 18 '14 at 20:11
This should still work. Like I say, it may need tweaking when it is supposed to match a group at the end, but hopefully it gives you a place to start. :) — elixenide, Mar 18 '14 at 20:13

score -2 · Answer 10 · answered Mar 18 '14 at 15:18

Not sure if there is a way to do this with regex without making an extremely complicated pattern. But if you are just checking against a string then you can do something like this:

function matchPartialFromBeginning(pattern, value) {
    if(pattern.length == value.length)
        return pattern == value;
    if(pattern.length > value.length)
        return pattern.substr(0, value.length) == value;
    return false;
}

Regex - check if input still has chances to become matching

10 Answers10

Linked

Related