33

Suppose that I have this regular expression: /abcd/ Suppose that I wanna check the user input against that regex and disallow entering invalid characters in the input. When user inputs "ab", it fails as an match for the regex, but I can't disallow entering "a" and then "b" as user can't enter all 4 characters at once (except for copy/paste). So what I need here is a partial match which checks if an incomplete string can be potentially a match for a regex.

Java has something for this purpose: .hitEnd() (described here http://glaforge.appspot.com/article/incomplete-string-regex-matching) python doesn't do it natively but has this package that does the job: https://pypi.python.org/pypi/regex.

I didn't find any solution for it in js. It's been asked years ago: Javascript RegEx partial match and even before that: Check if string is a prefix of a Javascript RegExp

P.S. regex is custom, suppose that the user enters the regex herself and then tries to enter a text that matches that regex. The solution should be a general solution that works for regexes entered at runtime.

Community
  • 1
  • 1
Sassan
  • 2,187
  • 2
  • 24
  • 43
  • The correct way to do this is to _not_ let a user enter a regex to run. Why would you do that? A partial match is based on the 0 or more quantifier. If that's a problem then the user can't enter. –  Mar 03 '17 at 23:10
  • 4
    End user is not entering regex. The regex is entered by some developers. But I wanna automate the procedure. I don't want to ask them to create a "entering" version of their regex. It is indeed possible in Java and Python. I'm asking for a way to do so in JS. So the correct way to do this in my case is indead letting the user enter a regex to run. – Sassan Mar 03 '17 at 23:19
  • `I don't want to ask them to create a "entering" version of their regex.` Your gonna have to. Like I said, you can't just automate the modification of a regex and convert it to reluctant quantifiers. For one thing, you'd have to _parse_ the regex, nesting and all. For another, you run the risk of catastrophic backtracking. It won't work, nothing exists. –  Mar 03 '17 at 23:31
  • The best you could do is to create an input form for developers. Where they are only allowed to enter sequential pieces like a construct, a class or property. Then give them a set of radio's or text boxes for quantifiers. You gather the core info, then set the quantifiers based on their quantifier preferences. _But you assembled the final regex_. This you can automate. –  Mar 03 '17 at 23:39
  • I can do what described in the question in Java and Python, the question is about how to doing the same in js. The hard solution is to implement the relevant part of the python module in js. So there is a solution. I'm asking for easier solutions. But there is indeed a solution. – Sassan Mar 04 '17 at 00:05
  • There is a solution. It's called a _Partial Match_. It's not difficult for an engine to return what it matched when it hit the end of the target string, even if the entire regex did not match. The general problem with this is the state of the match. Meaning other than group 0, no group is reliable as to what it matched. And there are some other engine gyrations the engine has to do. If JS can't do partial matching, you'd need another engine. To do it the way of altering the regex.. will never work correctly. Might as well generate a regex from the input and see if it matches the regex. –  Mar 04 '17 at 16:40
  • Partial. What if it didn't reach the EOS and failed before it? That means you can never use the end anchor `$` in any regex substitution. You could, but it would not match up to the point of an invalid character. And you would not get a partial. –  Mar 04 '17 at 17:00

6 Answers6

22

Looks like you're lucky, I've already implemented that stuff in JS (which works for most patterns - maybe that'll be enough for you). See my answer here. You'll also find a working demo there.

There's no need to duplicate the full code here, I'll just state the overall process:

  • Parse the input regex, and perform some replacements. There's no need for error handling as you can't have an invalid pattern in a RegExp object in JS.
  • Replace abc with (?:a|$)(?:b|$)(?:c|$)
  • Do the same for any "atoms". For instance, a character group [a-c] would become (?:[a-c]|$)
  • Keep anchors as-is
  • Keep negative lookaheads as-is

Had JavaScript have more advanced regex features, this transformation may not have been possible. But with its limited feature set, it can handle most input regexes. It will yield incorrect results on regex with backreferences though if your input string ends in the middle of a backreference match (like matching ^(\w+)\s+\1$ against hello hel).

Community
  • 1
  • 1
Lucas Trzesniewski
  • 50,214
  • 11
  • 107
  • 158
  • What more advanced features are you referring to? It's already got lookarounds which I could see being problematic, but it looks like your code handle these – Aaron Mar 04 '17 at 14:56
  • 1
    @Aaron I was thinking of stuff like recursion, or backtracking verbs like `(*SKIP)`. Hmm and now I'm thinking of it, my code won't handle backreferences properly if the input string ends in the middle of a backreference match. – Lucas Trzesniewski Mar 04 '17 at 14:59
  • Thanks ! And I think you don't need to : if the input ends in the middle of a backreference, it means you haven't reached the point in the pattern where the backreference is enforced. – Aaron Mar 04 '17 at 15:07
  • `(?:a|$)(?:b|$)(?:c|$)` but `abd` would fail entirely. Not really a partial way. End up throwing out `ab` when only `d` is bad. –  Mar 04 '17 at 17:05
  • 1
    @sln the point of partial matching is mostly to tell if an input string failed to match *because more input is required*. `abd` is *not* a partial match to `abc`. This lets you highlight an input field in red as soon as the user enters something unexpected, like that `d` character. – Lucas Trzesniewski Mar 04 '17 at 17:31
  • @LucasTrzesniewski - Not sure what you mean by highlight. I didn't mean to say the overall objective can't be accomplished. But that is in the realm of handling input, not the regex. The regex could return a partial, then the input could just be overwritten, deleting the bad character. You can't use `$` doing it this way. However, you could use the regex anchor and keep track of the last good match. This way you'd have a substring of valid characters from which to highlight bad ones. But that's really the only way. –  Mar 04 '17 at 17:47
  • @LucasTrzesniewski thanks, I'll mark it as accepted answer and will give it the bounty as soon as I do some tests (probably happens tomorrow.) – Sassan Mar 05 '17 at 05:34
  • Seems like it's exactly what I need. – Sassan Mar 05 '17 at 05:42
  • 1
    @Aaron - It's all about the match failing. What does this `if the input ends in the middle of a backreference` mean? It means there could be a point where the match _fails_ and the user can never get past it. Example using his atomic approach: `((?:[a-z]|$){3})(?:\d|$){2}(?:\1|$)` will _never_ match the backreference. When it gets there it will always fail and never match. There are many other problems as well. In general, fwiw, this will never work. –  Mar 05 '17 at 18:10
2

As many have stated there is no standard library, fortunately I have written a Javascript implementation that does exactly what you require. With some minor limitation it works for regular expressions supported by Javascript. see: incr-regex-package.

Further there is also a react component that uses this capability to provide some useful capabilities:

  1. Check input as you type
  2. Auto complete where possible
  3. Make suggestions for possible input values

Demo of the capabilities Demo of use

NurulC
  • 31
  • 1
  • 2
    This might be a good answer, but you need to [provide code on how to use your library in the answer](//meta.stackoverflow.com/questions/251602/recommending-off-site-resources-when-questions-dont-ask-for-it/251605#251605). It's good that you've done the most important thing: stating that you wrote it. You should also be aware that other users tend to respond negatively to answers which are mostly a link to an external library, particularly when it's a link to your own library and when you don't show how to use it in the answer. I suggest you use a snippet in the answer to demonstrate its use. – Makyen Dec 24 '19 at 17:14
1

I think that you have to have 2 regex one for typing /a?b?c?d?/ and one for testing at end while paste or leaving input /abcd/

This will test for valid phone number:

const input = document.getElementById('input')

let oldVal = ''
input.addEventListener('keyup', e => {
  if (/^\d{0,3}-?\d{0,3}-?\d{0,3}$/.test(e.target.value)){
    oldVal = e.target.value
  } else {
    e.target.value = oldVal
  }
})
input.addEventListener('blur', e => {
  console.log(/^\d{3}-?\d{3}-?\d{3}-?$/.test(e.target.value) ? 'valid' : 'not valid')
})
<input id="input">

And this is case for name surname

const input = document.getElementById('input')

let oldVal = ''
input.addEventListener('keyup', e => {
  if (/^[A-Z]?[a-z]*\s*[A-Z]?[a-z]*$/.test(e.target.value)){
    oldVal = e.target.value
  } else {
    e.target.value = oldVal
  }
})
input.addEventListener('blur', e => {
  console.log(/^[A-Z][a-z]+\s+[A-Z][a-z]+$/.test(e.target.value) ? 'valid' : 'not valid')
})
<input id="input">
Maciej Kozieja
  • 1,812
  • 1
  • 13
  • 32
  • As I said regex is generated in runtime by enduser. So if you don't provide a way to automatically create the "typing regex" it doesn't help. – Sassan Feb 25 '17 at 21:44
  • You know that you can alwasy re compute it ? – Maciej Kozieja Feb 25 '17 at 21:58
  • I don't know how, what's the algorithm to generate/compute the typing regex based on the complete regex? it should work for every regex and should be done by computer. – Sassan Feb 25 '17 at 22:33
  • You can't generate a typing regex, there is no such thing. In fact, using 0 or more sequential quantifiers, especially if nested, could cause serious backtracking problems. –  Mar 03 '17 at 23:15
  • `a?b?c?d?` wouldn't be correct anyway. It would have to be `(?:a(?:b(?:c)?)?)?` –  Mar 04 '17 at 16:43
1

This is the hard solution for those who think there's no solution at all: implement the python version (https://bitbucket.org/mrabarnett/mrab-regex/src/4600a157989dc1671e4415ebe57aac53cfda2d8a/regex_3/regex/_regex.c?at=default&fileviewer=file-view-default) in js. So it is possible. If someone has simpler answer he'll win the bounty.

Example using python module (regular expression with back reference):

$ pip install regex
$ python
>>> import regex
>>> regex.Regex(r'^(\w+)\s+\1$').fullmatch('abcd ab',partial=True)
<regex.Match object; span=(0, 7), match='abcd ab', partial=True>
Sassan
  • 2,187
  • 2
  • 24
  • 43
  • There is no solution at all. Post some examples of this working and I'll post 10 more where it doesn't. –  Mar 04 '17 at 17:04
  • Ofc there is, and this is the complete solution: http://stackoverflow.com/a/42597270/1349278 – Sassan Mar 05 '17 at 05:30
  • `Ofc there is, and this is the complete solution:` - I just added a comment to @Aaron there with an example. It's not the solution at all. Sometimes you have to face reality, hope and wishing are not good partners to software engineering. –  Mar 05 '17 at 18:17
  • 2
    He mentioned the problem with back-reference himself.As I said `Ofc there is a solution` Java has this feature natively, python has a module with this feature.I just checked partial matching with backref in python and worked.Java and Python both are Turing Complete, Javascript is Turing Complete too so whatever is possible in Java and Python is possible in Javascript too. If you wanna prove the opposite you can simply provide an example in Java that fails.I'm not a software engineer, I'm a computer scientist and in science you should always face reality not `sometimes`. – Sassan Mar 06 '17 at 11:00
0

You guys would probably find this page of interest:

(https://github.com/desertnet/pcre)

It was a valiant effort: make a WebAssembly implementation that would support PCRE. I'm still playing with it, but I suspect it's not practical. The WebAssembly binary weighs in at ~300K; and if your JS terminates unexpectedly, you can end up not destroying the module, and consequently leaking significant memory.

The bottom line is: this is clearly something the ECMAscript people should be formalizing, and browser manufacturers should be furnishing (kudos to the WebAssembly developer into possibly shaming them to get on the stick...)

I recently tried using the "pattern" attribute of an input[type='text'] element. I, like so many others, found it to be a letdown that it would not validate until a form was submitted. So a person would be wasting their time typing (or pasting...) numerous characters and jumping on to other fields, only to find out after a form submit that they had entered that field wrong. Ideally, I wanted it to validate field input immediately, as the user types each key (or at the time of a paste...)

The trick to doing a partial regex match (until the ECMAscript people and browser makers get it together with PCRE...) is to not only specify a pattern regex, but associated template value(s) as a data attribute. If your field input is shorter than the pattern (or input.maxLength...), it can use them as a suffix for validation purposes. YES -this will not be practical for regexes with complex case outcomes; but for fixed-position template pattern matching -which is USUALLY what is needed- it's fine (if you happen to need something more complex, you can build on the methods shown in my code...)

The example is for a bitcoin address [ Do I have your attention now? -OK, not the people who don't believe in digital currency tech... ] The key JS function that gets this done is validatePattern. The input element in the HTML markup would be specified like this:

<input id="forward_address"
       name="forward_address"
       type="text"
       maxlength="90"
       pattern="^(bc(0([ac-hj-np-z02-9]{39}|[ac-hj-np-z02-9]{59})|1[ac-hj-np-z02-9]{8,87})|[13][a-km-zA-HJ-NP-Z1-9]{25,34})$"
       data-entry-templates="['bc099999999999999999999999999999999999999999999999999999999999','bc1999999999999999999999999999999999999999999999999999999999999999999999999999999999999999','19999999999999999999999999999999999']"
       onkeydown="return validatePattern(event)"
       onpaste="return validatePattern(event)"
       required
/>

[Credit goes to this post: "RegEx to match Bitcoin addresses? " Note to old-school bitcoin zealots who will decry the use of a zero in the regex here -it's just an example for accomplishing PRELIMINARY validation; the server accepting the address passed off by the browser can do an RPC call after a form post, to validate it much more rigorously. Adjust your regex to suit.]

The exact choice of characters in the data-entry-template was a bit arbitrary; but they had to be ones such that if the input being typed or pasted by the user is still incomplete in length, it will use them as an optimistic stand-in and the input so far will still be considered valid. In the example there, for the last of the data-entry-templates ('19999999999999999999999999999999999'), that was a "1" followed by 39 nines (seeing as how the regex spec "{25,39}" dictates that a maximum of 39 digits in the second character span/group...) Because there were two forms to expect -the "bc" prefix and the older "1"/"3" prefix- I furnished a few stand-in templates for the validator to try (if it passes just one of them, it validates...) In each template case, I furnished the longest possible pattern, so as to insure the most permissive possibility in terms of length.

If you were generating this markup on a dynamic web content server, an example with template variables (a la django...) would be:

 <input id="forward_address"
        name="forward_address"
        type="text"
        maxlength="{{MAX_BTC_ADDRESS_LENGTH}}"
        pattern="{{BTC_ADDRESS_REGEX}}" {# base58... #}
        data-entry-templates="{{BTC_ADDRESS_TEMPLATES}}" {# base58... #}
        onkeydown="return validatePattern(event)"
        onpaste="return validatePattern(event)"
        required
/>

[Keep in mind: I went to the deeper end of the pool here. You could just as well use this for simpler patterns of validation.]

And if you prefer to not use event attributes, but to transparently hook the function to the element's events at document load -knock yourself out.

You will note that we need to specify validatePattern on three events:

  • The keydown, to intercept delete and backspace keys.

  • The paste (the clipboard is pasted into the field's value, and if it works, it accepts it as valid; if not, the paste does not transpire...)

Of course, I also took into account when text is partially selected in the field, dictating that a key entry or pasted text will replace the selected text.

And here's a link to the [dependency-free] code that does the magic:

https://gitlab.com/osfda/validatepattern.js

(If it happens to generate interest, I'll integrate constructive and practical suggestions and give it a better readme...)

PS: The incremental-regex package posted above by Lucas Trzesniewski:

  • Appears not to have been updated? (I saw signs that it was undergoing modification??)

  • Is not browserified (tried doing that to it, to kick the tires on it -it was a module mess; welcome anyone else here to post a browserified version for testing. If it works, I'll integrate it with my input validation hooks and offer it as an alternative solution...) If you succeed in getting it browserfied, maybe sharing the exact steps that were needed would also edify everyone on this post. I tried using the esm package to fix version incompatibilities faced by browserify, but it was no go...

Steve B.
  • 11
  • 1
  • 4
-2

I strongly suspect (although I'm not 100% sure) that general case of this problem has no solution the same way as famous Turing's "Haltin problem" (see Undecidable problem). And even if there is a solution, it most probably will be not what users actually want and thus depending on your strictness will result in a bad-to-horrible UX.

Example:

Assume "target RegEx" is [a,b]*c[a,b]* also assume that you produced a reasonable at first glance "test RegEx" [a,b]*c?[a,b]* (obviously two c in the string is invalid, yeah?) and assume that the current user input is aabcbb but there is a typo because what the user actually wanted is aacbbb. There are many possible ways to fix this typo:

  • remove c and add it before first b - will work OK
  • remove first b and add after c - will work OK
  • add c before first b and then remove the old one - Oops, we prohibit this input as invalid and the user will go crazy because no normal human can understand such a logic.

Note also that your hitEnd will have the same problem here unless you prohibit user to enter characters in the middle of the input box that will be another way to make a horrible UI.

In the real life there would be many much more complicated examples that any of your smart heuristics will not be able to account for properly and thus will upset users.

So what to do? I think the only thing you can do and still get reasonable UX is the simplest thing you can do i.e. just analyze your "target RegEx" for set of allowed characters and make your "test RegEx" [set of allowed chars]*. And yes, if the "target RegEx" contains . wildcart, you will not be able to do any reasonable filtering at all.

SergGr
  • 23,570
  • 2
  • 30
  • 51
  • "aabcbb" is correct input for regex `[a,b]*c[a,b]*` and I do prohibit user to enter wrong characters in the middle of the input so `hitEnd` won't have any problems. In the real live I can handle all cases with `hitEnd` or with that python module I linked. I just don't know how to do so in JS. – Sassan Mar 03 '17 at 23:21
  • @Sassan, I think you didn't get my point. You are concentrating on single scenario "user enters input matching RegEx from scratch" and I say that there are also many scenarios "user has one valid or even partiallly valid string and tries to change it to another valid string". There are might be many reasons such as typos or copy-paste. Unless your UI actually somehow prohibits poitnting cursor to the middle of the input (which is horrible idea), your `hitEnd` approach will not work as normal user expects it to work. – SergGr Mar 03 '17 at 23:32
  • Trvial example: you are making an input for decimal numbers. Do you really think that prohibitting two `.` **_in the middle of editting_** is a good idea from the UX point of view? If so, create such a control and try to use it _yourself_ for some time and only then if you are still satisfied with it, implement your solution for other people to struggle. – SergGr Mar 03 '17 at 23:35
  • 3
    Lets not argue why I need this feature. My question is about how to do partial regex match on strings. Completely forget about ux and ui. I have my own reasons and it's beyond this discussion. The question is obvious and lets discuss the solutions for the question. – Sassan Mar 03 '17 at 23:49