Is there a way to validate that a regular expression will consume, at most, one character?

Question

This is a particularly hard thing to Google, because most questions are about how one writes a regular expression to match a single character, which is not my question.

My question is: if I have a JavaScript / TypeScript API, that allows a user to supply any given regular expression, but their regular expression should match only 0-1 characters, how would I throw an error if the regular expression a user wrote can match more than one character?

For example:

/[a-z]/        // valid
/[a-z][A-Z]/   // invalid
/[a-z]{1}/     // valid
/[a-z]{2}/     // invalid
/[a-z]*/       // invalid
/[a-z]+/       // invalid

...etc

It feels like it could get tedious to think of all the ways someone could specify a regex to match multiple characters. Any thoughts on how this could be accomplished?

My best guess is that you have to try to parse the regular expression syntax yourself. (trigger happy closerS: I don't think the close reason is correct, this has a very clear problem statement.) — Evert, Jul 20 '23 at 19:14
[regexp-tree](https://www.npmjs.com/package/regexp-tree#ast-nodes-specification) library can parse a regular expression string into an AST (tree representation). You can check it [online](https://astexplorer.net/#/gist/4ea2b52f0e546af6fb14f9b2f5671c1c/39b55944da3e5782396ffa1fea3ba68d126cd394). but still there are many ways to match 2 or more chars. — IT goldman, Jul 20 '23 at 21:28
If you don't have to validate the regex beforehand, you can try matching it first and then check if there's a match containing more than one character. — InSync, Jul 21 '23 at 07:30
Are at liberty to change API, so that use supply you a character ranges like `a-zA-Z` and your API will convert it to regex `/[a-zA-Z]/`? And if not, do you expect to accept something more complex than `[..something inside ..]`? — markalex, Jul 22 '23 at 08:11
@jcalz I feel like it isn’t technically impossible (unless it truly is?) I know the Chevrotain parser parses regex into its parsing API for convenience and to make decision trees, but maybe there are unknown limits there. — Matthew Dean, Jul 23 '23 at 17:35
I think this is a really interesting question - but I wouldn't be surprised if it's a mathematically undecidable problem... — Gershom Maes, Jul 24 '23 at 20:21
If the pattern should match only a single character, why are you using regex at all? Just accept a list of characters (or ranges) instead. Or is this intended to allow lookaround? — Bergi, Jul 25 '23 at 09:54

jcalz · Answer 1 · 2023-07-28T01:29:32.423

It is not possible to write a function f() that takes an arbitrary user-supplied JS regular expression and accurately decides whether or not the regular expression could ever match a string consisting of more than one character. Any function you write will either sometimes return an incorrect result, or you will need to allow for the function to return an "I don't know" result.

There are what amounts to formal proofs of this, but I won't try to present them here. Instead I'll just point to On Lookaheads in Regular Expressions with Backreferences by Nariyoshi Chida and Tachio Terauchi, which shows that the emptiness problem for the sort of regular expressions that exist JavaScript (which include backreferences and lookahead and lookbehind assertions) is undecidable. That means it is not possible to write a function that will always correctly decide whether an input JS regular expression has any matches at all.

And if there were a magical function f() to answer the question for length two or more, then you could use it to build an answer to the emptiness problem, by testing the empty string and every string of length one (this is tedious but theoretically possible), and combine the results of that with the magical function to get the full solution to the emptiness problem. Since the emptiness problem is undecidable, so is the problem you've described.

So no, it cannot be done for arbitrary JavaScript regular expressions.

Assuming that's too abstract, imagine the user supplies a specific (horrible) regular expression r, and let's investigate if we can write a function f() that can reliably throw an error if and only if r.test(s) === false for all s where s.length > 1. Here's the monster:

const r = /^x(?!x*(?<!^x(?:x?|\1+(xx+)))(?!(?:x?|(xx+?)\2+)$))($|xx(xx)+)$/

I claim that r will match a string s if and only if s satisfies all of these criteria:

it consists only of the letter "x". That is, /^x*$/.test(s) === true, and
its length is an odd number not equal to three. That is, s.length % 2 == 1 && s.length !== 3, and
its length cannot be written as p+q+1 where p and q are prime numbers. That is, assuming you have a function primes(n) that returns an array of all prime numbers less than n, then primes(s.length).every(p => primes(s.length-p).every(q => s.length !== p+q+1))

I built r using the regular expression mentioned in How to determine if a number is a prime with regex? along with lookaheads and lookbehinds. Roughly, it says that there is no point in the string where the number of characters before it is one-plus-a-prime (using look-behind) and where the number of characters after it is a prime (using look-ahead).

I don't know if that convinces you that my claim about what r does is correct but you can test it if you want. Let's assume for the moment that it is. That means it accepts the input "x", since its length is 1, and 1 is not the sum of two prime numbers:

console.log(r.test("x")); // true

So far this doesn't invalidate r because it's okay if it accepts a one-character string like "x".

But: is there a string of two or more "x" characters that it would accept? Should f(r) throw an error? Well, that would require we find an odd number larger than three which is not the sum of two primes. Which means we need to find an even number larger than two which is not the sum of two primes.

In other words: f(r) should not throw an error if and only if every even number greater than two is equal to the sum of two prime numbers. But that's the same as Goldbach's conjecture, a famous unsolved math problem. Mathematicians have been trying for hundreds of years to determine if that is true or false, and we haven't figured it out yet, as of 2023. We think that it is true, and we know that if there is a counterexample it's very large, but it has not been proven.

That means the function f() would need to be able to prove or disprove Goldbach's conjecture in order to work properly. That by itself doesn't mean it's impossible, but it does mean that nobody currently knows how to do it.

Even if my claim about r's behavior is incorrect, or if you want to get technical and say that Goldbach's conjecture has been confirmed for all numbers which could possibly be JS string lengths, this should still give you serious pause, since it hopefully demonstrates that one can come up with JS regular expressions where it is not at all clear which strings it might accept.

So, there you go. For arbitrary JS regular expression inputs, it's impossible, and even if it were possible, it would be very difficult.

If you want to restrict the possible inputs to just a subset of the features of JS regular expressions, say by prohibiting backreferences and lookarounds, then the answer would probably change. The emptiness problem for regular languages is decidable, and you could probably use that result to write an algorithm that works for strings of length two or more. But that would be a different question and is out of scope for the question as asked.

Finally, let's step way back and look at what you're trying to do. It is almost certainly more trouble than it's worth to allow a user to supply arbitrary JS regular expressions, if you need to do any sort of validation of them at all.

Instead, you should consider accepting some simpler data structure that cannot be misused (either intentionally or unintentionally). Depending on your use case, you might switch to just a string that holds all the characters you want to accept, or a set of enums corresponding to common character ranges, etc.

Regular expressions are notoriously tricky to work with, as evidenced by the famous aphorism:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

If you switch away from regular expressions, you'll cut your number of problems in half.

Playground link to code

Chris Barr · Answer 2 · 2023-07-24T21:07:34.387

0

Do you know roughly what the data the regex will test against?

If so you could provide a multi-character test string and if it allows for that then you know it won't fit your criteria

[
  /[a-z]/,
  /[a-z][A-Z]/,
  /[a-z]{1}/,
  /[a-z]{2}/,
  /[a-z]*/,
  /[a-z]+/
]
.forEach(p => {
  const m = 'aa'.match(p);
  console.log(p, m !== null && m[0].length === 1);
});

edited Jul 24 '23 at 21:07

answered Jul 24 '23 at 21:01

Chris Barr

29,851
23
95
135

Is there a way to validate that a regular expression will consume, at most, one character?

2 Answers2