Regex for finding at most n consecutive patterns

Question

Lets say our pattern is a regex for capital letters (but we could have a more complex pattern than searching for capitals)

To find at least n consecutive patterns (in this case, the pattern we are looking for is simply a capital letter), we can do this:

(Using Ruby)

somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf"

at_least_2_capitals = somestring.scan(/[A-Z][A-Z]+/)
=> ["ABC", "XYZ"]
at_least_3_capitals = somestring.scan(/[A-Z]{3}[A-Z]*/)
=> ["ABC", "XYZ"]

However, how do I search for at most n consecutive patterns, for example, at most one consecutive capital letter:

matches = somestring.scan(/ ??? /)
=> [" deFgHij kLmN pQrS ", " abcdEf"]

Detailed strategy

I read that I need to negate the "at least" regex, by turning it into a DFA, negating the accept states, (then converting it back to NFA, though we can leave it as it is) so to write it as a regex. If we think of encountering our pattern as receiving a '1' and not receiving the pattern as receiving a '0', we can draw a simple DFA diagram (where n=1, we want at most one of our pattern):

DFA_to_be_regexed

Specifically, I was wondering how this becomes a regex. Generally, I hope to find how to find "at most" with regex, as my regex skills feel stunted with "at least" alone.

Trip Hazards - not quite the right solution in spirit

Note that this question is not a dupicate of this post, as using the accepted methodology there would give:

somestring.scan(/[A-Z]{2}[A-Z]*(.*)[A-Z]{2}[A-Z]*/)
=> [[" deFgHij kLmN pQrS X"]]

Which is not what the DFA shows, not just because it misses the second sought match - more importantly that it includes the 'X', which it should not, as 'X' is followed by another capital, and from the DFA we see that a capital which is followed by another capital is not an accept state.

You could suggest

somestring.split(/[A-Z]{2}[A-Z]*/)
=> ["", " deFgHij kLmN pQrS ", " abcdEf"]

(Thanks to Rubber Duck)

but I still want to know how to find at most n occurrences using regex alone. (For knowledge!)

For lack of a concise explanation, I neglected to try to explain that epsilon (the end of the string) counts as a "0" in the diagram, meaning things which end in a single capital are accepted — xxjjnn, Jun 20 '13 at 08:45

score 2 · Accepted Answer · edited May 23 '17 at 11:46

Why your attempt does not work

There are a few problems with your current attempt.

The reason that X is part of the match is that .* is greedy and consumes as much as possible - hence, leaving only the required two capital letters to be matched by the trailing bit. This could be fixed with a non-greedy quantifier.
The reason why you don't get the second match is twofold. First, you require two trailing capital letters to be there, but instead there is the end of the string. Second, matches cannot overlap. The first match includes at least two trailing capital letters, but the second would need to match these again at the start which is not possible.
There are more hidden problems: try an input with four consecutive capital letters - it can give you an empty match (provided you use the non-greedy quantifier - the greedy one has even worse problems).

Fixing all of these with the current approach is hard (I tried and failed - check the edit history of this post if you want to see my attempt until I decided to scrap this approach altogether). So let's try something else!

Looking for another solution

What is it that we want to match? Disregarding the edge cases, where the match starts at the beginning of the string or ends at the end of the string, we want to match:

(non-caps) 1 cap (non-caps) 1 cap (non-caps) ....

This is ideal for Jeffrey Friedl's unrolling-the-loop. Which looks like

[^A-Z]+(?:[A-Z][^A-Z]+)*

Now what about the edge cases? We can phrase them like this:

We want to allow a single capital letter at the beginning of the match, only if it's at the beginning of the string.
We want to allow a single capital letter at the end of the match, only if it's at the end of the string.

To add these to our pattern, we simply group a capital letter with the appropriate anchor and mark both together as optional:

(?:^[A-Z])?[^A-Z]+(?:[A-Z][^A-Z]+)*(?:[A-Z]$)?

Now it's really working. And even better, we don't need capturing any more!

Generalizing the solution

This solution is easily generalized to the case of "at most n consecutive capital letters", by changing each [A-Z] to [A-Z]{1,n} and thereby allowing up to n capital letters where there is only one allowed so far.

See the demo for n = 2.

score 2 · Answer 2 · edited Jun 20 '20 at 09:12

tl;dr

To match words containing at most N PATTERNs, use the regex

/\b(?:\w(?:(?<!PATTERN)|(?!(?:PATTERN){N,})))+\b/

For example, to match words containing at most 1 capital letter,

/\b(?:\w(?:(?<![A-Z])|(?!(?:[A-Z]){1,})))+\b/

This works for multi-character patterns too.

Clarification Needed

I'm afraid your examples may cause confusion. Let's add a few words:

somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf mixedCaps mixeDCaps mIxedCaps mIxeDCaps T TT t tt"
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Now, rerunning your at-least-2-capitals regex returns

at_least_2_capitals = somestring.scan(/[A-Z][A-Z]+/)
=> ["ABC", "XYZ", "DC", "DC", "TT"]

Note how complete words are not captured! Are you sure this is what you wanted? I ask, of course, because in your latter examples, your at-most-1-capital regex returns complete words, instead of just the capital letters being captured.

Solution

Here's the solution either way.

First, for matching just patterns (and not entire words, as consistent with your initial examples), here's a regex for at-most-N-PATTERNs:

/(?<!PATTERN)(?!(?:PATTERN){N+1,})(?:PATTERN)+/

For example, the at-most-1-capitals regex would be

/(?<![A-Z])(?!(?:[A-Z]){2,})(?:[A-Z])+/

and returns

=> ["F", "H", "L", "N", "Q", "S", "E", "C", "DC", "I", "C", "I", "DC", "T", "TT"]

To further exemplify, the at-most-2-capitals regex returns

=>

Finally, if you wanted to match entire words that contained at most a certain number of consecutive patterns, then here's a whole different approach:

/\b(?:\w(?:(?<![A-Z])|(?![A-Z]{1,})))+\b/

This returns

["deFgHij", "kLmN", "pQrS", "abcdEf", "mixedCaps", "mIxedCaps", "T", "t", "tt"]

The general form is

/\b(?:\w(?:(?<!PATTERN)|(?!(?:PATTERN){N,})))+\b/

You can see all these examples at http://ideone.com/hImmZr.

Sorry its not clear! I meant, how to get => [" deFgHij kLmN pQrS ", " abcdEf mixedCaps mixe", "aps mIxedCaps mIxe", "aps T ", " t tt"] — xxjjnn, Jun 20 '13 at 08:31

Massa · Answer 3 · 2013-06-19T11:13:32.313

to find "at most" with a regex, you use the suffix {1,n} (possibly preceded by a negative lookbehind and followed by a positive lookahead), so it seems that what you want is:

irb(main):006:0> somestring.scan(/[A-Z]{1,2}/)
=> ["AB", "C", "F", "H", "L", "N", "Q", "S", "XY", "Z", "E"]

or

irb(main):007:0> somestring.scan(/(?<![A-Z])[A-Z]{1,2}(?![A-Z])/)
=> ["F", "H", "L", "N", "Q", "S", "E"]

EDIT: if the OP still wants "the longest strings not including more than two uppercase letters", it can use:

irb(main):025:0> somestring.scan(/[^A-Z]+(?:[A-Z]{1,2}[^A-Z]+)*/)                                                                                    
=> [" deFgHij kLmN pQrS ", " abcdEf"]

(but that regex possibly won't match in the beginning and the end of the string)

It seems that

irb(main):026:0> somestring.split(/[A-Z]{3,}/)                                                                                                       
=> ["", " deFgHij kLmN pQrS ", " abcdEf"]

would be better for that.

Have you looked at the OPs example input? This is not what he is looking for. He wants to find the longest substrings that do not contain more than `n` consecutive matches of some subpattern. — Martin Ender, Jun 19 '13 at 10:28
I figured that referred to his example input/output as well (as opposed to using split and matching something else). — Martin Ender, Jun 19 '13 at 11:33