I can use \s?(\w+\s){0,2}\w* for "up to three words" and \w{0,20} for "no more than twenty characters", but how can I combine the two? Trying to merge them via a lookahead as mentioned here seems to fail.

Some examples for clarification:

The early bird catches the worm.

should match any three words in sequence (including the worm*).

Here we have a supercalifragilisticexpialidocious sentence.

"a supercalifragilisticexpialidocious sentence" is too long a sequence and therefore should not match.


* In my actual use case I'm after a paragraph's last three words (i.e. a (?:\r) would be at the end of the RegEx and the match would be "catches the worm."). Matches are then given a "no linebreaks" character style in Adobe InDesign in order to avoid orphans.

Tobias Kienzler
  • 2
    Are you using a language here? This problem would be much more tractable IMO if you were using something like Java. Regex isn't the answer for everything. – Tim Biegeleisen May 18 '16 at 14:33
  • 2
    Well, try [`(?!(?:\s*\w){21})\b\w+(?:\s+\w+){0,2}\b`](https://regex101.com/r/yR0qK1/1), it will require matching at least 1 word – Wiktor Stribiżew May 18 '16 at 14:33
  • 1
    @WiktorStribiżew 6695 steps, so this is not very performant – fdfey May 18 '16 at 14:40
  • 2
    @fdfey: Ok, move the lookahead after `\b`: [`\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}\b`](https://regex101.com/r/yR0qK1/2). The idea is pretty much the same. – Wiktor Stribiżew May 18 '16 at 14:42
  • @Tobias : Do you want a maximum of 20 characters for the 3 words combined? Or do you want max 20 chars per word? – LukStorms May 18 '16 at 15:26
  • @TimBiegeleisen The "language" is a GREP-style for Adobe InDesign which is an option to apply a character formatting to anything matching a given RegEx. I definitely agree with your assessment though [:)](https://stackoverflow.com/a/1732454/321973) – Tobias Kienzler May 19 '16 at 06:00
  • @JanDvorak hint: I mentioned that in my question ;) but thanks anyway, that was already one difficult thing to figure out – Tobias Kienzler May 19 '16 at 06:03
  • @WiktorStribiżew Seems good, you should post this as answer so I can accept it – Tobias Kienzler May 19 '16 at 06:04
  • @LukStorms No more than 20 characters for their combination - I guess up to twenty characters per word would be easier? Something like `(?:\s\w{1,20}){3}` I think – Tobias Kienzler May 19 '16 at 06:05
  • @TobiasKienzler Wiktor doesn't need the points, but [I do](http://stackoverflow.com/users/1863229/tim-biegeleisen). – Tim Biegeleisen May 19 '16 at 06:06
  • Oh, wait, you wanted the longest substring starting at a given point that satisfies both conditions? I'm afraid I can't help you here. Technically the language is regular, but translating the DFA to regex won't result in anything pretty. – John Dvorak May 19 '16 at 06:07
  • 2
    May I extract a PPCG challenge out of your question? With a bit of luck, an answer will pop up that is also applicable as an answer to your question. – John Dvorak May 19 '16 at 06:10
  • @TimBiegeleisen :D The points are a nice to have, but the [point](https://xkcd.com/559/ "pun intended...") is that comments should be used for clarification and (even partial) answers should be possible to accept and separately discussed. Otherwise we end up with many apparently [unanswered](https://stackoverflow.com/unanswered) questions that turn out to have the answer hidden in a comment – Tobias Kienzler May 19 '16 at 06:11
  • @JanDvorak Very good point, I always wanted to post one there :) I'll add a link here afterwards. – Tobias Kienzler May 19 '16 at 06:11
  • 1
    Guys, it takes time to write an answer. I am on the bus and my mobile battery is almost flat. I will try to copy-paste my comment into an answer and will update it once I'm at my desktop. – Wiktor Stribiżew May 19 '16 at 06:20
  • @WiktorStribiżew No rush, don't worry :) Thanks for your two versions so far, they look promising. As Jan suggested I'll post a PPCG challenge while we're at it :D – Tobias Kienzler May 19 '16 at 06:21
  • @JanDvorak Alright, I posted a challenge: https://codegolf.stackexchange.com/q/80179/2775 – Tobias Kienzler May 19 '16 at 08:44

1 Answer

To match up to 3 words separated by whitespace at the end of a line or string, with no more than 20 word characters in total, you can use

\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])

See the regex demo. Note that in the demo I use [^\S\r\n] instead of \s in the lookahead because the text contains newlines; use the same trick if your input spans multiple lines.

Regex explanation

  • \b - a word boundary
  • (?!(?:\s*\w){21}) - a negative lookahead that fails the match if the initial word boundary is followed by 21 word characters, each optionally preceded by any number of whitespace characters
  • \w+ - 1 word (consisting of 1 or more word characters)
  • (?:\s+\w+){0,2} - zero, one or two sequences of 1+ whitespace characters followed by 1+ word characters
  • (?=$|[\r\n]) - a positive lookahead that only allows a match to be returned if it is followed by the end of the string ($) or the end of a line ([\r\n]).
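As a rough check of the pattern above, here is a minimal sketch using Python's `re` module. Python is an assumption here; the InDesign GREP flavor may behave differently in details. The helper name `last_words` is hypothetical. The sample sentences omit the trailing period because this version of the pattern requires the last word to sit directly at the end of the line.

```python
import re

# Pattern from the answer: up to three words at the end of a line,
# with no more than 20 word characters in total.
TAIL = re.compile(r"\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])")


def last_words(line):
    """Return the matched tail of `line`, or None if nothing matches."""
    m = TAIL.search(line)
    return m.group(0) if m else None


# 14 word characters in the last three words: all three are matched.
print(last_words("The early bird catches the worm"))  # catches the worm

# The 34-letter word pushes longer tails past the limit,
# so only the final word is matched.
print(last_words("Here we have a supercalifragilisticexpialidocious sentence"))  # sentence
```

With a trailing period the pattern finds nothing, since `.` is neither a word character nor a line end; the variant with `(?=[^\w\r\n]*$)` from the comments handles that case.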

Now, if your words should only contain letters, use [a-zA-Z] (or the equivalent for your language) in place of \w. If the regex flavor allows it, use the \p{L} Unicode property class.
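To illustrate the letters-only variant, the sketch below swaps `\w` for `[a-zA-Z]` and compares the two patterns on a made-up sample string. It assumes Python's `re` module, which does not support `\p{L}`, so the ASCII class is used; the names `WORD_CHARS` and `LETTERS` are hypothetical.

```python
import re

# Original pattern: digits and underscores count as word characters.
WORD_CHARS = re.compile(r"\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])")

# Letters-only variant: \w replaced by [a-zA-Z] throughout.
LETTERS = re.compile(
    r"\b(?!(?:\s*[a-zA-Z]){21})[a-zA-Z]+(?:\s+[a-zA-Z]+){0,2}(?=$|[\r\n])"
)

sample = "room 101 is open"

print(WORD_CHARS.search(sample).group(0))  # 101 is open
print(LETTERS.search(sample).group(0))     # is open
```

Note that `\b` is still defined in terms of `\w`, so the letters-only variant cannot start a match in the middle of a token such as `101`.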

Wiktor Stribiżew
  • It took me a bit longer though. Please have a look and let me know if that is OK. – Wiktor Stribiżew May 19 '16 at 06:49
  • Thanks, looks good. I started a slightly stricter challenge at https://codegolf.stackexchange.com/q/80179/2775, please feel free to join in :) – Tobias Kienzler May 19 '16 at 08:45
  • 1
    [`\b(?!([^\w\r\n]*\w){21})\w+(?:[^\w\r\n]+\w+){0,2}(?=[^\w\r\n]*$)`](https://regex101.com/r/xZ3fW0/1) :) – Wiktor Stribiżew May 19 '16 at 08:53
  • Put it in parentheses to get groups and you've got an answer scoring 66*2415 = 159390 :) – Tobias Kienzler May 19 '16 at 09:00
  • I do not know code golf, and no idea what that score means :) – Wiktor Stribiżew May 19 '16 at 09:00
  • Code golf is about scoring as low as possible. Usually shortest code, but in this challenge I multiply by the amount of steps in order to punish bad performance. – Tobias Kienzler May 19 '16 at 09:02
  • 1
    Oh :) With the capturing group around the whole pattern version (and with possessive quantifiers): [`\b(?!([^\w\r\n]*+\w){21})(\w++(?:[^\w\r\n]++\w++){0,2}+)(?=[^\w\r\n]*+$)`](https://regex101.com/r/xZ3fW0/2). – Wiktor Stribiżew May 19 '16 at 09:11
  • [74x2082 = 154068](https://regex101.com/r/jG5lO2/1) if you make the first group non-capturing – Tobias Kienzler May 19 '16 at 09:17
  • Sure, it must be non-capturing. You have not specified what regex flavor it is for. My above suggestion is for PCRE. – Wiktor Stribiżew May 19 '16 at 09:18
  • PCRE is fine - I didn't consider there would be too many differences, so I forgot to specify... Please submit it as an answer to [the challenge](https://codegolf.stackexchange.com/q/80179/2775) :) – Tobias Kienzler May 19 '16 at 09:23
  • I [tweaked](https://regex101.com/r/rM2pL7/2) your suggestion a bit, though I'm not sure it still always works: [`\b(?!.{21})(\w+(?:\W+\w+){0,2})(?=\W*$)`](https://regex101.com/r/rM2pL7/3) – Tobias Kienzler May 19 '16 at 11:08
  • 1
    Yes, because you are checking if the whole substring after the initial word boundary is not longer than 20 symbols (with `(?!.{21})`), not just the number of word characters. That is a simplified version for independent strings. – Wiktor Stribiżew May 19 '16 at 11:16
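The difference between the two lookaheads can be seen in a small sketch, again assuming Python's `re` module rather than PCRE or InDesign. `STRICT` is the pattern from the earlier comment, with a non-capturing lookahead group and a capturing group around the match; `SIMPLE` is the tweaked version. Both names and the sample string are made up for illustration.

```python
import re

# Counts only word characters towards the limit of 20.
STRICT = re.compile(
    r"\b(?!(?:[^\w\r\n]*\w){21})(\w+(?:[^\w\r\n]+\w+){0,2})(?=[^\w\r\n]*$)"
)

# Counts every character (spaces and punctuation included) up to the line end.
SIMPLE = re.compile(r"\b(?!.{21})(\w+(?:\W+\w+){0,2})(?=\W*$)")

sample = "seventeen eighteen ab"  # 19 word characters, 21 characters in total

print(STRICT.search(sample).group(1))  # seventeen eighteen ab
print(SIMPLE.search(sample).group(1))  # eighteen ab

# When the tail is short either way, both patterns agree.
short = "The early bird catches the worm."
print(STRICT.search(short).group(1))   # catches the worm
print(SIMPLE.search(short).group(1))   # catches the worm
```

So the simplified pattern is slightly stricter: spaces and trailing punctuation also eat into the budget of 20.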