Regex to check for more than 999 words

Question

I need a regex that will check the input of a textarea form and detect whether the form contains more than 999 words. This is language independent, i.e. I am using a form plugin that accepts regexes for validation.

Isn't your plugin executing your regex in a specific library/language? I don't see how this could be language independent. – Denys Séguret, Nov 23 '12 at 20:49
Let's see your current regex and we'll help you fix it. Or have you tried nothing yet? – Wesley Murch, Nov 23 '12 at 20:49
@Asad that should be an answer! except that it should be MORE THAN 999 words, not equal or more – Billy Moon, Nov 23 '12 at 20:51
@Asad `\w` usually matches just the ascii letters, so `"café"` wouldn't be matched. Besides, checking whether a form has more than 999 words, it is (IMO) safe to assume that there is punctuation in the input, which your suggestion does not account for. – Bart Kiers, Nov 23 '12 at 20:58
@Fred, what about punctuation (will it be present)? And it's not language independent: almost all regex flavors differ (slightly). So, what language are you, or the framework, using? – Bart Kiers, Nov 23 '12 at 21:01
I don't understand why this should have produced a minus to my reputation on the grounds that it was a nonconstructive question. There were a lot of constructive answers. I don't mean to be a whiner, but I don't use stackoverflow that often and I need every point I can get. – Fred Zimmerman, Oct 26 '15 at 16:31

score 1 · Accepted Answer · answered Nov 23 '12 at 21:03

All you need is to test a simple regex match against the input string. Use the pattern

(?:\b\w+(?:\W+|$)){1000}

If you need Unicode support, use the pattern

(?:\b[\w\p{L}]+(?:[^\w\p{L}]+|$)){1000}
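As a rough illustration of how the first pattern behaves, here is a minimal Python sketch. Python is only an assumption, since the question never says which engine the plugin uses, and the helper name `has_more_words_than` and its `limit` parameter are made up for the example. Python's built-in `re` module does not support `\p{L}`, but on `str` input its `\w` already matches Unicode letters, so the first pattern is enough there.

```python
import re

def has_more_words_than(text: str, limit: int) -> bool:
    """Return True if text holds more than `limit` runs of word characters."""
    # limit=999 yields the answer's pattern: (?:\b\w+(?:\W+|$)){1000}
    pattern = r"(?:\b\w+(?:\W+|$)){%d}" % (limit + 1)
    return re.search(pattern, text) is not None

print(has_more_words_than("one two three", 2))             # three words, limit two
print(has_more_words_than("one two three", 3))             # three words, limit three
print(has_more_words_than("un café, s'il vous plaît", 5))  # six runs of word characters
```

Note that an apostrophe splits a word under this definition, so `s'il` counts as two.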

answered Nov 23 '12 at 21:03

Ωmega

score 1 · Answer 2 · answered Nov 24 '12 at 01:16

I suspect everyone's making this more difficult than it needs to be. Do you really care if the "words" are words in the linguistic sense? Or will this do?

\S+(?:\s+\S+){999}

If so, and if your regex flavor supports possessive quantifiers, the actual regex I recommend is:

\S++(?:\s++\S++){999}

This will fail much more quickly when no match is possible. For example, when I try to match a string with exactly 999 words in RegexBuddy, the first regex takes 21,870 steps to fail, while the possessive version only takes 3,996 steps. If you don't have possessive quantifiers but you do have atomic groups, this one takes 4,008 steps to fail:

\S+(?>\s+\S+){999}

Performance is probably irrelevant, given that you're using the regex to validate user input. I brought it up because it would be very easy in these circumstances to create a regex that locks up your machine. And that usually happens in cases where there's no match to be found. When you test regexes, you should have at least as many non-matching tests as matching ones.
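To make the testing advice concrete, here is a small Python sketch that pairs matching inputs with an equal or larger set of non-matching ones. The language choice, the helper name `exceeds_word_limit`, and the small limits are assumptions for illustration; the sketch uses the plain (non-possessive) pattern because Python's `re` module only accepts possessive quantifiers and atomic groups from version 3.11 on.

```python
import re

def exceeds_word_limit(text: str, limit: int) -> bool:
    """Return True if text has more than `limit` whitespace-separated chunks."""
    # limit=999 yields the pattern from this answer: \S+(?:\s+\S+){999}
    pattern = r"\S+(?:\s+\S+){%d}" % limit
    return re.search(pattern, text) is not None

# Matching cases: strictly more chunks than the limit.
matching = [("a b c", 2), ("  a\tb\nc  ", 2), ("x y", 1)]
# Non-matching cases: at or below the limit, including blank input.
non_matching = [("a b", 2), ("a  b", 2), ("", 1), ("   ", 1)]

for text, limit in matching:
    assert exceeds_word_limit(text, limit)
for text, limit in non_matching:
    assert not exceeds_word_limit(text, limit)
```

Small limits keep the failing cases cheap; with the real limit, the non-matching inputs are where the step counts quoted above become noticeable.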

score 0 · Answer 3 · answered Nov 23 '12 at 20:57

Because @Asad seems to be shy about posting an answer:

(\b\w+\b\s+){1000,}

It matches a word boundary (\b), followed by one or more word characters (\w+), followed by another word boundary and one or more whitespace characters (\b\s+; whitespace also includes tabs etc.), with the whole group repeated at least 1000 times ((...){1000,}).

answered Nov 23 '12 at 20:57

Billy Moon

Highly unlikely to be the correct answer. Your *words* won't match `"café"`, and will not account for input consisting of more than 999 words with a single punctuation mark in the middle. Your proposed regex also will not match if the input doesn't end with a `\s`. – Bart Kiers Nov 23 '12 at 21:03
@Asad indeed - here is a useful post all about it: http://stackoverflow.com/questions/7292395/how-to-match-accented-characters-with-a-regex - it is hard to answer definitively when the question does not include a full scope of what they are looking for... – Billy Moon Nov 23 '12 at 21:03
@BartKiers You could just add those to a character class with `\w` characters, plus turn the `+` into a `*` – Asad Saeeduddin Nov 23 '12 at 21:05
@Billy, in that case, I wouldn't post an answer. Just ask the OP for clarification in the comments, and when the question is clear enough, post your answer. – Bart Kiers Nov 23 '12 at 21:05
Usually when I want to count words, I just count the number of sequential groups of spaces (or newlines) and add one using `/[\s\r\n]+/`. But I liked the approach of @Asad, and expect it could lead to a better solution. – Billy Moon Nov 23 '12 at 21:05
@BillyMoon, yes, counting non-word-patterns (like `\s+`), or splitting the input on them and checking the length of the array is a (IMHO) better way! – Bart Kiers Nov 23 '12 at 21:07
@BartKiers I think you should post an answer, and try to provoke discussion, I think you have a good idea. I hope to see an interesting solution put forward. – Billy Moon Nov 23 '12 at 21:10
@BillyMoon, perhaps I will after the OP clarifies himself :) – Bart Kiers Nov 23 '12 at 21:12
@BillyMoon: Just FYI, `\s` already matches carriage returns (`\r`) and linefeeds (`\n`), so `[\s\r\n]` is redundant. – Alan Moore Nov 24 '12 at 01:21
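The split-and-count idea from these comments can be sketched in a few lines of Python, assuming the check can run in code rather than inside a regex-only validator (which the question does not guarantee). The function names `count_words` and `too_long` are hypothetical.

```python
import re

def count_words(text: str) -> int:
    """Count whitespace-separated chunks by splitting on runs of whitespace."""
    stripped = text.strip()
    if not stripped:
        return 0
    # \s already covers \r and \n, so no extra character class is needed.
    return len(re.split(r"\s+", stripped))

def too_long(text: str, limit: int = 999) -> bool:
    """Return True if text has more than `limit` chunks."""
    return count_words(text) > limit

print(count_words("one two\r\nthree"))
print(too_long(" ".join(["word"] * 1000)))
```

Stripping first avoids the empty strings that a split would otherwise produce for leading or trailing whitespace.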

score 0 · Answer 4 · answered Nov 23 '12 at 21:20

Use a look ahead:

^(?=(.*\b\w+\b){1000,})

Note that this is an Anglo-centric solution. For other languages, the \w would need to be replaced with a "not punctuation or spaces" regex or similar. Also, this doesn't cater for apostrophes in words.

answered Nov 23 '12 at 21:20

Bohemian

What does a lookahead do for you in this case that a straight match can't? Anyway, you should change that `.*` to `\W*`. I used RegexBuddy to test your regex on a 1,000 word string (which should have matched), and it bailed out after a million steps. After I changed the `.` to `\W`, it reported a successful match in 5,007 steps. – Alan Moore Nov 24 '12 at 01:42
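For reference, the `\W*` variant suggested in this comment might look as follows in Python. This is a sketch under assumptions: the helper `lookahead_check` and its `minimum` parameter are invented for the example, the group is made non-capturing, and small minimums stand in for 1000 so the failing case stays cheap.

```python
import re

def lookahead_check(text: str, minimum: int) -> bool:
    """Return True if text starts a run of at least `minimum` word tokens."""
    # minimum=1000 gives ^(?=(?:\W*\b\w+\b){1000,})
    pattern = r"^(?=(?:\W*\b\w+\b){%d,})" % minimum
    return re.match(pattern, text) is not None

print(lookahead_check("one, two; three", 3))
print(lookahead_check("one, two; three", 4))
```

Because `\W*` can only consume non-word characters, each repetition has exactly one way to reach the next word, which is what removes the explosive backtracking of `.*`.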

score 0 · Answer 5 · answered Nov 24 '12 at 00:02

Here is an expression that counts the number of blocks of non-whitespace characters.

^(?>\s*\S+){1000,}\s*$

This isn't a perfect solution since it counts 2 words in the following string "Ambassador T'Pel", when in reality there are 3 words. But it keeps the regex very simple and it may be good enough for your requirements.

This regex is also very fast, as it keeps backtracking to a minimum.

answered Nov 24 '12 at 00:02

Francis Gagnon

`word1,word2,word3,word4,word5` is one word for your solution – Ωmega Nov 24 '12 at 00:24
@Ωmega - Like I said, not perfect. But your string is unusual in a piece of standard English text. Most commas have a space after them. – Francis Gagnon Nov 24 '12 at 00:28
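The difference between the two notions of a "word" discussed here is easy to see by counting both kinds of token on the strings mentioned in this thread. The Python sketch below is illustrative only; `count_tokens` is a hypothetical helper, and it uses `re.findall` rather than the atomic-group pattern from the answer.

```python
import re

def count_tokens(text: str) -> tuple[int, int]:
    """Return (whitespace-delimited chunks, runs of word characters)."""
    chunks = re.findall(r"\S+", text)  # what the \S+ based patterns count
    words = re.findall(r"\w+", text)   # what the \w+ based patterns count
    return len(chunks), len(words)

print(count_tokens("Ambassador T'Pel"))               # (2, 3)
print(count_tokens("word1,word2,word3,word4,word5"))  # (1, 5)
```

Which count is "right" depends on the requirements, which the question leaves open.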
