Regex match only if word count between 1-50

Question

So I have this code:

(r'\[quote\](.+?)\[/quote\]')

What I want to do is change the regex so that it only matches if the text within [quote] [/quote] is between 1 and 50 words long.

Is there any easy way to do this?

Edit: Removed the confusing HTML code from the regex example. I am NOT trying to match HTML.

Before you go on with your code, please take a look at the most upvoted answer here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. – Hyperboreus Mar 03 '14 at 19:35
What would you do if there are more than 50 'words', leave them alone? – Mar 03 '14 at 19:56
What I really want is another regex for quotes more than 50 words. – Spindel Mar 03 '14 at 20:04
Then, `1-50 < n < 51 - infinity` roughly equals `r'(?s)\[quote\]((?:(?!\[/quote\]).)+)\[/quote\]'` – Mar 03 '14 at 21:24
@Spindel - Really? `(?s)` is Dot-All. I assume you didn't try to test it. – Mar 05 '14 at 18:40
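
Since the goal stated in these comments is to treat short and long quotes differently, one option is to match every quote with the pattern from the comment above and decide inside a replacement callback. This is only a sketch: the names `render_quote` and `convert` and the `<short>`/`<long>` markup are placeholders, not anything from the question, and a "word" is assumed to be any run of non-whitespace characters.

```python
import re

# Pattern from the comment above: (?s) lets '.' match newlines, and the
# negative lookahead stops the body at the first closing tag.
QUOTE_RE = re.compile(r'(?s)\[quote\]((?:(?!\[/quote\]).)+)\[/quote\]')

def render_quote(match, limit=50):
    """Pick a replacement based on the quote's word count."""
    body = match.group(1)
    n_words = len(body.split())  # split() with no args splits on whitespace runs
    if n_words == 0:
        return match.group(0)    # whitespace-only quote: leave it untouched
    if n_words <= limit:
        return "<short>" + body + "</short>"  # placeholder markup (assumption)
    return "<long>" + body + "</long>"        # placeholder markup (assumption)

def convert(text):
    # re.sub calls render_quote once per match and inserts its return value
    return QUOTE_RE.sub(render_quote, text)
```

With this shape a single pass handles both cases, so no second regex for the "more than 50 words" branch is needed.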

score 2 · Answer 1 · answered Mar 03 '14 at 19:39

Sure there is, depending on how you define a "word."

I would do the counting separately from the regex, but if you want to use regex alone, you could probably do:

r"\[quote\](.+?\s){1,49}\[/quote\]"

That will match between 2 and 50 words (since it demands a trailing \s, it can't match ONE).

Crud, that also won't match the LAST word, so let's do this instead:

r"\[quote\](.+?(?:\s.+?){1,49})\[/quote\]"
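
For comparison, here is a minimal sketch of the "separately from regex" route mentioned at the top of this answer. It assumes a word is any run of non-whitespace characters, which is what `str.split()` with no arguments gives you. The function name `quotes_within_limit`, its parameters, and the use of `re.DOTALL` are illustrative choices, not part of the question.

```python
import re

# Same pattern as in the question; DOTALL lets a quote span several lines.
QUOTE_RE = re.compile(r'\[quote\](.+?)\[/quote\]', re.DOTALL)

def quotes_within_limit(text, min_words=1, max_words=50):
    """Return the quote bodies whose word count lies in [min_words, max_words]."""
    results = []
    for match in QUOTE_RE.finditer(text):
        body = match.group(1)
        # Count whitespace-separated tokens instead of encoding the
        # limit in the regex itself.
        if min_words <= len(body.split()) <= max_words:
            results.append(body)
    return results
```

Keeping the limit out of the pattern also makes it easy to switch to a character count later: replace `len(body.split())` with `len(body)`.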

Adam Smith

+1. And good point talking about the definition of a word. E.g. how many words does "匈牙利共和国中国古称马扎儿" have? – Hyperboreus Mar 03 '14 at 19:41
That's a good question. Maybe counting characters is a better way to go? And if so, could you update your answer with example of character count instead of word count? – Spindel Mar 03 '14 at 19:47
`(.+?\s){1,49}` Even though lazy quantifier, this will match 5000 + 49 whitespace to get at the `[/quote]` if it has to. – Mar 03 '14 at 19:50
@sln I think you're mistaken. I don't have any quantifier after the `\s`, and my `{m,n}` is after the capture group. It must in fact be between 1 and 49 instances of `SOMETHING` followed by whitespace. – Adam Smith Mar 03 '14 at 19:55
Btw, performance wise, could it be faster to count characters instead of words? – Spindel Mar 03 '14 at 20:18
@adsmith - No, not mistaken. The dot `.` matches whitespace. If you have a string like this `[quote] < 40 million whitespace characters > [/quote]` your regex will match it. – Mar 03 '14 at 21:17
@adsmith - You could, however, exclude whitespace in lieu of the dot: `\[quote\](\S+(?:\s\S+){1,49})\[/quote\]`, but you have to work out edge conditions. – Mar 03 '14 at 21:30
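
The point of the last two comments can be checked directly: because `.` also matches whitespace, a single `.+?` can swallow many words, while `\S+` cannot. The sketch below is one possible way to work out the edge conditions. The `{0,49}` lower bound (so a single word matches), `\s+` between words, the optional `\s*` padding, and the lookahead that keeps a word from running over `[/quote]` are all assumptions added here, and `make_quote` is a hypothetical test helper.

```python
import re

# Pattern from the answer: '.' matches spaces, so each '.+?' may span several words.
DOT_RE = re.compile(r"\[quote\](.+?(?:\s.+?){1,49})\[/quote\]")

# A "word": one or more non-whitespace characters, never crossing a closing tag.
WORD = r"(?:(?!\[/quote\])\S)+"

# Whitespace-excluding variant: exactly 1 to 50 words between the tags.
WORD_RE = re.compile(
    r"\[quote\]\s*(" + WORD + r"(?:\s+" + WORD + r"){0,49})\s*\[/quote\]"
)

def make_quote(n_words):
    """Build a test string holding n_words copies of 'word' inside quote tags."""
    return "[quote]" + " ".join(["word"] * n_words) + "[/quote]"

print(bool(DOT_RE.search(make_quote(60))))   # True: the word limit is not enforced
print(bool(WORD_RE.search(make_quote(60))))  # False
print(bool(WORD_RE.search(make_quote(50))))  # True
print(bool(WORD_RE.search(make_quote(1))))   # True
```

Whether punctuation, BBcode nested inside the quote, or text without spaces should count as words is still a matter of definition, as noted earlier in the thread.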

score 1 · Answer 2 · answered Mar 03 '14 at 19:41

This is a definite misuse of regexes for a lot of reasons, not the least of which is the problem of matching [X]HTML, as @Hyperboreus noted, but if you really insist, you could do something along the lines of ([a-zA-Z0-9]+\s){1,49}.

For the record, I don't recommend this.

rpmartz

Maybe I should have removed the HTML part because it is what I am replacing the BBcode with, not part of the match itself. – Spindel Mar 03 '14 at 19:48
