So I have this code:

(r'\[quote\](.+?)\[/quote\]')

What I want to do is change the regex so that it only matches if the text between [quote] and [/quote] contains between 1 and 50 words.

Is there any easy way to do this?

Edit: Removed confusing html code in the regex example. I am NOT trying to match HTML.

Spindel

2 Answers

Sure there is, depending on how you define a "word."

I would do so separately from regex, but if you want to use regex, you could probably do:

r"\[quote\](.+?\s){1,49}\[/quote\]"

That will match between 2 and 50 words (since it demands a trailing \s, it can't match ONE)

Crud, that also won't match the LAST word, so let's do this instead:

r"\[quote\](.+?(?:\s.+?){1,49})\[/quote\]"
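
A minimal sketch of the "separately from regex" route mentioned at the top of this answer, assuming a word is any run of non-whitespace characters (the definition `str.split()` uses). The helper name `quotes_within_limit` and its `min_words`/`max_words` parameters are made up for illustration, and `re.DOTALL` is an added assumption so that a quote may span several lines. The regex only locates the `[quote]` blocks; plain Python does the counting:

```python
import re

# Same pattern as in the question; re.DOTALL (an assumption) lets "." cross newlines.
QUOTE_RE = re.compile(r"\[quote\](.+?)\[/quote\]", re.DOTALL)

def quotes_within_limit(text, min_words=1, max_words=50):
    """Return the bodies of [quote] blocks that contain min_words..max_words words."""
    kept = []
    for match in QUOTE_RE.finditer(text):
        body = match.group(1)
        # str.split() with no arguments splits on any run of whitespace
        word_count = len(body.split())
        if min_words <= word_count <= max_words:
            kept.append(body)
    return kept

sample = "[quote]hello world[/quote] and [quote]" + " ".join(["w"] * 51) + "[/quote]"
print(quotes_within_limit(sample))
```

This also sidesteps a limitation of the last pattern above: it needs at least one whitespace character inside the quote, so a one-word quote would not match it.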
Adam Smith
  • +1. And good point talking about the definition of a word. E.g. how many words does "匈牙利共和国中国古称马扎儿" have? – Hyperboreus Mar 03 '14 at 19:41
  • That's a good question. Maybe counting characters is a better way to go? If so, could you update your answer with an example that counts characters instead of words? – Spindel Mar 03 '14 at 19:47
  • `(.+?\s){1,49}` Even though the quantifier is lazy, this will match 5000 + 49 whitespace characters to get to the `[/quote]` if it has to. –  Mar 03 '14 at 19:50
  • @sln I think you're mistaken. I don't have any quantifier after the `\s`, and my `{m,n}` is after the capture group. It must in fact be between 1 and 49 instances of `SOMETHING` followed by whitespace. – Adam Smith Mar 03 '14 at 19:55
  • Btw, performance-wise, could it be faster to count characters instead of words? – Spindel Mar 03 '14 at 20:18
  • @adsmith - No, not mistaken. The dot `.` matches whitespace. If you have a string like this `[quote] < 40 million whitespace characters > [/quote]` your regex will match it. –  Mar 03 '14 at 21:17
  • @adsmith - You could however exclude whitespace by using `\S` in lieu of the dot: `\[quote\](\S+(?:\s\S+){1,49})\[/quote\]`, but you have to work out edge conditions. –  Mar 03 '14 at 21:30
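
The point made in the comments can be checked with a small sketch. Everything beyond the answer's own pattern is an assumption for illustration: the names `DOT_RE`, `WORD_RE`, `CHAR_RE`, and `make_quote`, the lookahead that stops `\S` from swallowing the closing tag (one of the edge conditions mentioned), the `{0,49}` that lets a one-word quote through, and the arbitrary 300-character limit for the character-count variant that was asked about.

```python
import re

# Pattern from the answer: "." also matches whitespace, so the "words" it
# counts can be runs of spaces or several real words glued together.
DOT_RE = re.compile(r"\[quote\](.+?(?:\s.+?){1,49})\[/quote\]")

# Assumption: a word is a run of non-whitespace characters that stops
# before the closing tag; the lookahead keeps \S from consuming "[/quote]".
WORD = r"(?:(?!\[/quote\])\S)+"
# {0,49} instead of {1,49} so that a one-word quote also matches.
WORD_RE = re.compile(
    r"\[quote\]\s*(" + WORD + r"(?:\s+" + WORD + r"){0,49})\s*\[/quote\]"
)

# Character-based limit; 300 is an arbitrary example value.
CHAR_RE = re.compile(r"\[quote\](.{1,300}?)\[/quote\]", re.DOTALL)

def make_quote(n_words):
    """Build a [quote] block containing n_words copies of 'word'."""
    return "[quote]" + " ".join(["word"] * n_words) + "[/quote]"

blank = "[quote]" + " " * 200 + "[/quote]"

for name, pattern in (("dot", DOT_RE), ("word", WORD_RE)):
    print(name,
          bool(pattern.search(blank)),           # whitespace-only quote
          bool(pattern.search(make_quote(50))),  # 50 words
          bool(pattern.search(make_quote(51))))  # 51 words
```

On these inputs the dot-based pattern accepts the whitespace-only quote and the 51-word quote, while the `\S`-based one rejects both and still accepts 50 words. Nested `[quote]` tags are not handled by any of the three patterns.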
This is a definite misuse of regexes for a lot of reasons, not the least of which is the problem of matching [X]HTML, as @Hyperboreus noted. If you really insist, though, you could do something along the lines of ([a-zA-Z0-9]+\s){1,50}.

For the record, I don't recommend this.

rpmartz
  • Maybe I should have removed the part because it is what I am replacing the BBCode with, not part of the match itself. – Spindel Mar 03 '14 at 19:48