
I am using this re.match call to get only "proper" strings:

re.match('^[A-Za-z0-9\.\,\:\;\!\?\(\)]', str)

But I am getting some garbage too, like # and _. How is that possible? What am I doing wrong?

Thanks!

Ibrahim Najjar
striatum
  • Can you show sample input, and output? – Rohit Jain Sep 03 '13 at 14:15
  • `print re.match('^[A-Za-z0-9\.\,\:\;\!\?\(\)]', "#")` returns `None`, as expected. Please clarify the question with some examples. – Stefano Sanfilippo Sep 03 '13 at 14:18
  • You don't need the start-of-line `^` anchor with `match()` because it only finds matches at the beginning of the string. – Ibrahim Najjar Sep 03 '13 at 14:22
  • Usually you don't have to escape special characters other than `[]` inside a character-class. – Vince Sep 03 '13 at 14:22
  • You don't have to escape anything (except `^`) inside character sets, just FYI. – Henry Keiter Sep 03 '13 at 14:22
  • None of those characters are metacharacters inside the brackets, so they need not be escaped. Some of them (`,` and `;`) aren't metacharacters in regex at all. Unnecessary escapes turn your regex into unreadable character soup; I would recommend not using them. – FrankieTheKneeMan Sep 03 '13 at 14:22
  • Actually, Vince and henry, you're both kind of right, and kind of wrong - inside a character set you only **need** to escape the closing bracket `]`, but the characters `^` and `-` both have special meanings relative to their position in the set, and so may need to be escaped (or moved). – FrankieTheKneeMan Sep 03 '13 at 14:24
  • @FrankieTheKneeMan: The `]` can be moved to the first position; then you don't need to escape it. Take a look at this incredible post: http://stackoverflow.com/questions/17845014/what-does-the-regex-mean/17845034#17845034 – Casimir et Hippolyte Sep 03 '13 at 14:30
  • @CasimiretHippolyte Awesome. – FrankieTheKneeMan Sep 03 '13 at 14:35
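
The points made in the comments above can be checked directly. The following is a minimal sketch, assuming Python 3; the names `ESCAPED`, `PLAIN`, and `BRACKET_FIRST` are made up for this illustration. It compares the escaped and unescaped character classes over printable ASCII, and shows a `]` placed first in a class being treated as a literal.

```python
import re

# The class from the question, with and without the redundant backslashes.
ESCAPED = re.compile(r'[A-Za-z0-9\.\,\:\;\!\?\(\)]')
PLAIN = re.compile(r'[A-Za-z0-9.,:;!?()]')

# Every printable ASCII character, from space to tilde.
printable = [chr(code) for code in range(32, 127)]

# Both classes accept exactly the same characters.
accepted_escaped = [ch for ch in printable if ESCAPED.fullmatch(ch)]
accepted_plain = [ch for ch in printable if PLAIN.fullmatch(ch)]
print(accepted_escaped == accepted_plain)

# A ']' in first position is a literal member of the class, no escape needed.
BRACKET_FIRST = re.compile(r'[]a]')
print(BRACKET_FIRST.fullmatch(']') is not None)
print(BRACKET_FIRST.fullmatch('b') is not None)
```

Neither `#` nor `_` appears in the accepted list, which is consistent with the observation that the class itself is not what lets those characters through.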

1 Answer


Your pattern only checks the first character. To check every character up to the end of the string, use this:

re.match('^[A-Za-z0-9.,:;!?()]+$', str)

Note that the character class doesn't contain spaces, newlines or tabs. You can add them like this:

re.match(r'^[A-Za-z0-9.,:;!?()\s]+$', str)

If you want to allow empty strings, replace the `+` quantifier with `*`.
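
To see why `#` and `_` slipped through, it may help to run both patterns side by side. This is a minimal sketch, assuming Python 3; the names `FIRST_CHAR_ONLY`, `WHOLE_STRING`, and the sample strings are invented for the illustration.

```python
import re

# The question's pattern: a single character class, so only the first
# character of the string is ever tested.
FIRST_CHAR_ONLY = re.compile(r'^[A-Za-z0-9.,:;!?()]')

# The corrected pattern: one or more allowed characters, up to the end.
WHOLE_STRING = re.compile(r'^[A-Za-z0-9.,:;!?()]+$')

samples = ["abc#_", "Hello,world!", "#abc"]
for text in samples:
    first = FIRST_CHAR_ONLY.match(text) is not None
    whole = WHOLE_STRING.match(text) is not None
    print(f"{text!r}: first-char pattern={first}, whole-string pattern={whole}")
```

The string `"abc#_"` passes the first pattern because it starts with `a`, but the anchored, quantified pattern rejects it. Note that `$` also matches just before a trailing newline; if that matters, `re.fullmatch` (available since Python 3.4) without the anchors is one alternative.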

Casimir et Hippolyte
  • How come you don't use `\d` for `0-9`. Any special reason or is it just preference? –  Sep 03 '13 at 14:28
  • @iCodez: `\d` and `0-9` are not always the same since the meaning of `\d` can be "all digits in any language". `[0-9]` is not ambiguous – Casimir et Hippolyte Sep 03 '13 at 14:33
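
The difference between `\d` and `[0-9]` can be illustrated as follows. This is a sketch that assumes Python 3 with `str` patterns, where `\d` matches any Unicode decimal digit by default; the variable names are invented for the example. (In Python 2, which the question appears to use, `\d` is ASCII-only unless `re.UNICODE` is set.)

```python
import re

# ARABIC-INDIC DIGIT THREE: a decimal digit outside the ASCII range.
ARABIC_INDIC_THREE = "\u0663"

# \d on a str pattern accepts it; the explicit range does not.
unicode_digit = re.fullmatch(r'\d', ARABIC_INDIC_THREE) is not None
ascii_range = re.fullmatch(r'[0-9]', ARABIC_INDIC_THREE) is not None

# re.ASCII restricts \d to the same set as [0-9].
ascii_flag = re.fullmatch(r'\d', ARABIC_INDIC_THREE, flags=re.ASCII) is not None

print(unicode_digit, ascii_range, ascii_flag)  # True False False
```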