
What does the whitespace in Python RegEx ^(.+?(\d*)) *$ mean?

import re
pat = re.compile(r'^(.+?(\d*)) *$', re.M)

Does ` *` (a space followed by `*`) mean `\s*`?

Can the whitespace be ignored? That is, is `^(.+?(\d*)) *$` the same as `^(.+?(\d*))*$`?

I ran some examples, and it seems that the answer to both questions is no.

Thanks!

Tim
  • The space is a space; `\s` includes spaces, newlines, tabs and much more, depending on the language. – HamZa May 25 '14 at 22:58
  • No, ` *` is any number of space characters. `\s` matches more whitespace than just ` `. – Blender May 25 '14 at 22:58
  • But can whitespace be used without a `\` in front to escape it? – Tim May 25 '14 at 23:00
  • @Tim You don't need to escape a space. Basically what you have is "repeat the space zero or more times". Where space is `0x20`. Check [this post](http://stackoverflow.com/questions/9291474/how-to-choose-between-whitespace-pattern/21067350#21067350) about `\s`, it depends on the language what it matches exactly. – HamZa May 25 '14 at 23:02
  • @Tim: White space doesn't need to be escaped. If it helps you read the regex more clearly, you can use `[ ]` instead of just a space. – Blender May 25 '14 at 23:08
  • Note that if you use verbose mode (which I highly recommend), whitespace that you don't want the regex engine to ignore *does* need escaping, either with a backslash or by putting it in a character class. – user2357112 May 25 '14 at 23:17
  • `\s*` is much clearer, and almost certainly less buggy, than a naked ` `. If a plain space is intended, why not use `[ ]*` so the reader is sure it's deliberate? If you are looking for trailing spaces, tabs generally ought to count too. – Rob11311 May 25 '14 at 23:42
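
The comment about verbose mode can be checked with a minimal sketch. The patterns and test strings below are illustrative and not taken from the question. Under `re.VERBOSE`, a bare space in the pattern is dropped, so a literal space has to be written as `[ ]` or `\ `.

```python
import re

# In verbose mode a bare space in the pattern is ignored: r'a b' behaves like r'ab'.
bare = re.compile(r'a b', re.VERBOSE)
# A space inside a character class, or preceded by a backslash, stays literal.
in_class = re.compile(r'a[ ]b', re.VERBOSE)
escaped = re.compile(r'a\ b', re.VERBOSE)

print(bare.fullmatch('a b'))      # no match: the pattern is effectively 'ab'
print(bare.fullmatch('ab'))       # match
print(in_class.fullmatch('a b'))  # match
print(escaped.fullmatch('a b'))   # match
```

So if the pattern from the question were compiled with `re.VERBOSE`, its ` *` would have to become `[ ]*` or `\ *` to keep its meaning.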

2 Answers


` *` means zero or more occurrences of a space, and `$` anchors the match to the end of the line, so the ` *` is (probably) there to allow trailing spaces, but not tabs, unless the character in the pattern is actually a tab.

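No. If you remove that space, the `*` applies to the preceding group instead of to a space, which is a different pattern. Lines with invisible trailing spaces can still match, because `.` also matches a space, but the trailing spaces are no longer kept out of the captured groups.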
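
A quick way to see what removing the space does is to compare the two patterns on one line. This is a minimal sketch: the sample line `'abc 123  '` is a made-up input, and the variable names are only for illustration.

```python
import re

line = 'abc 123  '  # hypothetical sample line with two trailing spaces

with_space = re.compile(r'^(.+?(\d*)) *$', re.M)
without_space = re.compile(r'^(.+?(\d*))*$', re.M)

m1 = with_space.match(line)
m2 = without_space.match(line)

# With ' *', the trailing spaces are matched outside the groups.
print(m1.groups())
# Without it, '*' repeats the whole group; the line still matches as a whole,
# but the groups only reflect the group's last repetition.
print(m2 is not None, m2.groups())
```

Both patterns accept the line, since `.` matches spaces as well. What differs is what ends up in the captured groups.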

As it stands, it matches a line made up of one or more characters (as few as possible), followed by optional digits and optional trailing spaces.

Actually, if I were debugging this, I'd have to look up what happens on a line like "12345 " with the non-greedy matching, as I'd tend to write something like "^(\D+(\d+))\s*$" or "^(\D*.(\d+))\s*$" myself, depending on the intention. In the old days you had to code against greedy matching yourself, which means I generally avoid things like .+(\d*) out of habit. Capturing zero digits is generally a bug, as is having the first digit consumed by .+
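
The "12345 " case can be tried directly. The sketch below compares the pattern from the question with the first alternative mentioned above; the second test line, `'abc 123 '`, is an assumed example input.

```python
import re

lazy = re.compile(r'^(.+?(\d*)) *$', re.M)
strict = re.compile(r'^(\D+(\d+))\s*$', re.M)

# '.+?' must consume at least one character, so it takes the first digit
# and the inner group only sees the remaining digits.
print(lazy.match('12345 ').groups())

# The \D+ variant requires a non-digit prefix, so an all-digit line is rejected.
print(strict.match('12345 '))
print(strict.match('abc 123 ').groups())
```

Which behaviour is right depends on the intention: whether all-digit lines should match at all, and whether the inner group is meant to hold the complete number.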

Rob11311

You can test this out for yourself on an online regex tool such as http://www.regex101.com

It's just a space character.

For your info, `\s` is actually "whitespace", so it matches tabs, form feeds and other characters as well as spaces.
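
As a small illustration, the following sketch checks a handful of characters against `\s` and against a literal space. The candidate list is an arbitrary choice for the example and is not an exhaustive list of what `\s` matches.

```python
import re

candidates = [' ', '\t', '\n', '\r', '\f', '\v', 'x']

# Characters matched by the \s class versus by a literal space.
matched_by_s = [c for c in candidates if re.fullmatch(r'\s', c)]
matched_by_space = [c for c in candidates if re.fullmatch(r' ', c)]

print(matched_by_s)      # every candidate except 'x'
print(matched_by_space)  # only the space itself
```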

Vasili Syrakis