-3

I'm using Python 2.7.

I would like to know the difference between * and .* while matching the words.

Here is the Python code:

import re

exp = r'.*c'  # the pattern
line = '''abc dfdfdc dfdfeoriec'''  # the words I need to match
re.findall(exp, line)  # run the search

The output from the above code is:

['abc dfdfdc dfdfeoriec']

If I change exp value to:

exp = r'*c'

...then on execution I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Enthought\Canopy32\App\appdata\canopy-1.0.0.1160.win-x86\lib\re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
  File "C:\Program Files\Enthought\Canopy32\App\appdata\canopy-1.0.0.1160.win-x86\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
error: nothing to repeat

Here is another example:

exp = r'c.*'
line1='''cdlfjd ceee cll'''
re.findall(exp,line1)

The output from the above code is:

['cdlfjd ceee cll']

If I change the exp value to:

exp = r'c*'

...then on execution I get the following output:

['c', '', '', '', '', '', '', 'c', '', '', '', '', 'c', '', '', '']

Please explain this behavior.

Alan Moore
Bhairav Gooli

6 Answers

2

From the docs:

'*'

Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

In r'*c' there is no preceding expression for * to repeat, so it is an error.
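To see this without a traceback interrupting the session, the error can be caught. This is a minimal sketch using only the standard `re` module; the helper name `try_compile` is made up for illustration, and the exact wording of the error message may differ between Python versions.

```python
import re

def try_compile(pattern):
    """Return (compiled_pattern, None) on success or (None, message) on failure."""
    try:
        return re.compile(pattern), None
    except re.error as exc:
        # re.error is raised when the pattern itself is malformed,
        # for example a quantifier with nothing in front of it.
        return None, str(exc)

for pattern in (r'*c', r'.*c', r'c*'):
    compiled, message = try_compile(pattern)
    status = 'ok' if compiled is not None else 'invalid: ' + message
    print('%-6r %s' % (pattern, status))
```

Only r'*c' should be reported as invalid; the other two compile because * has something to repeat ('.' and 'c' respectively).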

>>> import re
>>> strs = "ccceeeddc"
>>> re.findall(r'c*',strs)
['ccc', '', '', '', '', '', 'c', '']
         |   |   |   |   |        |
         e   e   e   d   d        nothing found after last `c`

c* means: find all 'c's (zero or more) that are next to each other and return them as one match. When it reaches an 'e', no 'c' is found there, so it returns an empty string.
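The positions of those empty matches can be made visible with re.finditer, which yields match objects instead of plain strings. A small sketch on the same input (the variable name `spans` is just for illustration):

```python
import re

strs = "ccceeeddc"

# Each match object knows where it started and ended, so zero-length
# matches (start == end) show up explicitly instead of as a bare ''.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer(r'c*', strs)]

for start, end, text in spans:
    print('%d-%d %r' % (start, end, text))
```

Every entry whose start equals its end is one of the empty strings in the findall output: a position where zero 'c's were matched.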

'.*c': Match everything up to and including the last c found.

>>> strs = "abccccfoocbar"
>>> exp = r'.*c'
>>> re.findall(exp,strs)
['abccccfooc']

>>> strs = "qwertyu"
>>> re.findall(exp,strs)  #no 'c' found
[]

'c.*': This is the exact opposite of the last one: match the first 'c' and all characters that follow it (up to the end of the line).

>>> exp = r'c.*'
>>> strs = "abccccfoocbar"
>>> re.findall(exp,strs)
['ccccfoocbar']

>>> strs = "qwertyu"
>>> re.findall(exp,strs) #no 'c' found
[]
Ashwini Chaudhary
1

The docs are quite clear: '*' repeats the preceding expression 0 or more times. It seems you are ignoring the regular expression docs and translating its meaning from some other domain you know, such as DOS wildcards or shell expansions.
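If shell-style wildcards are indeed the mental model, the difference can be shown side by side with the standard fnmatch module. This is only a sketch: the word list reuses the question's words, plus 'cab', which is added here as a counter-example and is not from the question.

```python
import fnmatch
import re

# 'cab' is a made-up extra word that does NOT end in 'c'.
words = ['abc', 'dfdfdc', 'dfdfeoriec', 'cab']

# Shell-style wildcard: '*' on its own stands for "any run of characters".
glob_hits = [w for w in words if fnmatch.fnmatchcase(w, '*c')]

# Regular expression: '*' only repeats what precedes it, so '.' is needed.
# The trailing '$' requires the 'c' to be at the end of the word.
regex_hits = [w for w in words if re.match(r'.*c$', w)]

# fnmatch.translate shows the regex that a wildcard pattern corresponds to.
print(fnmatch.translate('*c'))
print(glob_hits)
print(regex_hits)
```

Both lists should contain the same three words, which illustrates that the wildcard '*c' corresponds to the regex '.*c' (anchored), not to '*c'.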

John La Rooy
1

In regular expressions, 'x*' matches a sequence of 0 or more 'x' (where x can be any expression). By default, '*' is greedy, which means it tries to match as many characters as possible.
Also, the '.' character in regular expressions matches any single character (by default, any character except a newline).

Thus, .* means: Match a sequence of length 0 or more that contains any characters.

Explanation of your patterns

.*c: Match the longest possible sequence of any characters, followed by the character 'c'.

*c: Match the longest possible sequence containing...oops, you didn't specify what is allowed in the sequence, so an ERROR is thrown.

c.*: Match the character 'c' followed by the longest possible sequence of any characters.

c*: Match the longest possible sequence containing only the character 'c'. Note that "longest possible" can also mean "of 0 length" (that's why you get those empty strings).
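To make "longest possible" concrete, here is a small sketch that runs the greedy pattern from the question next to its non-greedy form. The non-greedy quantifier '*?' is not part of the question; it is shown only as a contrast.

```python
import re

line = 'abc dfdfdc dfdfeoriec'

greedy = re.findall(r'.*c', line)   # '.*' takes as much as it can
lazy = re.findall(r'.*?c', line)    # '.*?' takes as little as it can

print(greedy)  # ['abc dfdfdc dfdfeoriec']
print(lazy)    # ['abc', ' dfdfdc', ' dfdfeoriec']
```

The greedy version runs to the last 'c' in the line and returns one match; the non-greedy version stops at the first 'c' it can reach each time, so it returns three.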


You might find these links useful for further reading on regular expressions:

gkalpak
0

It's not a Python thing but a regex thing in general. * means "match any number of the preceding expression", and . means "any character". So .*c means "match any number of characters, followed by the character 'c'". *c is illegal because there is no expression before the *.

When you do c* it means "match any number of consecutive 'c'"

I suggest you do some reading about regular expressions.

JJJ
Tomer Arazy
0

* just means that the expression in front of it can match zero or more times, hence *c doesn't work. But c* will match c or cc or ccc... :) (and also the empty string). Whereas . means match any ONE char, so .* means match any sequence of arbitrary chars.
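A quick way to check these claims is to test whether each pattern matches a whole string. A rough sketch; `matches_whole` is a hypothetical helper written for this example, and the test strings are made up.

```python
import re

def matches_whole(pattern, text):
    # Python 2.7 has no re.fullmatch, so wrap the pattern in a group
    # and require the match to reach the end of the string with \Z.
    return re.match('(?:' + pattern + r')\Z', text) is not None

for text in ('', 'c', 'ccc', 'cab'):
    print('%-6r c*: %-6s .: %-6s .*: %s' % (
        text,
        matches_whole('c*', text),
        matches_whole('.', text),
        matches_whole('.*', text),
    ))
```

c* accepts '', 'c' and 'ccc' but not 'cab'; a lone . accepts only the one-character string; .* accepts all four.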

Maybe you should consider reading an introduction to regex!

enpenax
0

* is an operator meaning 0 or more of whatever is matched to its left.

. is an operator meaning match anything.

c* would mean match 0 or more c's

.*c will match to 0 or more of anything, and then a c.

DivinusVox