I am a beginner programmer and I recently started learning how to use Python's regular expression module (re). After reading theory and examples on various websites, I decided to experiment with REGEX myself to practice a bit. However, I stumbled upon a problem I cannot wrap my head around.

As long as the regular expression is simple, I think I understand what's going on, but when () brackets appear, my logic completely falls apart. There is clearly a massive hole in my understanding of REGEX and I need help finding it.

Here is an example of my problem:

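import re

text = "Some numbers here: 1, 12 , 123, 123.4, -123.45 , 123456."
# option1: optional "-", digits, optional ".", zero or more digits
option1 = re.findall(r"-?[0-9]+\.?[0-9]*", text)
print(option1)
# option2: same, but at least one digit is required at the end
option2 = re.findall(r"-?[0-9]+\.?[0-9]+", text)
print(option2)
# option3: the "." and the following digits are wrapped in an optional group
option3 = re.findall(r"-?[0-9]+(\.[0-9]+)?", text)
print(option3)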

I got the following output:

['1', '12', '123', '123.4', '-123.45', '123456.']
['12', '123', '123.4', '-123.45', '123456']
['', '', '', '.4', '.45', '']

My question is: why is the result of option 3 "['', '', '', '.4', '.45', '']" instead of "['1', '12', '123', '123.4', '-123.45', '123456']"?

Now I will describe my thought process, so that it is easier for you to point out what I am doing wrong. I think I understand what is going on in option1 and option2, because the result is exactly what I expected.

In the first expression, "-?" makes "-" an optional character at the beginning, followed by [0-9]+, which matches at least one digit. Then \.? allows a single "." after the digits, and finally [0-9]* allows zero or more further digits at the end. The output is exactly what I expected: every number matches, including "123456." (with the dot).

In the second expression, everything is the same as in the first, except that it ends with [0-9]+ instead of [0-9]*. As I expected, the output is similar, but 123456 is matched without the "." (because there must be at least one digit after the "."). However, "1" is no longer matched, because [0-9]+ appears twice, so a match needs at least two digits.
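A minimal sketch to check that reasoning about the second expression, assuming re.fullmatch is an acceptable way to test one candidate string at a time. The sample strings below are hypothetical and chosen only for illustration:

```python
import re

# Pattern from option2: each of the two [0-9]+ parts needs at least one digit.
pattern = re.compile(r"-?[0-9]+\.?[0-9]+")

# Hypothetical sample strings, not taken from the original text.
samples = ["1", "12", "123456.", "-1.5"]

# fullmatch succeeds only if the pattern consumes the entire string.
results = {s: bool(pattern.fullmatch(s)) for s in samples}

for s, ok in results.items():
    print(repr(s), ok)
```

If the reasoning holds, a single digit and a number with a trailing dot should both be rejected as whole strings, while "12" and "-1.5" should be accepted.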

With the third option, I wanted to match "1" and also match "123456" without the ".". My reasoning was this: -? is an optional "-", [0-9]+ is at least one digit, and "\.[0-9]+" (a dot followed by at least one digit) should be an optional block. So I put it in brackets and placed ? after it, so that the whole "\.[0-9]+" block may occur once or not at all.

However, instead of getting "['1', '12', '123', '123.4', '-123.45', '123456']" I got this: "['', '', '', '.4', '.45', '']"

I don't understand what is going on. It looks like the bracketed section works fine (.45, .4), but the rest of the expression seems to be broken. There is clearly a hole in my understanding of brackets, and I would appreciate it if someone could help me see what I am doing or understanding wrong.
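One way to probe this, sketched under the assumption that the brackets change what re.findall returns rather than what the pattern matches. The re documentation says that when a pattern contains a group, findall returns the group contents instead of the whole match. The variable names below are only illustrative:

```python
import re

text = "Some numbers here: 1, 12 , 123, 123.4, -123.45 , 123456."

# Same pattern as option3: (...) is a capturing group.
capturing = r"-?[0-9]+(\.[0-9]+)?"
# Same pattern with (?:...), a non-capturing group.
non_capturing = r"-?[0-9]+(?:\.[0-9]+)?"

# With one capturing group, findall returns only what the group captured.
groups_only = re.findall(capturing, text)

# finditer yields match objects; group(0) is always the whole match.
whole_matches = [m.group(0) for m in re.finditer(capturing, text)]

# With no capturing group, findall returns the whole matches.
no_group = re.findall(non_capturing, text)

print(groups_only)
print(whole_matches)
print(no_group)
```

If this assumption is right, the first list should reproduce the output of option 3, while the other two should contain the full numbers, which would suggest the expression itself matches as intended and only the reported part differs.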

Galbatrollix