Inconsistent regex behaviour in Python

Question

I have a regular expression that I have verified works correctly; the proof is here: https://regex101.com/r/ffSVuD/6

Unfortunately when I use the same regex within some Python code I do not get the same behaviour. The regex does get a match, but it does not find the same match groups.

Here is some demo code:

import re
ddl_string = """
CREATE TABLE default.test_parse_partitioned_table__using_parquet_1_082921496561 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet
OPTIONS (
  serialization.format \\'1\\'
)
PARTITIONED BY (DATA2, DATA3)
"""
regex = r'CREATE +?(TEMPORARY +)?TABLE *(?P<db>.*?\.)?(?P<table>.*?)\((?P<col>.*?)\).*?USING +([^\s]+)( +OPTIONS *\([^)]+\))?( *PARTITIONED BY \((?P<pcol>.*?)\))?'
match = re.search(regex, ddl_string, re.MULTILINE | re.DOTALL)
if match.group("pcol"):
    print(match.group("pcol").strip())
else:
    print('did not find any pcols in {matches}'.format(matches=match.groups()))

which returns:

did not find any pcols in (None, 'default.', 'test_parse_partitioned_table__using_parquet_1_082921496561 ', 'DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT', 'parquet', None, None, None)

My intention is to populate match.group("pcol") with DATA2, DATA3, but as you can see that is not happening. In my regex verification at https://regex101.com/r/ffSVuD/6 it does find that match.

I have fiddled around quite a lot trying to get a regex that returns what I need, but without success, hence this post. Can anyone help?

1) In your python code you've enabled the `re.MULTILINE | re.DOTALL` flags. 2) There's a small difference in the regex pattern: `CREATE +` vs `CREATE +?`. 3) The text you're using the regex on isn't the same. — Aran-Fey, Mar 09 '18 at 10:28
Your regex does not match when PARTITIONED BY (and OPTIONS) is on a new line — Daniel Roseman, Mar 09 '18 at 10:28
thank you both, you are both correct. I had ddl_string over multiple lines. Fixed now, I think. — jamiet, Mar 09 '18 at 10:31
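
Following up on the comments above, here is a minimal sketch of one possible fix. The literal spaces in ` +OPTIONS` and ` *PARTITIONED BY` cannot match the newlines that separate the clauses in `ddl_string`, and `re.DOTALL` only changes what `.` matches. Replacing those spaces with `\s` lets the optional groups span line breaks. The helper name `extract_partition_columns` is hypothetical, and the adjusted pattern is an assumption about the intent, not the pattern from the regex101 link.

```python
import re

ddl_string = """
CREATE TABLE default.test_parse_partitioned_table__using_parquet_1_082921496561 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet
OPTIONS (
  serialization.format \\'1\\'
)
PARTITIONED BY (DATA2, DATA3)
"""

# Same structure as the question's regex, but \s+ / \s* replace the literal
# spaces between clauses so that newlines are accepted there as well.
REGEX = re.compile(
    r'CREATE +(TEMPORARY +)?TABLE *(?P<db>.*?\.)?(?P<table>.*?)\((?P<col>.*?)\)'
    r'.*?USING\s+(\S+)'
    r'(\s+OPTIONS\s*\([^)]+\))?'
    r'(\s*PARTITIONED BY\s*\((?P<pcol>.*?)\))?',
    re.DOTALL,  # still needed: ".*?" before USING has to cross a newline
)


def extract_partition_columns(ddl):
    """Return the text inside PARTITIONED BY (...), or None if absent."""
    match = REGEX.search(ddl)
    if match is None or match.group("pcol") is None:
        return None
    return match.group("pcol").strip()


print(extract_partition_columns(ddl_string))  # DATA2, DATA3
```

`re.MULTILINE` is left out here because it only affects `^` and `$`, which this pattern does not use.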

score 0 · Answer 1 · answered Jun 19 '22 at 18:06

I'm not sure, but my impression is that when there are multiple matches with named groups, it is hard to retrieve them by name.

Python code to run:

import re

ddl_string = """
CREATE TABLE default.test_parse_partitioned_table__using_parquet_1_082921496561 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet
OPTIONS (
  serialization.format \\'1\\'
)
PARTITIONED BY (DATA2, DATA3)
"""
regex = r'CREATE +?(TEMPORARY +)?TABLE *(?P<db>.*?\.)?(?P<table>.*?)\((?P<col>.*?)\).*?USING +([^\s]+)( +OPTIONS *\([^)]+\))?( *PARTITIONED BY \((?P<pcol>.*?)\))?'

text2 = "DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT, data9"
pattern2 = re.compile(r"(?P<pcol>((((DATA)|(data))\d)*)?)*")

text = text2  # or: ddl_string
pattern = pattern2  # already compiled; or: re.compile(regex)

def part01():
    print("\npart 01")
    lst_data = []
    # Note: re.finditer() always returns an iterator, never None.
    if matches is not None:
        print(' Found!')

        print(matches)
        for elem in matches:
            if elem[0] != '':
                print('--> {0}'.format(elem[0]))
                lst_data.append(elem[0])
    else:
        print('No match Found!\n')
    return lst_data
    
def part02():
    print("\npart 02")
    if matches is not None:
        print(' Found!')
        print(matches)

        #print (matches.groups)
        for elem in matches:
            if elem.group(0) != '':
                print(elem.group(0))

def part03():
    print("\npart 03")

    if matches is not None:
        lst_grp = list(matches)
        print(' lst_grp[0]               : {0}'.format(lst_grp[0]))
        print(' lst_grp[0].re.groupindex : {0}'.format(lst_grp[0].re.groupindex))

        # https://stackoverflow.com/questions/28856238/how-to-get-group-name-of-match-regular-expression-in-python
        # groupdict - None means a group never was used in a match
        print(' lst_grp[0].groupdict()   : {0}'.format(lst_grp[0].groupdict()))

        # Iterate over the list: list(matches) above already exhausted the iterator.
        for match in lst_grp:
            print(" last : {} ".format(match.lastgroup))
    
def part04():
    print("\npart 04")
    match = pattern.match(text)
    # groupdict() lists every named group in the pattern, with None for groups
    # that did not participate, so test the value rather than the key.
    found = match is not None and match.group('pcol') is not None
    print(found)
    print("Found 'pcol'" if found else "Did not find 'pcol'")
    if match is not None:
        print(match.group('pcol'))

#    names_used = [name for name, value in match.groupdict().items() if value is not None]

# finditer() returns a one-shot iterator, so create a fresh one for each part.
matches = re.finditer(pattern, text)
returnedData01 = part01()
print(' returned --> {0}'.format(returnedData01))
# print (type(returnedData01))

matches = re.finditer(pattern, text)
part02()

matches = re.finditer(pattern, text)
part03()

matches = re.finditer(pattern, text)
part04()
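
For comparison, a more compact sketch of reading a named group from every match. It assumes a simplified pattern with a single named group and no nested quantifiers, which is not the pattern used above; the helper name `named_group_values` is hypothetical.

```python
import re

text = "DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT, data9"

# Assumed simplification: one named group per column-like token.
pattern = re.compile(r"(?P<pcol>(?:DATA|data)\d)")


def named_group_values(pattern, text, name):
    """Collect the value of the named group from each successive match."""
    return [m.group(name) for m in pattern.finditer(text)]


print(named_group_values(pattern, text, "pcol"))
# ['DATA4', 'DATA5', 'DATA2', 'DATA3', 'data9']
```

Each match object carries its own group values, so the name lookup works per match rather than across all matches at once.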
