RegEx works, but I don't know why!? Explanation

Question

Not to anger the Python gods, but I need an explanation on something that works. I'm working through the output of ARP tables in Cisco routers. I'm filtering everything before the IP address and after the MAC address. (Easy) Then I needed to filter out the ARP Age in between the IP & MAC. This could be a varying number of spaces followed by a hyphen or 1 to 3 digits, then more spaces.

I was catching the hyphen or a single digit, but never 2 or 3 digits and the surrounding spaces. I had to put in pattern 4 to make it work. Shouldn't the \d+ in strPattern3 catch [spaces][hyphen or digits][spaces]?

    strPattern3 = re.compile('(\s+[-\d+]\s+)')  #Catch any spaces then a hyphen or digits followed by spaces (ARP age)
    strPattern4 = re.compile('(\s+\d+\s+)')     #Catch any spaces then any digits then any more spaces (ARP age)

    szResult = strPattern3.sub('\t', szResult)
    szResult = strPattern4.sub('\t', szResult)


    SAMPLE ARP TABLE
        Internet  10.241.130.14         159   f0d5.bf04.e3b8  ARPA   GigabitEthernet0/0.20
        Internet  10.241.130.17           1   ecf4.bb6b.918a  ARPA   GigabitEthernet0/0.20
        Internet  10.241.130.19          47   f01f.af10.7a45  ARPA   GigabitEthernet0/0.20
        Internet  10.241.130.20           0   5475.d0ab.a86c  ARPA   GigabitEthernet0/0.20
        Internet  159.142.132.97          -   6073.5cc5.6598  ARPA   GigabitEthernet0/0.20

wkl · Accepted Answer · 2017-10-04T18:09:29.327

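Using the [] to surround -\d+ means you're using a character class in Python regular expressions. A character class matches exactly one character out of the set between the [], so [-\d+] matches a single literal -, a single digit, or a single literal + character. The + quantifier loses its meaning inside a character class, which is why strPattern3 catches a hyphen or one digit but never two or three digits.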
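To see the character-class behavior directly, here is a minimal sketch. The test strings are trimmed from the question's sample table; the variable names are only illustrative.

```python
import re

# One character only: a literal '-', a digit, or a literal '+'.
char_class = re.compile(r'[-\d+]')

# Every match is a single character, and '+' is matched literally.
singles = char_class.findall('15+-')
print(singles)

# The question's pattern: whitespace, ONE character from the class, whitespace.
question_pattern = re.compile(r'(\s+[-\d+]\s+)')
one_digit = question_pattern.search('10.241.130.17           1   ecf4.bb6b.918a')
three_digits = question_pattern.search('10.241.130.14         159   f0d5.bf04.e3b8')
print(one_digit is not None)     # single-digit age is found
print(three_digits is not None)  # three-digit age is not
```

The three-digit age is missed because no single character in 159 has whitespace on both sides.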

If you want to match a sequence of spaces, followed by hyphen or 1-3 digits, then more spaces, your regex would look more like this:

pattern = re.compile(r'(\s+(?:-|\d{1,3})\s+)')  # raw string keeps \s and \d from being treated as string escapes
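As a quick check, this sketch runs that pattern over three rows from the question's sample table. It assumes the rows have already been trimmed to the IP, age, and MAC columns, as the question describes; the names rows and cleaned are made up for illustration.

```python
import re

pattern = re.compile(r'(\s+(?:-|\d{1,3})\s+)')

# Rows already trimmed to "IP  age  MAC" (an assumption based on the question).
rows = [
    '10.241.130.14         159   f0d5.bf04.e3b8',
    '10.241.130.17           1   ecf4.bb6b.918a',
    '159.142.132.97          -   6073.5cc5.6598',
]

# Replace the whole age column, including surrounding spaces, with one tab.
cleaned = [pattern.sub('\t', row) for row in rows]
for row in cleaned:
    print(repr(row))
```

With this, a single substitution covers the hyphen, one-digit, and three-digit cases, so the second pattern from the question should no longer be needed.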

TemporalWolf · Answer 2 · 2017-10-04T18:32:20.277

First, you don't need regex for this issue:

for line in s.split('\n'):  # s holds the ARP table text; or open a file and read it line by line
    if "ARPA" in line:  # or some other indicator of target lines
        sline = line.split()  # split on whitespace: ['Internet', ip, age, mac, 'ARPA', interface]
        ip, mac = sline[1], sline[3]
        print(ip, mac)

yields

10.241.130.14 f0d5.bf04.e3b8
10.241.130.17 ecf4.bb6b.918a
10.241.130.19 f01f.af10.7a45
10.241.130.20 5475.d0ab.a86c
159.142.132.97 6073.5cc5.6598

If you must use regex, in the future I'd recommend using regex101.com or some other regex tester on sample data. It offers both visual highlighting of matches and a piece-by-piece explanation of the regex itself.

In this case, the regex you're looking for is probably \s+(?:-|\d+)\s+ which is:

at least one whitespace character,
either a dash or one or more digits,
at least one whitespace character.

(?:a|b) is a non-capturing group. It limits the alternation | to a and b, so the "or" does not split the rest of the regex. The outer capturing group () is not needed for re.sub.
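To illustrate why the group matters, here is a small sketch comparing the grouped pattern with the same pattern minus the group. The ungrouped variant is only a counterexample, and the sample line is trimmed from the question's table.

```python
import re

line = '10.241.130.19          47   f01f.af10.7a45'

# Alternation confined to the group: whitespace, then (dash OR digits), then whitespace.
grouped = re.compile(r'\s+(?:-|\d+)\s+')

# Without the group, | splits the WHOLE pattern: (whitespace + dash) OR (digits + whitespace).
ungrouped = re.compile(r'\s+-|\d+\s+')

with_group = grouped.sub('\t', line)
without_group = ungrouped.sub('\t', line)

print(repr(with_group))     # '10.241.130.19\tf01f.af10.7a45'
print(repr(without_group))  # the last octet of the IP is eaten as well
```

Note that grouped has no outer capturing parentheses and re.sub still replaces the full match.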
