I am trying to create a list of dictionaries using a regex positive lookbehind. I tried two different variations:

Variation 1

import re

string = '146.204.224.152 - lubo233'

for item in re.finditer( "(?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*)(?P<user_name>(?<= - )[a-z]*[0-9]*)", string ):
    print(item.groupdict())

Variation 2

string = '146.204.224.152 - lubo233'
for item in re.finditer( "(?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*)(?<= - )(?P<user_name>[a-z]*[0-9]*)", string ):
    print(item.groupdict())

Desired Output

{'host': '146.204.224.152', 'user_name': 'lubo233'}

Question/Issue

In both cases, I am unable to eliminate the substring " - ".

Adding the positive lookbehind (?<= - ) breaks my pattern.

Can anyone help me identify my mistake? Thanks.

Kane Chew

2 Answers

I'd suggest you remove the positive lookbehind and simply put the separator between the two parts as ordinary text.

Also, some improvements:

  • \. instead of [.]

  • [0-9]{1,3} instead of [0-9]*

  • (?:\.[0-9]{1,3}){3} instead of \.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}

Add a .* before the " - " to handle any extra words that may appear in between:

rgx = re.compile(r"(?P<host>[0-9]{1,3}(?:\.[0-9]{1,3}){3}).* - (?P<user_name>[a-z]*[0-9]*)")

vals = ['146.204.224.152 aw0123 abc - lubo233',
        '146.204.224.152 as003443af - lubo233',
        '146.204.224.152 - lubo233']

for val in vals:
    for item in rgx.finditer(val):
        print(item.groupdict())

# Gives
{'host': '146.204.224.152', 'user_name': 'lubo233'}
{'host': '146.204.224.152', 'user_name': 'lubo233'}
{'host': '146.204.224.152', 'user_name': 'lubo233'}
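
One thing to keep in mind with the .* above is that it is greedy. A minimal sketch of the difference, assuming a made-up input that contains the separator twice (the string val and the names greedy and lazy are illustrative only, and the host pattern here requires at least one digit per group):

```python
import re

# Hypothetical input: " - " appears twice between the host and the end.
val = '146.204.224.152 foo - bar - lubo233'

host = r"(?P<host>[0-9]{1,3}(?:\.[0-9]{1,3}){3})"
user = r"(?P<user_name>[a-z]*[0-9]*)"

# Greedy .* runs to the end of the line, then backtracks to the LAST " - ".
greedy = re.search(host + r".* - " + user, val)

# Lazy .*? grows one character at a time and stops at the FIRST " - ".
lazy = re.search(host + r".*? - " + user, val)

print(greedy.group('user_name'))  # lubo233
print(lazy.group('user_name'))    # bar
```

If the user name is always the last field, the greedy form is probably the one you want.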
azro
  • I understand what you mean. But what happens when there is an unknown length of characters between the two substrings: '146.204.224.152' and ' - lubo233'? Example: string = '146.204.224.152 aw0123 abc - lubo233' or string = '146.204.224.152 as003443af - lubo233' – Kane Chew Nov 28 '20 at 10:16
  • @KaneChew Please edit your initial post to add several examples of input strings. Also, my code does the same as yours and it's working; your initial code did NOT handle possibly different content in the middle ;) – azro Nov 28 '20 at 10:24

The reason that the positive lookbehind is not working is that you are trying to match:

  • (?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*) an IP address
  • immediately followed by a user name pattern: (?P<user_name>(?<= - )[a-z]*[0-9]*) that should be preceded by (?<= - )

So once the regex engine has consumed the IP address, you are asking it to match a user name at that exact position, and the lookbehind (?<= - ) requires the three characters just before that position to be " - ". What actually precedes that position is the end of the IP address ("152"). In other words, once the IP pattern has been matched, the string left is:

- lubo233

The pattern that must match immediately at that position, as with re.match, is:

(?P<user_name>(?<= - )[a-z]*[0-9]*) 

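which does not match there, because a lookbehind only inspects text the engine has already passed; it does not skip over " - ". To illustrate my point, this pattern, which consumes " - " before the user name, works: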

import re

string = '146.204.224.152 - lubo233'
for item in re.finditer(r"((?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*)( - ))(?P<user_name>(?<= - )[a-z]*[0-9]*)", string):
    print(item.groupdict())

Output

{'host': '146.204.224.152', 'user_name': 'lubo233'}
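
The same point can be checked piece by piece. The following is only a sketch; the variable names user, glued and fixed are made up for illustration. It shows that the lookbehind works on its own, fails when glued directly to the host pattern, and becomes unnecessary once the separator is consumed:

```python
import re

string = '146.204.224.152 - lubo233'

# On its own the lookbehind is fine: search() moves forward until it reaches
# a position whose three preceding characters are " - ".
user = re.search(r"(?<= - )[a-z]*[0-9]*", string)
print(user.group(), user.start())  # lubo233 18

# Glued directly to the host pattern (Variation 2 from the question) it can
# never succeed: the host always ends in a digit or a dot, never in " - ".
glued = re.compile(
    r"(?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*)(?<= - )(?P<user_name>[a-z]*[0-9]*)"
)
print(list(glued.finditer(string)))  # []

# Consuming the separator as ordinary text moves the engine past " - ",
# so no lookbehind is needed at all.
fixed = re.compile(
    r"(?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*) - (?P<user_name>[a-z]*[0-9]*)"
)
print(fixed.search(string).groupdict())
```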

If you need to match an arbitrary number of characters between the two patterns, you could do:

import re

string = '146.204.224.152 adfadfa - lubo233'
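# The host needs all four dot-separated groups; .* then absorbs anything before " - "
for item in re.finditer(r"((?P<host>\d{1,3}[.]\d{1,3}[.]\d{1,3}[.]\d{1,3})(.* - ))(?P<user_name>(?<= - )[a-z]*[0-9]*)", string):
    print(item.groupdict())

Output

{'host': '146.204.224.152', 'user_name': 'lubo233'}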
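
Since the stated goal in the question is a list of dictionaries, here is a minimal sketch of collecting the matches rather than printing them. It assumes the input is a single multi-line string built from the example lines in this thread; the names log, pattern and records are hypothetical, and the lookbehind is dropped because the separator is matched as ordinary text.

```python
import re

# Assumed input: the example lines from this thread joined into one string.
log = "\n".join([
    '146.204.224.152 - lubo233',
    '146.204.224.152 adfadfa - lubo233',
    '146.204.224.152 as003443af - lubo233',
])

# .* does not cross newlines by default, so each match stays within one line.
pattern = re.compile(
    r"(?P<host>\d{1,3}(?:\.\d{1,3}){3}).* - (?P<user_name>[a-z]*[0-9]*)"
)

# One dictionary per match, keyed by the named groups.
records = [m.groupdict() for m in pattern.finditer(log)]
print(records)
```

Each entry in records has the same shape as the desired output in the question.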
Dani Mesejo
  • Following your train of thought, "once the regex engine has consumed the IP address pattern", the following substring is left: " - lubo233". In this case, isn't " - " preceding the user_name? Or am I not understanding regex properly? – Kane Chew Nov 28 '20 at 10:35
  • @KaneChew positive lookbehind does not consume the string. As you said the string left is " - lubo233" and you are telling it, that it should be preceded by " - ". – Dani Mesejo Nov 28 '20 at 10:44