Using regex to create a list of dictionaries with positive lookbehind

Question

I am trying to create a list of dictionaries using a regex with a positive lookbehind. I tried two variations:

Variation 1

import re

string = '146.204.224.152 - lubo233'

for item in re.finditer( "(?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*)(?P<user_name>(?<= - )[a-z]*[0-9]*)", string ):
    print(item.groupdict())

Variation 2

string = '146.204.224.152 - lubo233'
for item in re.finditer( "(?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*)(?<= - )(?P<user_name>[a-z]*[0-9]*)", string ):
    print(item.groupdict())

Desired Output

{'host': '146.204.224.152', 'user_name': 'lubo233'}

Question/Issue

In both cases, I am unable to eliminate the substring " - ".

The use of positive lookbehind (?<= - ) renders my code wrong.

Can anyone help me identify my mistake? Thanks.

azro · Answer 1 · 2020-11-28T10:27:50.680


I'd suggest removing the positive lookbehind and simply writing the separator literally between the two groups.

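Some further improvements:

\. instead of [.]
[0-9]{,3} instead of [0-9]* (note that {,3} also accepts zero digits; {1,3} is stricter)
(?:\.[0-9]{,3}){3} instead of \.[0-9]{,3}\.[0-9]{,3}\.[0-9]{,3}

Add .* before the " - " separator to skip any text that may appear between the host and the user name: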

import re

rgx = re.compile(r"(?P<host>[0-9]{,3}(?:\.[0-9]{,3}){3}).* - (?P<user_name>[a-z]*[0-9]*)")

vals = ['146.204.224.152 aw0123 abc - lubo233',
        '146.204.224.152 as003443af - lubo233',
        '146.204.224.152 - lubo233']

for val in vals:
    for item in rgx.finditer(val):
        print(item.groupdict())

# Gives
{'host': '146.204.224.152', 'user_name': 'lubo233'}
{'host': '146.204.224.152', 'user_name': 'lubo233'}
{'host': '146.204.224.152', 'user_name': 'lubo233'}
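
Since the goal in the title is a list of dictionaries, here is a minimal sketch of how the matches could be collected instead of printed. It is not part of the original answer: the multi-line `log_text` and the names `pattern` and `records` are made up for illustration, and the octets use `{1,3}` rather than `{,3}` so that each one needs at least one digit.

```python
import re

# Hypothetical multi-line input built from the three sample strings above.
log_text = """146.204.224.152 aw0123 abc - lubo233
146.204.224.152 as003443af - lubo233
146.204.224.152 - lubo233"""

# Assumption: {1,3} instead of {,3}, so an octet cannot be empty.
# '.' does not match a newline, so '.*' never runs past the end of a line.
pattern = re.compile(
    r"(?P<host>[0-9]{1,3}(?:\.[0-9]{1,3}){3}).* - (?P<user_name>[a-z]*[0-9]*)"
)

# One dictionary per match, gathered into a list.
records = [m.groupdict() for m in pattern.finditer(log_text)]
print(records)
```

Each element of `records` has the same shape as the desired output in the question, so the list can be passed on to whatever consumes it.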

edited Nov 28 '20 at 10:27

answered Nov 28 '20 at 10:11


I understand what you mean. But what happens when there is an unknown length of characters between the two substrings: '146.204.224.152' and ' - lubo233'? Example: string = '146.204.224.152 aw0123 abc - lubo233' or string = '146.204.224.152 as003443af - lubo233' – Kane Chew Nov 28 '20 at 10:16
@KaneChew Please edit your initial post to add several examples of input strings. Also, my code does the same as yours and it's working; your initial code did NOT handle different content in the middle ;) – azro Nov 28 '20 at 10:24

Dani Mesejo · Answer 2 · 2020-11-28T10:51:25.397

The positive lookbehind fails because the pattern asks for:

(?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*), an IP address,
immediately followed by (?P<user_name>(?<= - )[a-z]*[0-9]*), a user name that must be preceded by " - ".

A lookbehind does not consume any characters; it only checks the text just before the current position. Once the regex engine has consumed the IP address, the current position is right after "152", so the three characters before it are "152", not " - ", and the lookbehind fails. In other words, once the IP pattern has been matched the string left is:

" - lubo233"

The pattern that must match immediately at that point, as with re.match, is:

(?P<user_name>(?<= - )[a-z]*[0-9]*)

which does not match, because nothing in the pattern consumes " - " first. To illustrate the point, this pattern works because it consumes " - " before the user name:

import re

string = '146.204.224.152 - lubo233'
for item in re.finditer(r"((?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*)( - ))(?P<user_name>(?<= - )[a-z]*[0-9]*)", string):
    print(item.groupdict())

Output

{'host': '146.204.224.152', 'user_name': 'lubo233'}
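
The same point can be checked position by position. A minimal sketch, assuming the sample string from the question: a compiled pattern's `match(string, pos)` anchors the attempt at `pos` but still lets the lookbehind see the characters before it. The offsets 15 and 18 are worked out by hand for this particular string, and the name `user` is only illustrative.

```python
import re

string = '146.204.224.152 - lubo233'

# The user-name part of the question's pattern, on its own.
user = re.compile(r"(?<= - )[a-z]*[0-9]*")

# Position 15 is right after the IP address: the three characters
# before it are '152', so the lookbehind fails.
print(repr(string[12:15]), user.match(string, 15))   # '152' None

# Position 18 is right after ' - ': the lookbehind succeeds.
print(repr(string[15:18]), user.match(string, 18).group())   # ' - ' lubo233
```

So the lookbehind itself is fine; the pattern simply has to get past " - " some other way before the user name starts.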

If you need to match an arbitrary number of characters between the two patterns, you could do:

import re

string = '146.204.224.152 adfadfa - lubo233'
# \d{1,3}(?:[.]\d{1,3}){3} matches all four octets; (.* - ) consumes any text up to and including " - "
for item in re.finditer(r"((?P<host>\d{1,3}(?:[.]\d{1,3}){3})(.* - ))(?P<user_name>(?<= - )[a-z]*[0-9]*)", string):
    print(item.groupdict())

Output

{'host': '146.204.224.152', 'user_name': 'lubo233'}
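
A lookbehind is a better fit when the user name is searched for on its own, so that no other part of the pattern has to get past " - ". A minimal sketch, assuming the host is always at the start of the string and the user name begins with at least one lowercase letter; the names `host`, `user` and `record` are only illustrative.

```python
import re

string = '146.204.224.152 adfadfa - lubo233'

# Host: four dot-separated groups of 1 to 3 digits at the start of the string.
host = re.match(r"\d{1,3}(?:\.\d{1,3}){3}", string)

# User name: searched independently; the lookbehind only checks that
# ' - ' sits just before the match, without consuming it.
user = re.search(r"(?<= - )[a-z]+[0-9]*", string)

record = None
if host and user:
    record = {'host': host.group(), 'user_name': user.group()}
print(record)
```

With two separate searches, the text between the host and the separator never needs to be described at all.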

Following your train of thought, "once the regex engine has consumed the IP address pattern", the following substring is left: " - lubo233". In this case, isn't " - " preceding the user_name? Or am I not understanding regex properly? — Kane Chew, Nov 28 '20 at 10:35
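@KaneChew A positive lookbehind does not consume any of the string. As you said, the string left is " - lubo233", and you are telling the engine that the user name must start right there while also being preceded by " - "; what actually precedes that position is the IP address. – Dani Mesejo, Nov 28 '20 at 10:44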
