Efficient regex for reducing only fully duplicate phrases separated by a specific delimiter in Python

Question

Suppose I have a shopping list that looks as follows:

lines="""
''[[excellent wheat|excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
''[[spicy chips|spicy fries/chips]]''
"""

Only when the product is a complete duplicate should the product on the shopping list be reduced to a non-duplicate, i.e., ''[[excellent wheat|excellent wheat]]'' -> ''[[excellent wheat]]''. Non-complete duplicates should remain as they are.

I've looked through some other threads and cannot find an ideal solution.

I'd like to evaluate parts of the multi-line string line-by-line like this,

lines = lines.splitlines()
for i in range(len(lines)):
    lines[i] = regexHere(lines[i])  # regex expression goes here
    print(lines[i])

and I wish for the following output:

''[[excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
''[[spicy chips|spicy fries/chips]]''

Thanks.

EDIT: This worked for the given example. What if the shopping list were mixed in with random lines in other formats?

lines="""
==///Listings==/?
assadsadsadsa
adasdsad
</test>
''[[excellent wheat|excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
</separation>
Remember to purchase this if on offer
''[[jub|jub/ha]]'',
''[[barley|barley/hops]]''
zcxcxzcxz
"""

Why do you want to use regex for this? Exact matches can be efficiently handled by string methods. Regex is only really needed when dealing with *expressions*, not constants. — MisterMiyagi, Aug 17 '20 at 19:55

score 2 · Answer 1 · answered Aug 17 '20 at 19:53

For this, you really don't need regex – you can just use straight string manipulation:

lines="""
''[[excellent wheat|excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
''[[spicy chips|spicy fries/chips]]''
"""

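for line in lines.strip().split("\n"):
    # Split each entry at the delimiter; this assumes exactly one '|' per line.
    first, second = line.split('|')

    # Drop the leading ''[[ and the trailing ]]'' (four characters each) before comparing.
    if first[4:] == second[:-4]:
        print("''[[{}]]''".format(first[4:]))
    else:
        print(line)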

"""
Output:
''[[excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
''[[spicy chips|spicy fries/chips]]''
"""

dawg · Accepted Answer · 2020-08-17T21:41:29.473

You can do:

lines="""
''[[excellent wheat|excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
''[[spicy chips|spicy fries/chips]]''
"""

>>> import re
>>> print(re.sub(r'(?<=\[)([^[]*)(?=\|)\|\1(?=\])', r'\1', lines))

''[[excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
''[[spicy chips|spicy fries/chips]]''
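
In case the pattern is hard to read, here is the same idea spelled out with `re.VERBOSE`. This is only a sketch: I dropped the `(?=\|)` lookahead, since the literal `\|` right after it already requires the pipe, and the names `DUPLICATE` and `reduce_duplicates` are made up for this example.

```python
import re

DUPLICATE = re.compile(r"""
    (?<=\[)     # the match has to start right after an opening bracket
    ([^\[]*)    # group 1: the first phrase, with no opening bracket in it
    \|          # the delimiter
    \1          # the same phrase again, via a backreference to group 1
    (?=\])      # and a closing bracket has to follow
""", re.VERBOSE)

def reduce_duplicates(text):
    """Replace 'phrase|phrase' inside brackets with a single 'phrase'."""
    return DUPLICATE.sub(r"\1", text)

for line in [
    "''[[excellent wheat|excellent wheat]]''",
    "''[[spicy chips|spicy fries/chips]]''",
    "zcxcxzcxz",
]:
    print(reduce_duplicates(line))
```

Because `sub` leaves non-matching text alone, the same function can be called on one line at a time or on the whole multi-line string.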

If you want more efficiency, you can combine a simpler regex (without the backreference) with some Python string processing. I honestly don't know if this is faster or not:

lines="""
==///Listings==/?
assadsadsadsa
adasdsad
</test>
''[[excellent wheat|excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
</separation>
Remember to purchase this if on offer
''[[jub|jub/ha]]'',
''[[barley|barley/hops]]''
zcxcxzcxz
""".splitlines()

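import re

# Python 3.8+ because of the walrus operator (:=). Split it into two lines if you can't use that.
for i, line in enumerate(lines):
    # Grab whatever sits between [[ and ]] on this line, if anything.
    if m := re.search(r'(?<=\[\[)([^\]\[]*)(?=\]\])', line):
        x = m.group(1).partition('|')
        # x is (before, separator, after); equal halves mean the entry is a full duplicate.
        if x[0] == x[2]:
            span = m.span()
            lines[i] = line[0:span[0]] + x[0] + line[span[1]:]

print('\n'.join(lines))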

Prints:

==///Listings==/?
assadsadsadsa
adasdsad
</test>
''[[excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
</separation>
Remember to purchase this if on offer
''[[jub|jub/ha]]'',
''[[barley|barley/hops]]''
zcxcxzcxz
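
If you'd rather measure than guess, something like the following `timeit` sketch could compare the two approaches on your own data. The function names and the repeated sample are assumptions for illustration, the second function avoids the walrus so it also runs before Python 3.8, and I'm not quoting any timings because they will depend on your input and interpreter:

```python
import re
import timeit

# Both patterns are taken from the snippets above (with the '[' escaped).
BACKREF = re.compile(r'(?<=\[)([^\[]*)(?=\|)\|\1(?=\])')
INNER = re.compile(r'(?<=\[\[)([^\]\[]*)(?=\]\])')

def with_backreference(lines):
    """Approach 1: let the regex find and collapse the duplicate."""
    return [BACKREF.sub(r'\1', line) for line in lines]

def with_partition(lines):
    """Approach 2: simple regex plus str.partition."""
    out = []
    for line in lines:
        m = INNER.search(line)
        if m:
            first, _, second = m.group(1).partition('|')
            if first == second:
                start, end = m.span()
                line = line[:start] + first + line[end:]
        out.append(line)
    return out

# Hypothetical workload: a few lines from the question, repeated.
sample = [
    "==///Listings==/?",
    "''[[excellent wheat|excellent wheat]]''",
    "''[[brilliant corn|Tom's brilliant corn]]''",
    "''[[jub|jub/ha]]'',",
    "zcxcxzcxz",
] * 200

# Sanity check before timing: both approaches should agree on this sample.
assert with_backreference(sample) == with_partition(sample)

for fn in (with_backreference, with_partition):
    seconds = timeit.timeit(lambda: fn(sample), number=20)
    print(fn.__name__, round(seconds, 4))
```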

Thank you very much for the answer! — Drummermean, Aug 17 '20 at 20:24
