How to get semicolons except in parentheses with regex

Question

For the following C source code piece:

for (j=0; j<len; j++) a = (s) + (4); test = 5;

I want to insert \n after every semicolon ; except those inside parentheses, using Python's regex module.

For the following C source code piece:

for (j=0; j<(len); (j++)) a = (s) + (4); test = 5;

The regex ;(?![^(]*\)) works on the first piece of code but not on this one.
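A quick way to see what the lookahead pattern accepts is to list the positions it matches in each snippet. The sketch below is only an illustration; the names LOOKAHEAD, matched_positions, flat and nested are hypothetical and not part of the question. The lookahead only asks whether a ")" appears before the next "(", so it finds the two top-level semicolons in the snippet without nested parentheses, but accepts all four semicolons once parentheses are nested.

```python
import re

# The lookahead pattern from the question.
LOOKAHEAD = re.compile(r';(?![^(]*\))')

def matched_positions(code):
    """Return the index of every ';' that the lookahead pattern accepts."""
    return [m.start() for m in LOOKAHEAD.finditer(code)]

flat = 'for (j=0; j<len; j++) a = (s) + (4); test = 5;'
nested = 'for (j=0; j<(len); (j++)) a = (s) + (4); test = 5;'

print(len(matched_positions(flat)))    # 2: only the top-level semicolons
print(len(matched_positions(nested)))  # 4: the semicolons inside for (...) match too
```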

`(?<={)(.*?)(;)` would work. https://regex101.com/r/OBKJCr/1 — MonkeyZeus, Feb 27 '20 at 17:48
MonkeyZeus - thanks, but this does not work. I just want to get the semicolons, without using a bracket — aria, Feb 27 '20 at 17:55
@aria No, that [does not work](https://regex101.com/r/SvJAFP/2). — Wiktor Stribiżew, Feb 27 '20 at 18:00
@aria Well good luck to you then because regex can't count so you have no way of tracking whether a parenthesis has closed or not. — MonkeyZeus, Feb 27 '20 at 18:04
Wiktor Stribiżew - yes, you are right, but this works -> ;(?![^\/\/]*\)) — aria, Feb 27 '20 at 18:14
I would recommend researching the answers to this question: https://stackoverflow.com/q/36953/816536 — Greg A. Woods, Feb 27 '20 at 20:10
So you have tagged "python" because that is how you are trying to solve the issue. However, from the sample line you are dealing with, it looks like some C language variant. And I assume the underlying issue you are looking at is that "test = 5" looks like it is part of the "for" loop, but really isn't if you look closely. Perhaps you are trying to fix lint-like errors or something. Most IDEs offer this kind of formatting per various standards. I recommend you let the IDE do it instead of writing a Python program. — Frank Merrow, Feb 27 '20 at 20:34

score 2 · Answer 1 · answered Feb 27 '20 at 22:30

You need to count opened and closed brackets across the regex matches and insert the newline only when every opened bracket has been closed again, i.e. when the count is zero. This is done in replacement(), which is called on each match of the regex. The regex matches "(" and ")" just for counting, and ";" to either leave it alone or insert a newline after it.

import re

def replacement(matched_list):
    global bracket_count
    matched_char=matched_list.group(1)
    if "(" in matched_char:
        bracket_count += 1
        # don't replace, just return what was found
        return matched_char 
    elif ")" in matched_char:
        bracket_count -= 1
        # don't replace, just return what was found
        return matched_char 
    elif ";" in matched_char:
        # outside all brackets: insert \n after the semicolon
        if bracket_count == 0:
            return ';\n'
        # inside brackets: leave the semicolon itself intact
        else:
            return ';'

# example 1
bracket_count=0
code="for (j=0; j<len; j++) a = (s) + (4); test = 5;"
new_code = re.sub('([();] ?)', replacement, code)
print(code)
print(new_code)

# example 2
bracket_count=0
code="for (j=0; j<(len); (j++)) a = (s) + (4); test = 5;"
new_code = re.sub('([();])', replacement, code)
print(code)
print(new_code)

# example 3
bracket_count=0
code="for (j=0; j<len; j++) test = 5; a = (s) + (4);"
new_code = re.sub('([();])', replacement, code)
print(code)
print(new_code)

Result:

for (j=0; j<len; j++) a = (s) + (4); test = 5;
for (j=0; j<len; j++) a = (s) + (4);
test = 5;

for (j=0; j<(len); (j++)) a = (s) + (4); test = 5;
for (j=0; j<(len); (j++)) a = (s) + (4);
test = 5;
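A variant of the same counting idea, offered only as a sketch: it keeps the counter in a closure instead of a global, so nothing has to be reset between calls, and it returns every token other than a top-level semicolon unchanged, so the spaces inside the for header are preserved. The names split_top_level and depth are hypothetical and do not come from the answer.

```python
import re

def split_top_level(code):
    """Insert a newline after every ';' that is not inside parentheses."""
    depth = 0  # current parenthesis nesting depth

    def replacement(match):
        nonlocal depth
        token = match.group(0)  # one of ( ) ; plus an optional trailing space
        if token[0] == '(':
            depth += 1
        elif token[0] == ')':
            depth -= 1
        elif depth == 0:
            # top-level semicolon: drop the optional trailing space, add newline
            return ';\n'
        return token  # everything else is returned unchanged

    return re.sub(r'[();] ?', replacement, code)

print(split_top_level('for (j=0; j<len; j++) test = 5; a = (s) + (4);'))
```

Because the pattern consumes at most one space after each token, a top-level semicolon followed by several spaces would still leave the extra spaces at the start of the next line.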

I would remove all the *return* statements from the *if* content, instead do a single `return matched_char` at the end of `replacement`. Make an exception of `bracket_count == 0` since that does change the input. https://repl.it/repls/VivaciousRepulsivePyramid — 3limin4t0r, Feb 28 '20 at 02:21
@3limin4t0r: nope, the " single 'return matched_char' at the end of replacement" wouldn't remove the space after the ; as the "return ';'" does. — 2d4d, Feb 28 '20 at 22:06
*"I want to insert `\n` after semicolons `;` except in parenthesis using python code regex module."* The question doesn't say anything about removing spaces. Those are just assumptions. — 3limin4t0r, Feb 28 '20 at 23:31

score 1 · Accepted Answer · answered Feb 27 '20 at 21:37

Use a custom replacement function:

re.sub(pattern, repl, string, count=0, flags=0)
...
If repl is a function, it is called for every non-overlapping occurrence of pattern.

The function repl is called for every occurrence of a single ; (together with any whitespace after it) and for parenthesized expressions. Since \(.*\) is greedy and re.sub does not find overlapping sequences, the very first opening parenthesis will trigger a full match all the way up to the last closing parenthesis.

import re

def repl(m):
    contents = m.group(1)
    if '(' in contents:
        return contents
    return ';\n'

str1 = 'for (j=0; j<len; j++) a = (s) + (4); test = 5;'
str2 = 'for (j=0; j<(len); (j++)) a = (s) + (4); test = 5;'

print(re.sub(r'(;\s*|\(.*\))', repl, str1))
print(re.sub(r'(;\s*|\(.*\))', repl, str2))

Result:

for (j=0; j<len; j++) a = (s) + (4);
test = 5;

for (j=0; j<(len); (j++)) a = (s) + (4);
test = 5;

Mission accomplished, for your (very little) sample data.

But wait!

A small – but valid – change in one of the examples

str1 = 'for (j=0; j<len; j++) test = 5; a = (s) + (4);'

breaks this with a wrong output:

for (j=0; j<len; j++) test = 5; a = (s) + (4);
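The cause becomes visible when the parenthesis alternative is run on its own. The following is a small illustrative sketch (the helper name greedy_paren_span is hypothetical): the greedy \(.*\) runs from the first "(" to the last ")" on the line, so the top-level semicolon after test = 5 ends up inside that one match and is never offered to repl as a separate semicolon.

```python
import re

def greedy_paren_span(code):
    """Return the text matched by the greedy parenthesis pattern, or None."""
    match = re.search(r'\(.*\)', code)
    return match.group(0) if match else None

print(greedy_paren_span('for (j=0; j<len; j++) test = 5; a = (s) + (4);'))
# (j=0; j<len; j++) test = 5; a = (s) + (4)
```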

There is no way around it: you need a state machine instead.

def state_match(text):
    parentheses = 0      # current nesting depth
    drop_space = False   # True only right after a top-level ';'
    result = ''
    for character in text:
        if character == ' ' and drop_space:
            # skip the single space that directly follows a top-level ';'
            drop_space = False
            continue
        # any other character ends the "just after ';'" state
        drop_space = False
        if character == '(':
            parentheses += 1
        elif character == ')':
            parentheses -= 1
        if character == ';' and parentheses == 0:
            # semicolon outside all parentheses: break the line here
            result += ';\n'
            drop_space = True
        else:
            result += character
    return result

str1 = 'for (j=0; j<len; j++) a = (s) + (4); test = 5;'
str2 = 'for (j=0; j<(len); (j++)) a = (s) + (4); test = 5;'
str3 = 'for (j=0; j<len; j++) test = 5; a = (s) + (4);'

print(state_match(str1))
print(state_match(str2))
print(state_match(str3))

results correctly in:

for (j=0; j<len; j++) a = (s) + (4);
test = 5;

for (j=0; j<(len); (j++)) a = (s) + (4);
test = 5;

for (j=0; j<len; j++) test = 5;
a = (s) + (4);
