Python re.findall regex and text processing

Question

I'm looking to find and modify some sql syntax around the convert function. I want basically any convert(A,B) or CONVERT(A,B) in all my files to be selected and converted to B::A.

So far I tried selecting them with re.findall(r"\bconvert\b\(.*?,.*\)", l, re.IGNORECASE) But it's only returning a small selection out of what I want and I also have trouble actually manipulating the A/B I mentioned.

For example, a sample line (note the nested structure here is irrelevant, I'm only getting the outer layer working if possible)

convert(varchar, '/' || convert(nvarchar, es.Item_ID) || ':' || convert(nvarchar, o.Option_Number) || '/') as LocPath

...should become...

'/' || es.Item_ID::nvarchar || ':' || o.Option_Number::nvarchar || '/' :: varchar as LocPath

Example2:

SELECT LocationID AS ItemId, convert(bigint, -1),

...should become...

SELECT LocationID AS ItemId, -1::bigint,

I think this should be possible with some kind of re.sub with groups and currently have a code structure inside a for each loop where line is the each line in the file:

matchConvert = ["convert(", "CONVERT("]
a = next((a for a in matchConvert if a in line), False)
if a:
    print("convert() line")
    #line = re.sub(re.escape(a) + r'', '', line)

Edit: In the end I went with a non re solution and handled each line by identifying each block and manipulate them accordingly.

I see the convert can be nested, if so, regex will not work here. — Bharel, Jul 19 '22 at 21:10
I seem to have missed one space here: `|| ':'||`. It should have been:`|| ':' ||` — Иван Балван, Jul 19 '22 at 21:16
@Bharel: yes it can, if we need to handle up to (say) N=3 levels of nesting, we just process the line and apply the regex 3 times. — smci, Jul 19 '22 at 21:36
Related: [Matching Nested Structures With Regular Expressions in Python](https://stackoverflow.com/questions/1099178/matching-nested-structures-with-regular-expressions-in-python) — smci, Jul 19 '22 at 21:39
Yeah If I need to nest this I can simply run it multiple times. Right now I'm still just looking to get the outer layer working. — neveratdennys, Jul 19 '22 at 21:42
Can you please show us a *minimal* example of input that causes your regex to fail? (I think drilling down to this will help you fix it.) — smci, Jul 19 '22 at 22:08
I'm making attempts based on the Answer below and this is where I'm at. `import re string=" WHERE le.LocPath not like '%/' || convert(nvarchar, es.Item_ID) || ':%'" before,start,mid_1,mid_2,end=re.search(r'''(.*)(\w+?\()(.+)(?<=\)),(.+)(\).*)''',string,re.X).groups() result=before+mid_2.lstrip()+':: '+mid_1+end` Sorry about the formatting, I'm not sure how to make it look better in comments. — neveratdennys, Jul 19 '22 at 22:44
Still feeling mixed up about this one, the first block that proceeds the "convert()" function is optional and may not always exist, I'm having a lot of trouble picking that up on top of the syntax from the answer. Adding a block before the first one would also ruin the first example case by wrongly getting "convert" in the first block. @ИванБалван — neveratdennys, Jul 19 '22 at 22:54
@smci First you assume the nesting levels, second of all, applying a regex multiple times will not achieve anything - you have to nest it by yourself. Since you don't have a target nesting level, what you're asking for is impossible. — Bharel, Jul 19 '22 at 22:56
@Bharel I thought the nesting isn't important here for a few reasons, it comes up very rarely so it's actually fine to not touch nesting, also because running it each time would remove the outer target structure (e.g. convert(A, convert(B,C)) => convert(B,C)::A). If it's like this, simply running the same regex again would be sufficient right (convert(B,C)::A => C::B::A)? — neveratdennys, Jul 19 '22 at 23:03
@neveratdennys Not really, you have to detect and track the opening and closing brackets, something a regex can't do for an unknown nested structure. It's [theoretically impossible](https://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages) for a regular expression. — Bharel, Jul 19 '22 at 23:07
@Bharel I see. I'll change my description to specify that I don't want nested structures to be touched. Appreciate the input! — neveratdennys, Jul 19 '22 at 23:08
@Bharel: I'm not "asking for" anything, the OP first mentioned nesting, gave a nested example, then (after my comment), edited it to remove all mention of it. And any finite block of code will (self-evidently) have a finite nesting limit, I merely gave an example for the case N=3 (OP only asked about N=2 anyway, then changed their mind). And my reaction as a person who has occasionally needed to sink hours/days into debugging problem regexes, was that a pragmatic hack sometimes solves the specific subcase of the problem, and saves time. We're not writing a canonical LR(1) compiler. — smci, Jul 20 '22 at 22:24

Иван Балван · Answer 1 · 2022-07-20T15:44:01.600

1

The task:

Swap the parameters of all the 'convert' functions in the given string. Parameters can contain any character, including nested 'convert' functions.

A solution based on the re module:

def convert_re(s):
    import re
    start,part_1,part_2,end=re.search(r'''
                               (.*?)   
                               convert\(
                               ([^,)(]+\(.+?\)[^,)(]*|[^,)(]+)
                               ,
                               ([^,)(]+\(.+?\)[^,)(]*|[^,)(]+)
                               \)
                               (.*)                                     
                                       ''',s,re.X).groups()


    result=start+part_2.lstrip()+' :: '+part_1+end
    return result

def multi_convert_re(s):
    converts=s.count('convert')
    for n in range(converts):
        s=convert_re(s)
    return s

Discription of the 'convert_re' function:

Regular expression:

start is the first group with what comes before 'convert'
Then follows convert\() which has no group and contains the name of the function and the opening '('
part_1 is the second group ([^,)(]+\(.+?\)[^,)(]*|[^,)(]+). This should match the first parameter. It can be anything except - ,)(, or a function preceded by anything except ,)(, optionally followed by anything except ,)( and with anything inside (except a new line)
Then follows a comma ,, which has no group
part_2 is the third group and it acts like the second, but should catch everything what's left inside the external function
Then follows ), which has no group
end is the fourth group (.*) with what's left before the new line.

The resulting string is then created by swapping part_1 and part_2, putting ' :: ' between them, removing spaces on the left from part_2 and adding start to the beginning and end to the end.

Description of the 'multi_convert_re' function

Repeatedly calls 'convert_re' function until there are no "convert" left.

Notes:

N.B.: The code implies that the 'convert' function in the string has exactly two parameters.
The code works on the given examples, but I'm afraid there may still be unforeseen flaws when it comes to other examples. Please tell, if you find any flaws.
I have provided another solution presented in another answer that is not based on the re module. It may turn out that the results will be different.

edited Jul 20 '22 at 15:44

answered Jul 19 '22 at 22:10

Иван Балван

575
3
6

wow that's pretty impressive I appreciate it. There are other cases `SELECT LocationID AS ItemId, convert(bigint, -1),`, here's another one ` WHERE le.LocPath not like '%/' || convert(nvarchar, es.Item_ID) || ':%' ` which isn't picked up by this yet. Also the result should be without the `start+` in the beginning. – neveratdennys Jul 19 '22 at 22:22
It looks like the other cases not working is just having other text before the function name (e.g. textext convert(A,B) texttext), I'll add another group before start to cover for that and see if that works. `before,start,mid_1,mid_2,end=re.search(r'''(.*)(\w+?\()(.+)(?<=\)),(.+)(\).*)''',string,re.X).groups() result=before+mid_2.lstrip()+':: '+mid_1+end` It looks like that's not quite working still. – neveratdennys Jul 19 '22 at 22:38
Make the first group 'before' non-greedy, add '?' after '*' : `(.*?)` – Иван Балван Jul 19 '22 at 23:02
Looks like that's still not picking up the new case, but that does fix the issue I mentioned about breaking the original case. – neveratdennys Jul 19 '22 at 23:07
Also, this regular expression will match only if the first parameter is a function and in a string ends with ')'. This is a rather narrow search... If there may be other cases where the first parameter is not a function, there should be another regular expression – Иван Балван Jul 19 '22 at 23:08
I see, I thought this is a very particular request and honestly I would be satisfied as long as this picked up a reasonable portion of the conversions I'm doing. The rest I can handle by hand, thank you! – neveratdennys Jul 19 '22 at 23:14
It must match without the 'before' group if what is before is separated from the function name by a character other than '\w' – Иван Балван Jul 19 '22 at 23:27
This should work if the first parameter is not just a function: re.search(r'''(\w+?\() (.+?)(?<!\(),(.+) (\).*)''',string,re.X) – Иван Балван Jul 19 '22 at 23:39
That's my bad, this first parameter in convert() is actually almost always just a word. I've changed the post to reflect that. Your expression seem to be missing a bracket somewhere, I'm not super sure which group it's meant to be if you can clarify. – neveratdennys Jul 19 '22 at 23:49
`before,start,mid_1,mid_2,brac,end=re.search(r'''(.*?)(\w+?\()(.+)(?<=),(.+)(\))(.*)''',string,re.X).groups() result=before+mid_2.lstrip()+':: '+mid_1+end` This seem to work for a lot of the cases, I think I understand it better now, but this seem to give some false positives and mess it up. – neveratdennys Jul 20 '22 at 00:07
Yes, indeed, "\" was missing in my last example. The correct: `re.search(r'''(\w+?\()(.+?)(?<!\(),(.+)(\).*)''',string,re.X)` – Иван Балван Jul 20 '22 at 00:25
or simply :`re.search(r'''(\w+?\()(.+?),(.+)(\).*)''',string,re.X)` That will fail if the first parameter is a function and have more than one parameters, separated by commas. I'm working on it) – Иван Балван Jul 20 '22 at 00:32
This should handle cases where the first parameter is a function with more than one parameters separated by commas: `re.search(r'''(\w+?\()([\w]+\(.+?\)|[\w]+),(.+)(\).*)''',string,re.X)` – Иван Балван Jul 20 '22 at 01:05
But this will fail if the first parameter is a function that has another function as a parameter. There also may be problems with other parts - the second part of the function and the ending 'as LocPath'. However, I will change my original answer to this last option, as it seems better. – Иван Балван Jul 20 '22 at 01:12
I would prefer my answer not to be accepted, it doesn't quite fit and has its drawbacks. It just occurred to me what you were trying to do - swap the parameters in all the 'convert' functions. There must be a solution. – Иван Балван Jul 20 '22 at 02:47
I found an error in the code and updated the answer... again – Иван Балван Jul 20 '22 at 02:59
I changed the answer, now it contains two functions. One of which should be able to work with nested "convert"s. Please try it and tell me if it works. – Иван Балван Jul 20 '22 at 06:58
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/246620/discussion-between---and-neveratdennys). – Иван Балван Jul 20 '22 at 15:45

bob · Answer 2 · 2022-07-20T03:17:18.793

1

This may be an X/Y problem, meaning you’re asking how to do something with Regex that may be better solved with parsing (meaning using/modifying/writing a SQL parser). An indication that this is the case is the fact that “convert” calls can be nested. I’m guessing Regex is going to be more of a headache than it’s worth here in the long run if you’re working with a lot of files and they’re at all complicated.

edited Jul 20 '22 at 03:17

answered Jul 20 '22 at 03:08

bob

419
4
11

1

Yeah I'm not expecting a solution to cover all potential cases, and regex may have limitations given a lot of variation between the parameters here. I'll be testing with a different strategy and see if that works easier. – neveratdennys Jul 20 '22 at 18:15

score 1 · Answer 3 · answered Jul 20 '22 at 15:30

The task:

Swap the parameters of all the 'convert' functions in this given. Parameters can contain any character, including nested 'convert' functions.

A solution:

def convert_py(s):
    #capturing start:
    left=s.index('convert')
    start=s[:left]
    #capturing part_1:
    c=0
    line=''
    for n1,i in enumerate(s[left+8:],start=len(start)+8):
        if i==',' and c==0:
            part_1=line
            break
        if i==')':
            c-=1
        if i=='(':
            c+=1
        line+=i
    #capturing part_2:
    c=0
    line=''
    for n2,i in enumerate(s[n1+1:],start=n1+1):
        if i==')':
            c-=1
        if i=='(':
            c+=1
        if c<0:
            part_2=line
            break
        line+=i
    #capturing end:
    end=s[n2+1:]
    #capturing result:
    result=start+part_2.lstrip()+' :: '+part_1+end
    return result

def multi_convert_py(s):
    converts=s.count('convert')
    for n in range(converts):
        s=convert_py(s)
    return s

Notes:

Unlike the solution based on the re module, which is presented in another answer - this version should not fail if there are more than two parameters in the 'convert' function in the given string. However, it will swap them only once, for example: convert(a,b, c) --> b, c : a
I am afraid that unforeseen cases may arise that will lead to failure. Please tell if you find any flaws

score 0 · Accepted Answer · answered Jul 22 '22 at 17:11

Here's my solution based on @Иван-Балван's code. Breaking this structure into blocks makes further specification a lot easier than I previously thought and I'll be using this method for a lot of other operations as well.

# Check for balanced brackets
def checkBracket(my_string):
    count = 0
    for c in my_string:
        if c == "(":
            count+=1
        elif c == ")":
            count-=1
    return count


# Modify the first convert in line
# Based on suggestions from stackoverflow.com/questions/73040953
def modifyConvert(l):
    # find the location of convert()
    count = l.index('convert(')

    # select the group before convert() call
    before = l[:count]

    group=""
    n1=0
    n2=0
    A=""
    B=""
    operate = False
    operators = ["|", "<", ">", "="]
    # look for A group before comma
    for n1, i in enumerate(l[count+8:], start=len(before)+8):
        # find current position in l
        checkIndex = checkBracket(l[count+8:][:n1-len(before)-8])
        if i == ',' and checkIndex == 0:
            A = group
            break
        group += i

    # look for B group after comma
    group = ""
    for n2, i in enumerate(l[n1+1:], start=n1+1):
        checkIndex = checkBracket(l[count+n1-len(before):][:n2-n1+1])
        if i == ',' and checkIndex == 0:
            return l
        elif checkIndex < 0:
            B = group
            break
        group += i
        
        # mark operators
        if i in operators:
            operate = True

    # select the group after convert() call
    after = l[n2+1:]

    # (B) if it contains operators
    if operate:
        return before + "(" + B.lstrip() + ') :: ' + A + after
    else:
        return before + B.lstrip() + '::' + A + after


# Modify cast syntax with convert(a,b). return line.
def convertCast(l):

    # Call helper for nested cases
    i = l.count('convert(')
    while i>0:
        i -= 1
        l = modifyConvert(l)

    return l

Python re.findall regex and text processing

4 Answers4