find newline with words starting with underscore with specific pattern

Question

I need to find the following in C code using a Python regular expression, but somehow I could not write the pattern properly.

if(condition)
     /*~T*/
     {
        /*~T*/
        _getmethis = FALSE;
     /*~T*/
     }
..........
/*~T*/
     _findmethis = FALSE;
......
                    /*~T*/
_findthat = True;

I need to find all variables that follow /*~T*/ and start with an underscore, and write them to a new file. My code does not find them: I tried several regex patterns and the output file is always empty.

import re
fh = open('filename.c', "r")
output = open("output.txt", "w")
pattern = re.compile(r'(\/\*~T\*\/)(\s*?\n\s*)(_[aA-zZ]*)')
for line in fh:
    for m in re.finditer(pattern, line):
        output.write(m.group(3))
        output.write("\n")

output.close()

`[aA-zZ]` [does not only match letters](https://stackoverflow.com/a/29771926/3832970), it also matches `[`, `\`, `]`, `^`, `_` and `` ` ``. You must have meant `[a-zA-Z]`. All you need to do is remove `for line in fh:` and use `re.finditer(pattern, fh.read())` – Wiktor Stribiżew, Nov 21 '18 at 16:55
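
The point about the character class can be checked directly. The following is only a minimal sketch (the names `loose`, `strict` and `extras` are made up for illustration); it lists the printable ASCII characters that `[aA-zZ]` accepts but `[a-zA-Z]` does not:

```python
import re
import string

# Every printable ASCII character accepted by each character class.
loose = [c for c in string.printable if re.fullmatch(r'[aA-zZ]', c)]
strict = [c for c in string.printable if re.fullmatch(r'[a-zA-Z]', c)]

# Characters the loose class lets through that are not letters.
extras = sorted(set(loose) - set(strict))
print(extras)  # ['[', '\\', ']', '^', '_', '`']
```

The range `A-z` runs over ASCII codes 65 to 122, which includes the six punctuation characters that sit between `Z` and `a`.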

Patrick Artner · Answer 1 · 2018-11-21T18:03:11.943

1

The reason you do not find anything is that your pattern crosses multiple lines but you are only looking at your file one line at a time.

Consider using this:

t = """
if(condition)
     /*~-*/
     {
        /*~T*/
        _getmethis = FALSE;
     /*~-*/
     }
..........
/*~T*/
     _findmethis = FALSE;

     /*~T*/
     do_not_findme_this = FALSE;
"""

import re
# [a-zA-Z] matches letters only; [aA-zZ] would also match [ \ ] ^ _ and `
pattern = re.compile(r'/\*~T\*/.*?\n\s+(_[a-zA-Z]*)', re.MULTILINE|re.DOTALL)
for m in re.finditer(pattern, t):  # use the whole file here - not line-wise
    print(m.group(1))

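The pattern is compiled with two flags. re.DOTALL makes the dot . match newlines as well (by default it does not), so the non-greedy .*? can cross the line break while keeping the gap between /*~T*/ and the following group as small as possible. re.MULTILINE only changes how ^ and $ behave; since the pattern uses neither, it has no effect here.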

Printout:

_getmethis
_findmethis
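
To see why the line-wise loop from the question finds nothing, here is a small sketch. The `source` string is a hypothetical stand-in for the C file, and `io.StringIO` is used only so the example runs without a real file; iterating over it yields one line at a time, just like iterating over an open file.

```python
import io
import re

# Hypothetical stand-in for the contents of the C file.
source = "/*~T*/\n    _findmethis = FALSE;\n"
pattern = re.compile(r'/\*~T\*/\s*\n\s*(_[a-zA-Z]*)')

# Line-wise: every chunk ends at its own newline, so the marker and the
# variable never appear in the same string.
per_line = [m.group(1)
            for line in io.StringIO(source)
            for m in pattern.finditer(line)]

# Whole text: the match is free to span the line break.
whole = [m.group(1) for m in pattern.finditer(source)]

print(per_line)  # []
print(whole)     # ['_findmethis']
```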


edited Nov 21 '18 at 18:03

answered Nov 21 '18 at 15:56

Patrick Artner


Silly of me, I always checked the regex but not the Python. I will try this – fastlearner Nov 21 '18 at 16:00
But this also finds words where the underscore is in the middle of a variable name – fastlearner Nov 21 '18 at 17:46
@fastlearner Then adjust the pattern? So the `(_[aA-zZ]*)` is only allowed after a newline and spaces? See edit ... if you want to play with regex, use http://regex101.com and set it to Python mode - copy your text and pattern into it and modify the pattern until it fits. Your example text did not contain any pattern "to be excluded" ... – Patrick Artner Nov 21 '18 at 18:05

score 1 · Accepted Answer · answered Nov 21 '18 at 18:25

You need to read the file in as a whole with fh.read() and make sure you amend the pattern to only match letters since [aA-zZ] matches more than just letters.

The pattern I suggest is

(/\*~T\*/)([^\S\n]*\n\s*)(_[a-zA-Z]*)

See the regex demo. Note that I deliberately subtracted \n from the first \s* to make matching more efficient.
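
As a rough illustration of what the `[^\S\n]` class does (the `sample` string is invented for this sketch): it reads as "not a non-whitespace character and not a newline", which leaves whitespace other than `\n`.

```python
import re

# Whitespace other than the newline character.
horizontal = re.compile(r'[^\S\n]*')

sample = " \t \n  next"

# Stops in front of the newline.
print(repr(horizontal.match(sample).group()))      # ' \t '

# Plain \s* runs across the newline and into the next line's indentation.
print(repr(re.match(r'\s*', sample).group()))      # ' \t \n  '
```

Because `[^\S\n]*` can never consume the newline, the engine does not have to backtrack out of it before matching the literal `\n` that follows in the pattern.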

When reading files in, it is more convenient to use with so that you do not have to use .close():

import re
# The pattern suggested above: letters only, and no newline consumed before the required \n
pattern = re.compile(r'(/\*~T\*/)([^\S\n]*\n\s*)(_[a-zA-Z]*)')

with open('filename.c', "r") as fh:
    contents = fh.read()
    with open("output.txt", "w") as output:
        output.write("\n".join([x.group(3) for x in pattern.finditer(contents)]))

score 0 · Answer 3 · answered Nov 21 '18 at 20:33

This is my final version, where I also try to avoid duplicates.

import re

# Letters only: [aA-zZ] would also match [ \ ] ^ _ and `
pattern = re.compile(r"(/\*~T\*/)(\s*?\n\s*)(_[a-zA-Z]*)")
createlist = []  # names already written, used to skip duplicates

with open('filename.c', "r") as fh:
    filecontent = fh.read()  # read the whole file so matches can span lines

with open("output.txt", "w") as output:
    for m in pattern.finditer(filecontent):
        if m.group(3) not in createlist:
            createlist.append(m.group(3))
            output.write(m.group(3))
            output.write('\n')
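
As an alternative to tracking seen names in `createlist`, duplicates can be dropped with `dict.fromkeys`, which keeps the first occurrence of each key in insertion order (guaranteed since Python 3.7). This is only a sketch: the `filecontent` string below is a made-up sample, not the real file, and the pattern uses the letters-only class.

```python
import re

# Hypothetical sample text standing in for the contents of filename.c.
filecontent = "/*~T*/\n_a = 1;\n/*~T*/\n  _b = 2;\n/*~T*/\n_a = 3;\n"
pattern = re.compile(r'/\*~T\*/[^\S\n]*\n\s*(_[a-zA-Z]*)')

# All matches in file order, duplicates included.
names = [m.group(1) for m in pattern.finditer(filecontent)]

# Drop duplicates while keeping the order of first appearance.
unique = list(dict.fromkeys(names))
print(unique)  # ['_a', '_b']
```

Writing the result is then a single `output.write("\n".join(unique))`.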
