1

I have the following test string:

Module([Assign([Name('a', Store())], Num(2)), Assign([Name('b', Store())], Num(3)), Assign([Name('c', Store())], Str('Hello')), Assign([Name('x', Store())], BinOp(Name('a', Load()), Add(), Name('b', Load()))), Assign([Name('x', Store())], Name('a', Load())), Expr(Call(Name('print', Load()), [Name('a', Load())], [], None, None)), For(Name('i', Store()), Call(Name('range', Load()), [Num(10)], [], None, None), [Expr(Call(Name('print', Load()), [Name('a', Load())], [], None, None))], [])])

I am trying to get all loaded variable names from it. My regexp is

[a-z]+(?=', Load)

The result is the following: [image: result of regex]. As you can see, it also finds built-in functions such as print and range. How can I exclude them? The values to be excluded are preceded by

Call(Name(' 

I tried

 (?=Call\(Name\(')[a-z]+(?=', Load)

but it did not work out.

My code is:

import re

test = '''Module([Assign([Name('a', Store())], Num(2)), Assign([Name('b', Store())], Num(3)), Assign([Name('c', Store())], Str('Hello')), Assign([Name('x', Store())], BinOp(Name('a', Load()), Add(), Name('b', Load()))), Assign([Name('x', Store())], Name('a', Load())), Expr(Call(Name('print', Load()), [Name('a', Load())], [], None, None)), For(Name('i', Store()), Call(Name('range', Load()), [Num(10)], [], None, None), [Expr(Call(Name('print', Load()), [Name('a', Load())], [], None, None))], [])])'''
print(re.findall(r"[a-z]+(?=', Load)", test))
print(re.findall(r"(?=Call\(Name\(')[a-z]+(?=', Load) ", test))
Verbal_Kint
techkuz
  • Please show your full code. – Nabin May 08 '17 at 05:42
  • You shouldn't use regex for this. It is already in the most optimal form in terms of being able to walk it to find any elements you are looking for. While the answers here may accurately extract the data you seek from the specific example provided, no regular expression will reliably be able to do so. Check my answer below. – Verbal_Kint May 08 '17 at 06:54

4 Answers

3

Use a lookbehind and word boundary.

(?<!Call\(Name\(')\b\w+\b(?=', Load)

See demo.

https://regex101.com/r/hdxlQ8/1
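For reference, here is a minimal sketch of how this pattern could be applied from Python. The helper name `loaded_variable_names` and the shortened sample dump are made up for illustration; the pattern itself is the one given above. Python's `re` only accepts fixed-width lookbehinds, which this one is.

```python
import re

# Pattern from the answer above: a word followed by "', Load" that is NOT
# immediately preceded by "Call(Name('".
PATTERN = re.compile(r"(?<!Call\(Name\(')\b\w+\b(?=', Load)")

def loaded_variable_names(dump):
    """Return loaded names in an AST dump string, skipping called names."""
    return PATTERN.findall(dump)

# Shortened sample in the same format as the question's dump (illustrative only).
sample = (
    "Module([Assign([Name('x', Store())], "
    "BinOp(Name('a', Load()), Add(), Name('b', Load()))), "
    "Expr(Call(Name('print', Load()), [Name('a', Load())], [], None, None))])"
)

print(loaded_variable_names(sample))  # ['a', 'b', 'a']
```

Note that this depends on the exact text layout of the dump, so it only holds for dumps formatted like the one in the question.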

vks
1

A negative lookbehind:

(?<!Call\(Name\()'(\w+)(?=', Load)

or

(?<!Call\()Name\('(\w+)', Load
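A quick sketch of how these two variants behave with `re.findall`: because each pattern contains a capture group, `findall` returns only the captured name rather than the whole match. The function name `captured_names` and the sample string are assumptions for illustration.

```python
import re

# The two variants from the answer above; both capture the name in group 1.
VARIANT_QUOTE = r"(?<!Call\(Name\()'(\w+)(?=', Load)"
VARIANT_NAME = r"(?<!Call\()Name\('(\w+)', Load"

def captured_names(pattern, dump):
    """With one capture group, re.findall returns the group, not the full match."""
    return re.findall(pattern, dump)

# Illustrative sample in the same format as the question's dump.
sample = (
    "BinOp(Name('a', Load()), Add(), Name('b', Load())), "
    "Call(Name('print', Load()), [Name('a', Load())], [], None, None)"
)

print(captured_names(VARIANT_QUOTE, sample))  # ['a', 'b', 'a']
print(captured_names(VARIANT_NAME, sample))   # ['a', 'b', 'a']
```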
handle
1

I used eval() for this. I don't recommend doing it this way, but you can use it as an alternative.

Here test is the variable that holds the long string, and filtered ends up holding your desired list of values.

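import re

# `test` is the long dump string from the question
candidates = re.findall(r"[a-z]+(?=', Load)", test)

filtered = []
for name in candidates:
    try:
        eval(name)  # succeeds for names that already exist, e.g. print, range
    except NameError:
        # the name is not defined anywhere, so keep it as a variable name
        filtered.append(name)
    except Exception:
        pass

print(filtered)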

Output:

['a', 'b', 'a', 'a', 'a']

We try to evaluate each string using eval(). If there is no variable, function or class with that name, the Python interpreter raises a NameError, which suggests the name is not a built-in, so we append the string to the filtered list. Note that this only works if the names in question (here a and b) are not defined in the scope where eval() runs.

PS. Any other exceptions, such as TypeError, are ignored.

Nabin
1

This looks like a parse tree. I would not use regex for this, for countless reasons that others have explained much better in some pretty famous posts (granted, that post deals with [X]HTML, but the lesson remains: do not use regular expressions to parse more complex grammars).

My understanding is that ASTs, and in this case the actual concrete parse tree, are described by context-free grammars, which are not regular and therefore cannot be reliably parsed with regular expressions. Plus, that code is already in a pretty convenient state in terms of walkability. If anything, recreate the objects and walk the tree, knowing the rule that variable names are the left-side terminal of the Assign statement, with the value on the right. This will certainly take less time and cause fewer headaches than using regex.

Do yourself a favor and do not attempt this with regex unless you are dealing with a small, known variety of these.

For further reading.
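As a rough sketch of the "walk the tree" idea, assuming the original source code is available rather than only the dump string: Python's built-in `ast` module can parse the source and visit every node. The source below is a reconstruction from the dump in the question, and the function name `loaded_names` is hypothetical.

```python
import ast

# Source reconstructed from the dump in the question (an assumption).
SOURCE = """
a = 2
b = 3
c = 'Hello'
x = a + b
x = a
print(a)
for i in range(10):
    print(a)
"""

def loaded_names(code):
    """Return every name read in `code`, except names that are being called."""
    tree = ast.parse(code)
    # Identify Name nodes used as the callee, e.g. `print` in print(a).
    callee_ids = {
        id(node.func)
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    return [
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name)
        and isinstance(node.ctx, ast.Load)
        and id(node) not in callee_ids
    ]

# ast.walk makes no ordering promise, so sort before printing.
print(sorted(loaded_names(SOURCE)))  # ['a', 'a', 'a', 'a', 'b']
```

Unlike the regex approaches, this does not depend on how the dump happens to be formatted.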

Community
Verbal_Kint
  • Thank you very much. Yes, it is indeed an AST. I am trying to use the built-in Python AST module to find which variables are initialized (e.g. a = 4) but not used in the code. My idea was to get this information from the AST dump, as used variables are marked there as loaded. – techkuz May 08 '17 at 07:23
  • @trthhrtz I haven't used the AST module but I would imagine it must provide a way to walk the tree, which is probably what you want – Verbal_Kint May 08 '17 at 07:29
  • @trthhrtz also, this sounds like something a good linter would handle. There must be a python package of some sort that does that, whether you want the functionality or just to peek under the hood at how others have approached this problem – Verbal_Kint May 08 '17 at 07:35
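The goal described in these comments (variables that are assigned but never read) can be sketched with the `ast` module directly, without going through the dump string. This is a simplified illustration, not a replacement for a linter: it ignores scopes, attributes, deletions and many other cases, and the function name `unused_assignments` is made up.

```python
import ast

def unused_assignments(code):
    """Return names that are stored somewhere in `code` but never loaded."""
    stored, loaded = set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                stored.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                loaded.add(node.id)
    # Set difference: assigned at least once, read nowhere.
    return stored - loaded

example = "a = 2\nc = 'Hello'\nprint(a)\n"
print(unused_assignments(example))  # {'c'}
```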