3

I'm trying to use RegEx within Python to parse out a function definition and NOTHING else. I keep running into problems though. Is RegEx the right tool to be using here?

For example:

def foo():
  print bar
-- Matches --

a = 2
def foo():
  print bar
-- Doesn't match as there's code above the def --

def foo():
  print bar
a = 2
-- Doesn't match as there's code below the def --

An example of a string I'm trying to parse is "def isPalindrome(x):\n return x == x[::-1]". But in reality that might contain lines above or below the def itself.

What regular expression would I have to use to achieve this?

eumiro
Strings

2 Answers

9

No, regular expressions are not the right tool for this job. This is similar to people desperately trying to parse HTML with regular expressions. These languages are not regular. A regex can therefore never cover every quirk you will encounter in real code.

Use a real parser instead: build a parse tree, check it for definition nodes and work with those. The built-in parser module can do this (note that it was deprecated in Python 3.9 and removed in 3.10), but the ast module is much more convenient to use. An example:

import ast

mdef = 'def foo(x): return 2*x'
a = ast.parse(mdef)
# Collect every function definition node in the tree, nested ones included
definitions = [n for n in ast.walk(a) if isinstance(n, ast.FunctionDef)]
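
To tie this back to the question ("a function definition and NOTHING else"), here is a minimal sketch of one way to do the check with ast. The helper name `is_only_function_def` is made up for illustration, and the sample strings use Python 3 `print(...)` calls because `ast.parse` only accepts the syntax of the interpreter it runs under.

```python
import ast


def is_only_function_def(source):
    """Return True if source is exactly one top-level def and nothing else."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Not valid Python for the running interpreter
        return False
    # The module body must hold a single statement, and it must be a def
    return len(tree.body) == 1 and isinstance(tree.body[0], ast.FunctionDef)


print(is_only_function_def("def foo():\n    print(bar)\n"))          # True
print(is_only_function_def("a = 2\ndef foo():\n    print(bar)\n"))   # False, code above the def
print(is_only_function_def("def foo():\n    print(bar)\na = 2\n"))   # False, code below the def
```

Note that `async def` produces an `ast.AsyncFunctionDef` node, so this sketch would reject it unless that type is added to the `isinstance` check.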
nemo
  • _"This is similar to people desperately trying to parse HTML with regular expressions. These languages are not regular."_ Similar.... One does a lot of thing with this word. But which languages do you speak of ? For me, your assertion sounds as a blurring catechism – eyquem Mar 01 '13 at 13:31
  • Similar in the sense that it is the same problem they want to tackle, just in a different language. The problem in general is described [here](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html). A short explanation of why regular expressions can't parse HTML (and therefore Python) can be found [here](http://stackoverflow.com/a/590789/1643939). – nemo Mar 01 '13 at 13:49
  • It doesn't really have to parse it per-se, just recognise that there's text above or below it. Still recommend ast for that? – Strings Mar 01 '13 at 13:51
  • Yes. Why bother with parsing when there's a module which does everything for you including validation of the input so you don't have to catch those yourself? You still need to find that there's a function definition (the non-trivial part). – nemo Mar 01 '13 at 14:09
  • @nemo I'm rather convinced by your argument for the **ast** module. I'm aware that patching a regex pattern to correct its flaws can be an endless activity if one has the pretension to create the absolute regex that will catch ALL the function definitions that have existed and will exist until the end of time. In this case, using the **ast** module is certainly more intelligent since it is a Python tool for analyzing Python code: this seems very coherent to me. That's different from employing BeautifulSoup to parse HTML text, because BS wasn't created natively together with HTML – eyquem Mar 01 '13 at 16:00
  • @nemo So what I think is that for a limited analysis of code that is not expected to have too many complications and special cases, a regex may be fully acceptable and preferable to **ast**. As soon as the task becomes too thorough, **ast** is certainly better. What I don't like is that refrain about the inadequacy of regexes to analyse this language, and this other one, and another one again, and ... and ... and – eyquem Mar 01 '13 at 16:06
  • Of course there are situations where you can live with a regexp; the mentioned article states that as well. In this case it's easier and less error-prone to use a full-fledged parser (which is already there) to tackle the problem. The trouble is that problems can occur here that boil down to the **inability** of regular expressions to parse a non-regular language. One example is ignoring commented-out functions. So in this case it is clearly better to use the **ast** module. – nemo Mar 01 '13 at 17:26
  • @Strings did this answer your question? Is it unclear? If the question is answered, you might want to accept it. – nemo Mar 02 '13 at 20:53
  • The `ast` module has one huge drawback: it doesn't keep formatting and comments :( In some scenarios that is required. In that case I'd prefer the next answer by @eyquem. Otherwise one can use the `redbaron` or `libcst` libraries. – grundic Mar 01 '20 at 16:28
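
On the formatting concern in the last comment: if all that is needed is the original text of each function, one possible stdlib-only approach (assuming Python 3.8 or later) is `ast.get_source_segment`, which slices the original source using the node's position information. This is only a sketch; the sample `source` string and the `segments` name are made up for illustration.

```python
import ast

source = (
    "a = 2\n"
    "def foo(x):\n"
    "    # this comment survives\n"
    "    return 2 * x\n"
    "b = 3\n"
)

tree = ast.parse(source)
segments = []
for node in tree.body:
    if isinstance(node, ast.FunctionDef):
        # Slice the original text using the node's line and column offsets
        segments.append(ast.get_source_segment(source, node))

for segment in segments:
    print(segment)
```

Comments and spacing between the `def` line and the last statement are kept because the text is copied rather than regenerated. As far as I know, decorators and any comments trailing the last statement fall outside the reported span, so tools such as `libcst` remain the better fit when a full round trip is required.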
2
import re

# Raw strings keep the regex escapes (\w, \(, \2 ...) intact
reg = re.compile(r'((^ *)def \w+\(.*?\): *\r?\n'
                 r'(?: *\r?\n)*'
                 r'\2( +)[^ ].*\r?\n'
                 r'(?: *\r?\n)*'
                 r'(\2\3.*\r?\n(?: *\r?\n)*)*)',
                 re.MULTILINE)

EDIT

import re
script = '''
def foo():
  print bar

a = 2
def foot():
  print bar

b = 10
"""
opopo =457
def foor(x):


  print bar
  print x + 10
  def g(u):
    print

  def h(rt,o):
    assert(rt==12)
a = 2
class AZERT(object):
   pass
"""


b = 10
def tabulae(x):


\tprint bar
\tprint x + 10
\tdef g(u):
\t\tprint

\tdef h(rt,o):
\t\tassert(rt==12)
a = 2


class Z:
    def inzide(x):


      print baracuda
      print x + 10
      def gululu(u):
        print

      def hortense(rt,o):
        assert(rt==12)



def oneline(x): return 2*x


def scroutchibi(h%,n():245sqfg srot b#

'''


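# First alternative: one-line definitions; second: a def followed by an indented body
reg = re.compile(r'((?:^[ \t]*)def \w+\(.*\): *(?=.*?[^ \t\n]).*\r?\n)'
                 r'|'
                 r'((^[ \t]*)def \w+\(.*\): *\r?\n'
                 r'(?:[ \t]*\r?\n)*'
                 r'\3([ \t]+)[^ \t].*\r?\n'
                 r'(?:[ \t]*\r?\n)*'
                 r'(\3\4.*\r?\n(?: *\r?\n)*)*)',
                 re.MULTILINE)

# Triple-quoted strings, used to flag definitions that sit inside them
regcom = re.compile(r'("""|\'\'\')(.+?)\1', re.DOTALL)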


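# Spans of the text enclosed in triple quotes
avoided_spans = [ma.span(2) for ma in regcom.finditer(script)]

# Note: Python 2 syntax (print statements) throughout
print 'eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee'
for ma in reg.finditer(script):
    print ma.group(),
    print '--------------------'
    print repr(ma.group())
    print
    try:
        # Executing the matched text checks that it is a valid definition
        exec(ma.group().strip())
    except Exception:
        print "   isn't a valid definition of a function"
    am,bm = ma.span()
    # A match lying entirely inside a triple-quoted span counts as commented out
    if any(a<=am<=bm<=b for a,b in avoided_spans):
        print '   is a commented definition function'

    print 'eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee'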

Result:

eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
def foo():
  print bar

--------------------
'def foo():\n  print bar\n\n'

eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
def foot():
  print bar

--------------------
'def foot():\n  print bar\n\n'

eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
def foor(x):


  print bar
  print x + 10
  def g(u):
    print

  def h(rt,o):
    assert(rt==12)
--------------------
'def foor(x):\n\n\n  print bar\n  print x + 10\n  def g(u):\n    print\n\n  def h(rt,o):\n    assert(rt==12)\n'

   is a commented definition function
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
def tabulae(x):


    print bar
    print x + 10
    def g(u):
        print

    def h(rt,o):
        assert(rt==12)
--------------------
'def tabulae(x):\n\n\n\tprint bar\n\tprint x + 10\n\tdef g(u):\n\t\tprint\n\n\tdef h(rt,o):\n\t\tassert(rt==12)\n'

eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
    def inzide(x):


      print baracuda
      print x + 10
      def gululu(u):
        print

      def hortense(rt,o):
        assert(rt==12)



--------------------
'    def inzide(x):\n\n\n      print baracuda\n      print x + 10\n      def gululu(u):\n        print\n\n      def hortense(rt,o):\n        assert(rt==12)\n\n\n\n'

eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
def oneline(x): return 2*x
--------------------
'def oneline(x): return 2*x\n'

eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
def scroutchibi(h%,n():245sqfg srot b#
--------------------
'def scroutchibi(h%,n():245sqfg srot b#\n'

   isn't a valid definition of a function
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
eyquem
  • I see several problems with this. Using tabs for indentation doesn't work with this one. One-line definitions e.g. `def foo(x): return 2*x` don't work. Commented functions (surrounded with `"""`) still match. – nemo Mar 01 '13 at 14:18
  • I updated my regex to correct the flaws you mentioned. It now catches function definitions indented with tabs (but not ones that mix tabs and spaces) and one-line definitions. - I don't understand the problem with what you call commented functions. I tried my code with `"""` surrounding one of the functions, and it is detected anyway. – eyquem Mar 01 '13 at 15:31
  • That's the problem. It shouldn't be detected. – nemo Mar 01 '13 at 17:23
  • @nemo Why ? It depends on what wants the OP, and we don't know in fact. - A question I wonder about is : what is the purpose of a commented function in a script ? - - And by the way, I've updated again my answer, now two regexes and conditions in the code allow to detect when a definition function is an invalid one AND when it is a commented one. – eyquem Mar 01 '13 at 18:21
  • Imagine OP is analyzing code before it gets checked into a VCS, a common thing to do. I've been told that there exist people which leave code commented and your regexp would have used this code. A false positive. I'm wondering why you're attempting to create this regular expression. By the way, `def oneline2 (x): return 4*x` doesn't seem to work. – nemo Mar 01 '13 at 18:38
  • @nemo Thank you for the VCS example. - I like your understatement _"I've been told that there exist people which leave code commented"_ ahahah... I know, but I wonder why to put an ENTIRE function between ``"""....."""``. Well, doesn't matters. - I attempted, first trying to answer to OP, secondly as a challenge, thirdly because I need to understand things to adhere and change my opinion. Now, you have explained me things and I studied the ``ast`` module, and I'm enthusiastic about it. – eyquem Mar 01 '13 at 20:58
  • @nemo It's a perfect tool, made by pythonistas to parse Python code, and it certainly does this better than I would ever do., or I would need long time before reaching all the knowing about the structure of a Python code required to write regexes that would do the same that is already available in ``ast``. So I am convinced now, for this precise case, one must learn and use ``ast``, it's the smart way to develop. Thank you for your advices. – eyquem Mar 01 '13 at 20:59
  • @nemo By the way, the OP asked an XY question. His problem is X and he thinks that using regex as an Y solution is the way to do. I tried to answer to Y , while you answered to X. - By the way, what do you mean saying that _"``def oneline2 (x): return 4*x`` doesn't seem to work"_ My regex detects it correctly, though – eyquem Mar 01 '13 at 21:03
  • Note the space between `oneline2` and the braces :) – nemo Mar 01 '13 at 21:15
  • Ah yes! You mean that, as is, this valid function definition isn't caught by my regex and I should modify my regex's pattern one more time. OK. From change to change, I would reach, after some indefinite time, a regex as valid as the tool that does the same work in ``ast``. It's easier to use the full-fledged ``ast`` module, OK, I agree. - BTW I just upvoted your answer, and I will certainly upvote some more when I read your other answers via your user page. Thank you – eyquem Mar 01 '13 at 21:31
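
For comparison, the three cases raised in these comments (a definition wrapped in `"""`, a space before the parenthesis, and tab indentation) can be run through `ast`. This is only a sketch on a made-up `source` string, written for Python 3:

```python
import ast

source = (
    '"""\n'
    'def commented_out(x):\n'
    '    return x\n'
    '"""\n'
    'def oneline2 (x): return 4*x\n'
    'def tabbed(x):\n'
    '\treturn x + 10\n'
)

tree = ast.parse(source)
# The triple-quoted text parses as a plain string constant, not as a def
names = [node.name for node in tree.body if isinstance(node, ast.FunctionDef)]
print(names)  # ['oneline2', 'tabbed']
```

The definition inside the triple-quoted string is skipped because the parser sees it as string data, while the other two are found without any special handling for spacing or tabs.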