2

I have a source code in Fortran (almost irrelevant) and I want to parse the function names and arguments.

eg using

(\w+)\([^\(\)]+\)

with

a(b(1 + 2 * 2), c(3,4))

I get the following: (as expected)

b, 1 + 2 * 2
c, 3,4

where I would need

a, b(1 + 2 * 2), c(3,4)
b, 1 + 2 * 2
c, 3,4

Any suggestions?

Thanks for your time...

lmount
  • 1,282
  • 11
  • 9
  • I don't think you will be able to do this using only regular expressions. As their name suggests, they are only capable of handling *regular* expressions; I'm not up enough on language theory to know where Fortran falls, but my guess is you'll need a skeleton parser to do this right. – kquinn Mar 12 '09 at 08:55

5 Answers5

2

I don't think this is a job for regular expressions... they can't really handle nested patterns.

This is because regexes are compiled into FSMs (Finite State Machines). In order to parse arbitrarily nested expressions, you can't use a FSM, because you need infinitely many states to keep track of the arbitrary nesting. Also see this SO thread.

Community
  • 1
  • 1
David Z
  • 128,184
  • 27
  • 255
  • 279
2

This is a nonlinear grammar -- you need to be able to recurse on a set of allowed rules. Look at pyparsing to do simple CFG (Context Free Grammar) parsing via readable specifications.

It's been a while since I've written out CFGs, and I'm probably rusty, so I'll refer you to the Python EBNF to get an idea of how you can construct one for a subset of a language syntax.

Edit: If the example will always be simple, you can code a small state machine class/function that iterates over the tokenized input string, as @Devin Jeanpierre suggests.

Community
  • 1
  • 1
cdleary
  • 69,512
  • 53
  • 163
  • 191
2

It can be done with regular expressions-- use them to tokenize the string, and work with the tokens. i.e. see re.Scanner. Alternatively, just use pyparsing.

Devin Jeanpierre
  • 92,913
  • 4
  • 55
  • 79
  • Right, you can tokenize and use your own state machine you can do it, but that's technically not just using regular expressions. – cdleary Mar 12 '09 at 09:17
  • Thing is that I didn't see "just regex" in the question, only "regex". The re module includes a Scanner, and has for ages-- not that it's documented (bleh). – Devin Jeanpierre Mar 12 '09 at 09:23
  • Oooh cool, undocumented features! +1 from me for telling me to look. :-) – cdleary Mar 12 '09 at 09:51
2

You can take a look at PLY (Python Lex-Yacc), it's (in my opinion) very simple to use and well documented, and it comes with a calculator example which could be a good starting point.

Paolo Tedesco
  • 55,237
  • 33
  • 144
  • 193
1

You can't do this with regular expression only. It's sort of recursive. You should match first the most external function and its arguments, print the name of the function, then do the same (match the function name, then its arguments) with all its arguments. Regex alone are not enough.

BenMorel
  • 34,448
  • 50
  • 182
  • 322
Andrea Ambu
  • 38,188
  • 14
  • 54
  • 77