
I'd like to parse define statements in a PHP file using a Python regex. (Or in other words: I want to use Python to parse a PHP file.)

What I'd like to parse are define statements like this:

define("My_KEY", "My_Value l");
define('My_KEY', 'My_Value');
define(   'My_KEY'  ,    "My_Value"   );

So I came up with the following Python regex:

define\(\s*["']{1}(.[^'"]*)["']{1}\s*,\s*["']{1}(.[^'"]*)["']{1}\s*\)

This works great, as long as there is no use of a " or ' inside the define statement. For example something like this will not work:

define(   'My_KEY'  ,    'My\'_\'Value'   );
define(   'My_KEY'  ,    "My'_'Value"   );
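For reference, here is a minimal sketch that reproduces the problem with Python's re module. The names `PATTERN`, `samples` and `results` are only for illustration; the pattern is the one above and the sample lines are the ones from this question.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(
    r"""define\(\s*["']{1}(.[^'"]*)["']{1}\s*,\s*["']{1}(.[^'"]*)["']{1}\s*\)"""
)

# Raw strings, so the backslashes reach the regex exactly as written in PHP.
samples = [
    r'''define("My_KEY", "My_Value l");''',
    r'''define('My_KEY', 'My_Value');''',
    r'''define(   'My_KEY'  ,    "My_Value"   );''',
    r'''define(   'My_KEY'  ,    'My\'_\'Value'   );''',
    r'''define(   'My_KEY'  ,    "My'_'Value"   );''',
]

results = [PATTERN.search(line) for line in samples]
for line, match in zip(samples, results):
    # groups() gives (key, value) on success; None means the line did not match.
    print(line, "->", match.groups() if match else None)
```

The first three lines yield a (key, value) pair, while the last two yield None because `[^'"]*` stops at the first quote character inside the value.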

Any ideas how to approach this problem?

Jongware
tony994
  • Is regex necessary for the whole task? You could use regex to find `define(..)`, then split the string between the parentheses, trim it, etc., to get the values you need. – Andy G May 14 '16 at 13:10
  • See http://stackoverflow.com/questions/1352693/how-to-match-a-quoted-string-with-escaped-quotes-in-it – Barmar May 14 '16 at 13:27
  • @AndyG Yes, I could, but I want to learn more about how to use regex, so that's why I came up with the question. – tony994 May 14 '16 at 13:52
  • @Barmar Thanks for the heads-up. – tony994 May 14 '16 at 13:52
  • @manuel Fair enough (from the answers you can see why I suggested taking two stages to do this ;)) – Andy G May 14 '16 at 14:09

4 Answers

You can use something like:

import re

# subject is assumed to hold the PHP source as one string (e.g. the file contents)
result = re.findall(r"""^define\(\s*['"]*(.*?)['"]*[\s,]+['"]*(.*?)['"]*\s*\)""", subject, re.IGNORECASE | re.DOTALL | re.MULTILINE)

Matches:

MATCH 1
1.  [8-14]  `My_KEY`
2.  [18-28] `My_Value l`
MATCH 2
1.  [40-46] `My_KEY`
2.  [50-58] `My_Value`
MATCH 3
1.  [73-79] `My_KEY`
2.  [88-96] `My_Value`
MATCH 4
1.  [114-120]   `My_KEY`
2.  [129-141]   `My\'_\'Value`
MATCH 5
1.  [159-165]   `My_KEY`
2.  [174-184]   `My'_'Value`
Pedro Lobito

Use lookarounds with this monster regex (it is written across two lines, so it needs the verbose flag, re.VERBOSE in Python):

define\(\s*(["'])(?P<key>.+?(?=\1))\1\s*,
\s*(["'])(?P<value>.+?)(?=\3)(?<!\\)\3

See a demo on regex101.com.
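
A minimal sketch of how this pattern could be used from Python. Because the pattern is written on two lines, the sketch compiles it with re.VERBOSE so the line break is ignored. `DEFINE_RE`, `php_source` and `pairs` are hypothetical names, and the input lines come from the question.

```python
import re

# The two-line pattern from this answer; re.VERBOSE makes the regex engine
# ignore the line break and the indentation inside the pattern.
DEFINE_RE = re.compile(
    r"""
    define\(\s*(["'])(?P<key>.+?(?=\1))\1\s*,
    \s*(["'])(?P<value>.+?)(?=\3)(?<!\\)\3
    """,
    re.VERBOSE,
)

# Sample input (raw string, so the PHP backslashes are kept as-is).
php_source = r"""
define("My_KEY", "My_Value l");
define(   'My_KEY'  ,    'My\'_\'Value'   );
define(   'My_KEY'  ,    "My'_'Value"   );
"""

# Collect (key, value) pairs through the named groups.
pairs = [(m.group("key"), m.group("value")) for m in DEFINE_RE.finditer(php_source)]
for key, value in pairs:
    print(key, "=>", value)
```

The named groups keep the calling code readable. Note that the captured value still contains the PHP escape sequences, for example `My\'_\'Value`.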

Jan

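In Python:

import re

# In a normal (non-raw) string literal, \' is just an escaped quote, so the
# text being matched contains 'My'_'Value' without any backslashes.
subject = "define(   'My_KEY'  ,    'My\'_\'Value'   )"
re.sub(r"""^define\(\s*['"]*(.*?)['"]*[\s,]+['"]*(.*?)['"]*\s*\)""", r'\2 ; \1', subject)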

Output:

"My'_'Value ; My_KEY"
Ani Menon

Description

This regex will do the following:

  • match all lines that start with define and have a key and value set inside parentheses
  • capture the key and value strings, without including the wrapping quotes
  • allow the key and value to be wrapped in either single or double quotes
  • correctly handle escaped quotes
  • avoid difficult edge cases like:
    • define( 'file path', "C:\\windows\\temp\\" ); where an escaped backslash may exist before a closing quote

The Regex

Note: using the following flags: case-insensitive, global, multiline

^define\(\s*(['"])((?:\\\1|(?:(?!\1).))*)\1\s*,\s*(['"])((?:\\\3|(?:(?!\3).))*)\3\s*\);

Capture groups

  • capture group 0 gets the entire match
  • capture group 1 gets the quote type surrounding the key
  • capture group 2 gets the key string inside the quotes
  • capture group 3 gets the quote type surrounding the value
  • capture group 4 gets the value string inside the quotes
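
In Python, the pattern and flags above could be applied roughly as follows. This is only a sketch: `DEFINE_RE`, `php_source` and `pairs` are made-up names, `re.IGNORECASE | re.MULTILINE` stands in for the case-insensitive and multiline flags, and `finditer` plays the role of the global flag, which Python does not have.

```python
import re

# The pattern from this answer, unchanged.
DEFINE_RE = re.compile(
    r"""^define\(\s*(['"])((?:\\\1|(?:(?!\1).))*)\1\s*,\s*(['"])((?:\\\3|(?:(?!\3).))*)\3\s*\);""",
    re.IGNORECASE | re.MULTILINE,
)

# A few of the sample lines (raw string, so the PHP backslashes survive).
php_source = r"""define("0 My_KEY", "0 My_Value l");
define(   '3 My_KEY\\'  ,    '3 My\'_\'Value'   );
define(   '4 My_KEY'  ,    "4 My'_'Value\\"   );
"""

# Group 2 is the key and group 4 is the value; groups 1 and 3 are the quotes.
pairs = [(m.group(2), m.group(4)) for m in DEFINE_RE.finditer(php_source)]
for key, value in pairs:
    print(key, "=>", value)
```

The captured strings are still in their escaped PHP form, so a key written as `'3 My_KEY\\'` comes back with both backslashes.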

Examples

Live Demo

https://regex101.com/r/oP4sV0/1

Sample Text

define("0 My_KEY", "0 My_Value l");
define('1 My_KEY', '1 My_Value');
define(   '2 My_KEY'  ,    "2 My_Value"   );
define(   '3 My_KEY\\'  ,    '3 My\'_\'Value'   );
define(   '4 My_KEY'  ,    "4 My'_'Value\\"   );

Sample Matches

[0][0] = define("0 My_KEY", "0 My_Value l");
[0][1] = "
[0][2] = 0 My_KEY
[0][3] = "
[0][4] = 0 My_Value l

[1][0] = define('1 My_KEY', '1 My_Value');
[1][1] = '
[1][2] = 1 My_KEY
[1][3] = '
[1][4] = 1 My_Value

[2][0] = define(   '2 My_KEY'  ,    "2 My_Value"   );
[2][1] = '
[2][2] = 2 My_KEY
[2][3] = "
[2][4] = 2 My_Value

[3][0] = define(   '3 My_KEY\\'  ,    '3 My\'_\'Value'   );
[3][1] = '
[3][2] = 3 My_KEY\\
[3][3] = '
[3][4] = 3 My\'_\'Value

[4][0] = define(   '4 My_KEY'  ,    "4 My'_'Value\\"   );
[4][1] = '
[4][2] = 4 My_KEY
[4][3] = "
[4][4] = 4 My'_'Value\\

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  ^                        the beginning of a "line"
----------------------------------------------------------------------
  define                   'define'
----------------------------------------------------------------------
  \(                       '('
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    ['"]                     any character of: ''', '"'
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      \\                       '\'
----------------------------------------------------------------------
      \1                       what was matched by capture \1
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
          \1                       what was matched by capture \1
----------------------------------------------------------------------
        )                        end of look-ahead
----------------------------------------------------------------------
        .                        any character except \n
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  \1                       what was matched by capture \1
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  ,                        ','
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    ['"]                     any character of: ''', '"'
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  (                        group and capture to \4:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      \\                       '\'
----------------------------------------------------------------------
      \3                       what was matched by capture \3
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
          \3                       what was matched by capture \3
----------------------------------------------------------------------
        )                        end of look-ahead
----------------------------------------------------------------------
        .                        any character except \n
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
  )                        end of \4
----------------------------------------------------------------------
  \3                       what was matched by capture \3
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  \)                       ')'
----------------------------------------------------------------------
  ;                        ';'
Ro Yo Mi