
I am trying to parse a beautified JavaScript file that contains huge functions. I want to capture each function as a separate match object so that I can process them individually.

An example could be:

__d(function(e, t, n, r, i, l, a) {

//AnyCharacters

}, 93, [27, 38, 40, 37, 94, 98, 99, 32]);

I am trying the following regex:

(?s)__d\(function\((\w+,\s)+\w+\)\s\{(.*?)\},\s\d+(.*?)\d\]\)\;

For more context, I am trying to write each function to a file after some more processing:

import re

functions_sep_regex = re.compile(r'(?s)__d\(function\((\w+,\s)+\w+\)\s\{(.*?)\},\s\d+(.*?)\d\]\)\;')

# res is assumed to hold the contents of the beautified JavaScript file
functions_sep = functions_sep_regex.finditer(res)

for functions in functions_sep:
    pass  # Do something with functions.group(0)

I think the backtracking comes from the first (.*?), since I am using it to capture every character between the start and the end of the function.

Some backtracking is expected, because the pattern has to match any character (including newlines), but because of it the engine crashes.

Is there a way to avoid this "crash"?

EDIT:

Reproducible example: pastebin.com/PcdSWnWG

LaiKash
  • I think you need to give us a reproducible example. [This particular example happens to be working](https://rextester.com/FMS23291). – Tim Biegeleisen Sep 27 '21 at 15:44
  • @Jan I saw that in there, it had a smell, but I didn't dare answer because I couldn't reproduce it in Python. You may give an answer if you wish. – Tim Biegeleisen Sep 27 '21 at 16:16
  • Try `(?s)__d\(function\(\w+(,\s+\w+)*\)\s\{([^{}]*)\},\s\d(.*?)\d\]\);` ([demo](https://regex101.com/r/ZNuLVQ/1)). – Wiktor Stribiżew Sep 27 '21 at 20:43
  • @WiktorStribiżew it didn't get all the matches. There are 6 functions that start with `__d(function(g, r, i, a, m, e, d)` and that expression got only 2 matches. – LaiKash Sep 28 '21 at 09:56
  • Ok, 6 matches: https://regex101.com/r/ZNuLVQ/2 – Wiktor Stribiżew Sep 28 '21 at 10:19
  • With the biggest files it times out: https://regex101.com/r/swkpLn/1 Maybe what I am attempting is too heavy for a single regex and I need to parse the file byte by byte instead? @WiktorStribiżew – LaiKash Sep 28 '21 at 11:17
  • That is of course true: if you have a code file, you should always consider a dedicated parser rather than long single regexps. Also, did you try the regex above with `re` only? Try the PyPI `regex` module. Install it with `pip install regex` and retry, it is much more stable. – Wiktor Stribiżew Sep 28 '21 at 11:28
  • I will try PyPI regex as it supports atomic grouping... But yes, I will need a dedicated parser. Do you know any JS parser in Python to extract the functions? @WiktorStribiżew – LaiKash Sep 28 '21 at 11:46
  • See [JavaScript parser in Python](https://stackoverflow.com/q/390992/3832970), there are some good hints. I never parsed JS in Python. – Wiktor Stribiżew Sep 28 '21 at 12:11
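
As a rough illustration of the "scan it instead of one heavy regex" idea from the comments, here is a minimal sketch that finds each `__d(function(` start and walks forward counting parentheses until the call closes. The function name `split_d_calls` and the constant `MARKER` are made up for this example. It assumes parentheses are balanced and that no unbalanced parenthesis appears inside a string literal, regex literal or comment; a real JavaScript parser would not need that assumption.

```python
MARKER = "__d(function("  # hypothetical constant: the prefix seen in the question


def split_d_calls(source):
    """Yield each top-level __d(...) call found in source, in order.

    Assumption: parentheses are balanced and none occur unbalanced inside
    string literals, regex literals or comments.
    """
    pos = 0
    while True:
        start = source.find(MARKER, pos)
        if start == -1:
            return
        depth = 0
        end = None
        # Start at the "(" right after "__d" and track the nesting depth.
        for i in range(start + len("__d"), len(source)):
            ch = source[i]
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break
        if end is None:
            return  # unbalanced input: stop rather than guess
        # Keep the trailing ";" when it directly follows the call.
        if source.startswith(";", end):
            end += 1
        yield source[start:end]
        pos = end


sample = (
    "__d(function(e, t, n, r, i, l, a) {\n\n//AnyCharacters\n\n"
    "}, 93, [27, 38, 40, 37, 94, 98, 99, 32]);\n"
    "__d(function(g, r, i, a, m, e, d) {\n    return r((1), 2);\n}, 94, [1, 2]);\n"
)
calls = list(split_d_calls(sample))
```

Each character is visited at most once per call, so the running time grows linearly with the file size and there is no backtracking involved.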

1 Answer


The problem is not the (.*?) but the nested quantifiers:

functions_sep_regex = re.compile(r'(?s)__d\(function\((\w+,\s)+\w+\)\s\{(.*?)\},\s\d+(.*?)\d\]\)\;')
#                                                           ^^^

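This group is likely to cause excessive backtracking: when the rest of the pattern fails, the engine retries the repeated group in every way it can before giving up on a match.
Either make it possessive with ++ (supported by the PyPI regex module, and by re from Python 3.11 on) or rephrase this part of your expression.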
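
One possible way to rephrase it, as a minimal sketch rather than a tested fix for the real file: match the parameter list with a single negated character class instead of a repeated group, and spell out the trailer instead of using a second (.*?). This assumes every function ends with the shape `}, <number>, [<list>]);` shown in the question; the names `FUNCTION_RE` and `sample` are made up for the example.

```python
import re

FUNCTION_RE = re.compile(
    r"__d\(function\([^)]*\)\s\{"   # header: parameter list, no repeated group
    r"(.*?)"                         # function body, matched lazily
    r"\},\s\d+,\s\[[^\]]*\]\);",     # trailer: "}, 93, [27, 38, ...]);"
    re.DOTALL,
)

sample = """__d(function(e, t, n, r, i, l, a) {

//AnyCharacters

}, 93, [27, 38, 40, 37, 94, 98, 99, 32]);
__d(function(g, r, i, a, m, e, d) {
    return r(1);
}, 94, [1, 2]);
"""

# group(0) is the whole __d(...) call, group(1) only the function body.
bodies = [m.group(1) for m in FUNCTION_RE.finditer(sample)]
```

The lazy body still has to test the trailer at every position, so this can remain slow on very large inputs, and it will stop early if a function body itself contains text shaped like the trailer.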

Jan
  • OK...but you should also give the correct pattern. Just telling the OP that it's wrong is not an actual solution `:-)` – Tim Biegeleisen Sep 27 '21 at 16:17
  • @TimBiegeleisen: Completely agree with you - however I'd need some more input examples. Let's wait what the OP says - if anything. – Jan Sep 27 '21 at 16:19