
I have a regular expression that finds function calls in files.

The expression works perfectly in PHP.

If I run the same regex with grep from the console, I get an error:

grep -rP "(_t\s*\(\s*([\'\"])(\d+)\2\s*,\s*([\'\"])(.*?)(?<!\\)\4\s*(?(?=,)[^\)]*\s*\)|\)))" application scripts library public data | sort -n | uniq

grep: unrecognized character after (?<

It looks like grep can't handle the (?<!\\) part of the regex, which is important for me.

Can anyone advise how to modify the regex so that grep works with it?

EDIT: String: _t('123', 'pcs.', '', $userLang) . $data['ticker'] . ' (' . $data['security_name'] . ')

Need to find:

  1. index in function ('123')

  2. text in function ('pcs.')

  3. function itself

    > _t('123', 'pcs.', '', $userLang)
    
Prosto Trader
  • 3
    That's not a very nice regex, is it? As you've discovered, the `!` character is significant to the shell within double quotes. Personally I'd go down the route of enclosing the whole thing in single quotes and then using `'"'"'` for each single quote in the regex. Either way, it would be useful if you could make your question self-contained by showing us the pattern you are trying to match here. – Tom Fenech Feb 17 '15 at 10:02
  • Don't you need the `-e` flag for extended regexen? – collapsar Feb 17 '15 at 10:05
  • 1
    Side note: the error you get is not thrown by `grep` but by Bash (look at the error: `-bash: !\: event not found`). An easy fix is to disable command history with `set +o history`. – gniourf_gniourf Feb 17 '15 at 10:07
  • when I run it from PHP with exec I still get error on that part "grep: missing )" – Prosto Trader Feb 17 '15 at 10:10
  • 1
    @collapsar that would be `-E` (at least on my grep) and here the OP is using `-P`, which enables Perl regular expression (PCRE) support. – Tom Fenech Feb 17 '15 at 10:14
  • possible duplicate of ["Event not found" error for shell command in unix](http://stackoverflow.com/questions/10221835/event-not-found-error-for-shell-command-in-unix) – tripleee Feb 17 '15 at 10:36
  • @tripleee, it is not about events in shell, it's about regex in grep – Prosto Trader Feb 17 '15 at 11:05
  • @tripleee, I've edited error – Prosto Trader Feb 17 '15 at 11:06

2 Answers


Doing what I said in the comments solves your problem (using the data from the link):

$ cat file
_t('123', 'шт.', '', $userLang)  . $data['ticker'] . ' (' . $data['security_name'] . ')
$ grep -P '(_t\s*\(\s*(['"'"'"])(\d+)\2\s*,\s*(['"'"'"])(.*?)(?<!\\)\4\s*(?(?=,)[^\)]*\s*\)|\)))' file
_t('123', 'шт.', '', $userLang)  . $data['ticker'] . ' (' . $data['security_name'] . ')

The trick here is to use single quotes around the whole regex, then whenever you want a single quote, do '"'"', which means "close the original string, add a single quote within double quotes, then open a new single-quoted string". Another alternative, as proposed by glglgl, would be to use '\'', i.e. close the original string, add an escaped ' and open a new string.
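As a small illustration of the two spellings described above, the following sketch builds the `['"]` character class both ways and prints the result. The variable names `a` and `b` are only for this demonstration and are not part of the original command.

```shell
# Both spellings embed a literal single quote in an otherwise single-quoted string.
a='['"'"'"]'   # close the string, add "'" in double quotes, reopen
b='['\''"]'    # close the string, add an escaped \', reopen

# Each line printed is the text grep would actually receive: ['"]
printf '%s\n' "$a" "$b"
```

Printing an argument with `printf '%s\n'` before handing it to grep is a cheap way to check what the shell has done to a pattern.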

Using single quotes prevents bash from interpreting the ! as a history expansion. As gniourf_gniourf mentions above, the other option would be to disable that behaviour using set +o history.
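The choice of quotes also changes the backslashes, which tripleee points out in the comments. The sketch below is meant to be run as a script, where history expansion is off by default, so the `!` is harmless there. In an interactive bash session the double-quoted line would hit the history expansion problem first, unless it has been switched off (for example with `set +H`). The names `dq` and `sq` are made up for the demonstration.

```shell
# Double quotes: the shell reduces \\ to a single backslash,
# so the regex engine would see (?<!\) and the lookbehind never closes.
dq="(?<!\\)"

# Single quotes: the text is passed through untouched.
sq='(?<!\\)'

printf '%s\n' "$dq" "$sq"
# prints:
# (?<!\)
# (?<!\\)
```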

Just as a suggestion, if you're looking to capture separate parts of the regex (and you're already using PCRE mode in grep), you could use Perl instead:

$ perl -lne '/(_t\s*\(\s*(['\''"])(\d+)\2\s*,\s*(['\''"])(.*?)(?<!\\)\4\s*(?(?=,)[^\)]*\s*\)|\)))/ && print "group 1: $1\ngroup 3: $3\ngroup 5: $5"' file
group 1: _t('123', 'шт.', '', $userLang)
group 3: 123
group 5: шт.
Tom Fenech
  • 2
    Alternative to `'"'"'` is `'\''`. – glglgl Feb 17 '15 at 10:12
  • I might be missing something... grep -rP "(_t\s*\(\s*(['"'"'"])(\d+)\2\s*,\s*(['"'"'"])(.*?)(?<\)\4\s*(?(?=,)[^\)]*\s*\)|\)))" application scripts library public data | sort -n | uniq – Prosto Trader Feb 17 '15 at 10:17
  • grep: unrecognized character after (? – Prosto Trader Feb 17 '15 at 10:18
  • same error with single quotes... grep -rP '(_t\s*\(\s*(['"'"'"])(\d+)\2\s*,\s*(['"'"'"])(.*?)(?<\)\4\s*(?(?=,)[^\)]*\s*\)|\)))' application scripts library public data | sort -n | uniq – Prosto Trader Feb 17 '15 at 10:22
  • seems like trouble is not with [\'\"] part. The trouble is with (?<!\\) part of the regex. If I remove it, grep works fine with any combination of single and double quotes. – Prosto Trader Feb 17 '15 at 10:24
  • @Prosto In that case, I would suggest using Perl as I have shown. Does that work for you? – Tom Fenech Feb 17 '15 at 10:25
  • @TomFenech You are trying to emulate a stateful parser with a stateless regex. This cannot work reliably. – hek2mgl Feb 17 '15 at 10:38
  • @hek2mgl I'm not saying it can, though perhaps I should be more explicit in stating that I don't endorse the approach. As far as I'm concerned, the original issue was a shell scripting one (especially when the question itself contained no sign of any PHP), hence my answer. – Tom Fenech Feb 17 '15 at 10:57
  • @TomFenech I never before used perl to search for text in file in multiple directories. Not sure what to start with. – Prosto Trader Feb 17 '15 at 10:58
  • @TomFenech Sure, I should have said "OP tries to emulate ..." However, I think there are many "funny" edge cases like `$string = "\'function_call()\'";` or whatever complexity you can imagine. A regex can't provide unlimited complexity. – hek2mgl Feb 17 '15 at 11:15
  • 2
    @Prosto I'm sorry but that is really beyond the scope of the original question. Your regex is working now, at least in a version of grep that supports PCRE mode. To search in multiple files, you will have to pass all of their names to Perl. How exactly you go about doing that depends on your shell and your directory structure at a minimum. – Tom Fenech Feb 17 '15 at 11:15
  • 1
    The double quotes also explain the unmatched parenthesis error and related errors; within double quotes, `\\)` gets reduced to `\)`. – tripleee Feb 17 '15 at 11:22

I strongly recommend using the tokenizer extension to parse PHP files. Parsing a programming language requires a stateful parser; a single regex is stateless and therefore cannot provide this.

Here is an example of how to extract function names from a PHP source file; tracking function calls is possible as well:

$source = file_get_contents('some.php');

$tokens = token_get_all($source);
for($i = 0; $i < count($tokens); $i++) {
    $token = $tokens[$i];
    if(!is_string($token)) {
        if($token[0] === T_FUNCTION) {
            // Skip the whitespace token between the keyword 'function'
            // and the function's name
            $i += 2;
            // Avoid printing the opening bracket of a closure
            if($tokens[$i][0] === T_STRING) {
                echo $tokens[$i][1] . PHP_EOL;
            }
        }
    }   
}

In the comments you mentioned that you also want to parse HTML and JS files. I recommend a DOM/JS parser for those.

hek2mgl