
I am trying to create a regular expression that matches the operator ^ (xor) only when it acts as an operator between two strings, not when it is part of a string.

For example, having a file with these two lines:

'asdfasdf'; 'asdfasd'^'asdflkj';
['asdf', '^', 'asdf'];

only the first one should match, as it is the only one where ^ is not part of a string. How can I write a regular expression that matches ^ only when it is not inside a string?

UPDATE: I am using egrep. I need a way to tell whether ^ is part of a string or not. My final objective is to find where the xor operator is being applied to a string: something like ('[^']'\^.+|.+\^'[^']') but this matches the second line of my example.
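
To make the failure concrete, here is a minimal sketch in Python (its `re` module is used only for illustration, it is not egrep). It assumes the pattern was meant to have a `+` after each `[^']`, as in the variant quoted later in the comments; that quantifier and the names `NAIVE`, `LINES` and `naive_hits` are assumptions made for this example.

```python
import re

# Assumed form of the naive pattern, with `+` added after each [^'].
NAIVE = re.compile(r"'[^']+'\^.+|.+\^'[^']+'")

LINES = [
    "'asdfasdf'; 'asdfasd'^'asdflkj';",  # ^ used as an operator
    "['asdf', '^', 'asdf'];",            # ^ inside a string
]


def naive_hits(lines):
    """Return the lines that the naive pattern matches."""
    return [line for line in lines if NAIVE.search(line)]


if __name__ == "__main__":
    for line in naive_hits(LINES):
        print(line)
```

Both lines are reported. The pattern has no way to know which quotes open a string and which close one, so the `', '` between two real strings is read as a string itself.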

So, it should match strings like:

  'asdf1524-sdfaA'^'sdfa322='
  'sdfa22_'^$myvar
  $myvar^'asAf34%'

BUT it should not match:

 ['+','*','^','%']
 '^'=>2
 "afa^sadfa"

UPDATE2: Added one more example to show why the proposed awk solutions do not work. I need to locate the ^ operator when it operates on a single-quoted string. I want to count the occurrences of this in a file, and I want to add this check to a bash script.

Thanks in advance!

dbranco
  • It looks like you're trying to parse some code. What language are your files written in? I'm sure a regex can be crafted to work for most cases but I think that the most robust approach would be to parse the language properly. – Tom Fenech May 19 '15 at 11:01
  • Hi, thanks for your answer. I am trying to parse PHP code. – dbranco May 19 '15 at 11:07
  • I suspect this is beyond the capabilities of a regular expression. Look for a PHP parser to help you out. – glenn jackman May 19 '15 at 11:23
  • do not use comments to update your question. Instead, [edit] your question providing these details. – fedorqui May 19 '15 at 11:50
  • Thanks, I am an absolute noob on this. – dbranco May 19 '15 at 12:03

4 Answers

What you want to do is explicitly match the strings that might contain a ^ you don't want, and then discard those matches. This is thoroughly explained here and with a JavaScript example here.

If you are using PCRE regexes, you can use PCRE's (*SKIP)(*FAIL) verbs to discard the troublesome matches immediately. Otherwise you'll have to capture them in a capture-group, inspect it, and discard the entire match if the capture-group isn't empty.

This would be the PCRE way with Regex101 demo

(?:(['"])(?:(?!\1|\\).|\\.)*\1|\/\/[^\n]*(?:\n|$)|\/\*(?:[^*]|\*(?!\/))*\*\/)(*SKIP)(*FAIL)|\^

If you need to manually discard matches based on capture-groups, do this:

((['"])(?:(?!\2|\\).|\\.)*\2|\/\/[^\n]*(?:\n|$)|\/\*(?:[^*]|\*(?!\/))*\*\/)|\^

Regular expression visualization

See also the Debuggex Demo, where the ^'s you do want to match are yellow, denoting that they aren't in a capture-group. All other matches have a capture-group and are highlighted darker in the Debuggex visual.
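
As a rough illustration of the capture-group variant, here is a sketch in Python. Python's `re` module does not support (*SKIP)(*FAIL), so only the second pattern is shown. The names `STRING_COMMENT_OR_CARET` and `count_bare_carets` are made up for this example, and the sample input is taken from the question.

```python
import re

# Group 1 captures everything to throw away: quoted strings (with
# backslash escapes), // comments and /* ... */ comments.
# A match with an empty group 1 is a ^ outside all of those.
STRING_COMMENT_OR_CARET = re.compile(
    r"""((['"])(?:(?!\2|\\).|\\.)*\2"""   # '...' or "..." string
    r"""|//[^\n]*(?:\n|$)"""              # // comment up to end of line
    r"""|/\*(?:[^*]|\*(?!/))*\*/)"""      # /* ... */ comment
    r"""|\^"""                            # a caret (kept if group 1 is unset)
)


def count_bare_carets(source: str) -> int:
    """Count ^ characters that are not inside a string or a comment."""
    return sum(
        1
        for match in STRING_COMMENT_OR_CARET.finditer(source)
        if match.group(1) is None
    )


if __name__ == "__main__":
    sample = "'sdfa22_'^$myvar\n['+','*','^','%']\n"
    print(count_bare_carets(sample))
```

Note that this counts every ^ outside strings and comments, not only those next to a single-quoted string, so it may still need an extra check for the exact case in the question.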

Note: I added support for /*...*/ and // comments, but neither pattern accounts for heredoc/nowdoc strings in PHP. I don't know if that is important for you. You could add it fairly simply as another alternative that is either (*SKIP)(*FAIL)ed or captured and discarded.

asontu

Just use awk with fields and a trivial regexp instead of grep with a complicated regexp, e.g. using all sample input suggested so far in this thread:

$ cat file
'asdfasdf'; 'asdfasd'^'asdflkj';                YES
['asdf', '^', 'asdf'];                          NO
''o'^'o''                                       NO
'asdf1524-sdfaA'^'sdfa322='                     YES
'sdfa22_'^$myvar                                YES
$myvar^'asAf34%'                                YES
['+','*','^','%']                               NO
'^'=>2                                          NO
'asdfa5A_sdf'; 'asd5A_fasd'^'asd5A_flkj';       YES
'asdfa5A_'^$var1;                               YES
$var2^'asdfa5A_';                               YES
'asdf', '^', 'asdf';                            NO
'+', '-', '*', '/', '^', '_');                  NO
'+'=>0,'-'=>0,'*'=>0,'/'=>0,'^'=>1);            NO
'+'=>0,'-'=>0,'*'=>1,'/'=>1,'_'=>1,'^'=>2);     NO
'+', '-', '*', '/', '^'))) {                    NO

$ awk -F"'" '{for (i=1;i<=NF;i+=2) if ($i ~ /\^/) {print; next}}' file
'asdfasdf'; 'asdfasd'^'asdflkj';                YES
'asdf1524-sdfaA'^'sdfa322='                     YES
'sdfa22_'^$myvar                                YES
$myvar^'asAf34%'                                YES
'asdfa5A_sdf'; 'asd5A_fasd'^'asd5A_flkj';       YES
'asdfa5A_'^$var1;                               YES
$var2^'asdfa5A_';                               YES

The above works by splitting each line at every ' into a series of fields so odd numbered fields are outside of pairs of quotes while even numbered fields are inside pairs of quotes (e.g. out'in'out'in'out) and then you just have to look for a ^ in an odd-numbered field.

This would need more work to deal with newlines and/or escaped quotes inside strings if that is a possibility but by then you really should be looking at a language parser instead of a shell script.
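
For readers who want to see the field-splitting idea outside awk, here is a small sketch of the same logic in Python. The function name `has_caret_outside_single_quotes` is made up for this example. It shares the limits mentioned above: escaped quotes, multi-line strings and double-quoted strings are not handled.

```python
def has_caret_outside_single_quotes(line: str) -> bool:
    """Split on ' and look for ^ in the pieces outside quote pairs.

    awk's odd-numbered fields (1, 3, 5, ...) are Python's even indices
    (0, 2, 4, ...), hence the [0::2] slice.
    """
    fields = line.split("'")
    return any("^" in field for field in fields[0::2])


if __name__ == "__main__":
    samples = [
        "'asdfasdf'; 'asdfasd'^'asdflkj';",
        "['asdf', '^', 'asdf'];",
        "$myvar^'asAf34%'",
        "'^'=>2",
    ]
    for sample in samples:
        print(has_caret_outside_single_quotes(sample), sample)
```

A line such as `"afa^sadfa"` is still reported, because only single quotes are treated as string delimiters here.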

Ed Morton

Something like so: ^[^^,]+?(?<!')'?\^'?(?!')[^^,]+?$ should do what you are after. An example is available here.

npinti
  • What about `''o'^'o''`? – Anonymous May 19 '15 at 11:04
  • Thanks for the answer. The code I posted was only an example. Actually I tried a similar approach: `'[^']+'\^'[^']+'` but for some reason, it does not work with the grep command. Besides, this example would not work with things like `$myvar^'asdfads123'`. I need something to match `^` when it is not part of a string. – dbranco May 19 '15 at 11:12
  • @dbranco: Can you please change your question to include some more samples? – npinti May 19 '15 at 11:22
  • Unfortunately it looks like I cannot edit my question (I think I don't have enough reputation :( ) – dbranco May 19 '15 at 11:26
  • @dbranco: Please include them as a comment then :) – npinti May 19 '15 at 11:26
  • Done, added some examples. Thanks for your help BTW :) – dbranco May 19 '15 at 11:33
  • @dbranco: I cannot really find a pattern with the data provided, and I do not have much time at hand. My recommendation would be to delete this question and ask a new one. In the new question, please provide some samples you want to match and some you do not want to. Ideally, please include actual data. – npinti May 19 '15 at 11:45
  • I've provided more examples in the question. I don't know how to explain it better: basically, the goal is to tell when ^ is inside a PHP string and when it is not. I think that might require counting the leading `'` or `"`. – dbranco May 19 '15 at 12:05
  • @dbranco: What I am having a hard time grasping is: Is `['+','*','^','%']` an actual string or an array of strings, thus meaning that each sign will be interpreted on its own? – npinti May 19 '15 at 12:13
  • `['+','*','^','%']` is an example of ^ being inside a string with some string instances before and after. – dbranco May 19 '15 at 12:30
  • @dbranco: I've made an update, that being said, I am not 100% convinced. – npinti May 19 '15 at 12:53

I needed this to work in grep, so PCRE was not an option (it did not work properly even with pgrep). I eventually used an incredibly ugly and not-always-working regular expression:

^[^']*((('[^']*){1}|('[^']*){3}|('[^']*){5}|('[^']*){7}|('[^']*){9}|('[^']*){11})[^']+'\^.+|(('[^']*){0}|('[^']*){2}|('[^']*){4}|('[^']*){6}|('[^']*){8}|('[^']*){10})[^']+\^'.+)

This works for up to 5 strings declared before the operator, and then matches either [^']+\^'.+ or [^']+'\^.+. I know, I know... but it is the only way I found to make it work, and of course it only works for single-quoted strings. It worked perfectly with this example file:

'asdfa5A_sdf'; 'asd5A_fasd'^'asd5A_flkj';
'asdfa5A_'^$var1;
$var2^'asdfa5A_';
'asdf', '^', 'asdf';
'+', '-', '*', '/', '^', '_');
'+'=>0,'-'=>0,'*'=>0,'/'=>0,'^'=>1); 
'+'=>0,'-'=>0,'*'=>1,'/'=>1,'_'=>1,'^'=>2); 
'+', '-', '*', '/', '^'))) {

Better solutions are welcome :). Thanks to everybody who helped me with this, especially to @npinti, who spent a lot of time checking this!

dbranco
  • Why do you need it to work in grep? All platforms that have grep also have awk and the solution is clear, simple, and robust in awk and will work for any number of strings (see http://stackoverflow.com/a/30326792/1745001). – Ed Morton May 19 '15 at 13:29