I want to create a RegEx that finds strings that begin and end in single or double quotes.

For example I can match such a case like this:

String: "Hello World"
RegEx: /[\"\'][^\"\']+[\"\']/

However, the problem occurs when quotes appear in the string itself like so:

String: "Hello" World"

We know the above expression will not work.
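To make the failure concrete, here is a minimal sketch using Python's `re` module rather than PHP (the pattern syntax is the same for this case). The helper name `matches` is just for illustration.

```python
import re

# The character-class pattern from the question, as a Python raw string.
naive = re.compile(r"""["'][^"']+["']""")

def matches(text):
    """Return every substring the naive pattern matches."""
    return naive.findall(text)

simple = matches('say "Hello World" now')   # the whole quoted string
unescaped = matches('"Hello" World"')        # stops at the second quote
escaped = matches(r'"Hello\" World"')        # also stops at the escaped quote

print(simple, unescaped, escaped)
```

In both problem cases the match ends at the first inner quote, because the character class cannot tell an escaped quote from a closing one.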

What I want is to allow an escaped quote within the string itself, since that functionality will be required anyway:

String: "Hello\" World"

Now I could come up with a long and complicated expression with various patterns in a group, one of them being:

RegEx: /[\"\'][^\"\']+(\\\"|\\\')+[^\"\']+[\"\']/

However that to me seems excessive, and I think there may be a shorter and more elegant solution.

Intended syntax:

run arg1 "arg1" "arg3 with \"" "\"arg4" "arg\"\"5"

As you can see, the quotes are really only used to make sure that strings with spaces are counted as a single string. Do not worry about arg1; I should be able to match unquoted arguments.

To make this easier: arguments can only be quoted using double quotes, so I've taken single quotes out of the requirements of this question.

I have modified Rui Jarimba's example:

/(?<=")(\\")*([^"]+((\\(\"))*[^"])+)((\\"")|")/

This now accounts pretty well for most cases, however there is one final case that can defeat this:

run -a "arg3 \" p2" "\"sa\"mple\"\\"

The second argument ends with \\", which is the conventional way to allow a backslash at the end of a nested string. Unfortunately, the regex treats this as an escaped quote, since the sequence \" still appears at the end of the argument.
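One widely used idiom for this last case (not taken from this thread, so treat it as an assumption about what would fit) is to consume a backslash together with whatever follows it, so that \\ is eaten as a pair before the closing quote is considered. A sketch in Python's `re`, with `quoted_arg` as a made-up name:

```python
import re

# Double-quoted argument: a run of ordinary characters or backslash escapes.
#   [^"\\]  one character that is neither a quote nor a backslash
#   \\.     a backslash together with whatever character follows it
quoted_arg = re.compile(r'"(?:[^"\\]|\\.)*"')

command = r'run -a "arg3 \" p2" "\"sa\"mple\"\\"'
args = quoted_arg.findall(command)
print(args)

# A trailing escaped backslash no longer swallows the next argument.
pair = quoted_arg.findall(r'"a\\" "b"')
print(pair)
```

Because the two alternatives can never match at the same position, there is no ordering to get wrong and little room for backtracking. Note that `*` also accepts an empty `""` argument.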

Flosculus
  • On what basis is the script supposed to know which quotes to change and which ones to count as start/end parameters? – Peon Nov 21 '12 at 12:06
  • Depending on the outer quotes, yes, a larger string could contain more than one quoted nested string, so the regex would have to be able to find them all. If a nested string is encapsulated with single quotes, then any inside double quotes need not be escaped, and vice versa. – Flosculus Nov 21 '12 at 12:10
  • So you are looking for all text between the `first` and `last` quotes? – Peon Nov 21 '12 at 12:11
  • Think of it much like trying to find all strings in an SQL query. Any doubled double quotes (which tell the SQL engine that this is an escaped double quote) are counted as part of the string. The same principle needs to be applied here, but with a backslash. This solution won't be applied to SQL statements, however; I'm actually trying to create a command-line argument parser. – Flosculus Nov 21 '12 at 12:14
  • @Flosculus In fact, the right solution is much more complicated than that. Take a look at [this question](http://stackoverflow.com/questions/13360870/how-can-i-adapt-my-regex-to-allow-for-escaped-quotes). – Carlos Nov 21 '12 at 12:14
  • @jackflash the question you linked is a lot more difficult than what we have here. the other one attempts to find strings inside quote while allowing escaped quotes. this one just tries to find quoted strings. – Martin Ender Nov 21 '12 at 12:24
  • @m.buettner If you check the accepted answer you'll see a regex for validating quoted strings. – Carlos Nov 21 '12 at 12:32
  • Hi Flosculus, try this one: `['"]([^'"]+((\\(\"|'))*[^'"])+)['"]`. See my answer below – Rui Jarimba Nov 21 '12 at 12:40

2 Answers

4

First, use single-quoted (') PHP strings to write your regexes. That saves you a lot of escaping.

Then I see two possibilities. The problem with your attempt is that it allows consecutive escaped quotes in only one place in the string. It also allows different quotes at the beginning and the end; you can use a backreference to get around that. So this would be a) slightly more elegant and b) correct:

$pattern = '/(["\'])(\\\\"|\\\\\'|[^"\'])+\1/';

Note that the order of the alternation is important!
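To see why the order matters, here is a small Python sketch (Python's `re`, not PHP) of the backreference idea, assuming each escaped-quote alternative is meant to match a literal backslash followed by a quote. The names `escapes_first`, `class_first` and `whole_matches` are invented for the example.

```python
import re

# Group 1 captures the opening quote; \1 demands the same character to close.
# The escaped-quote alternatives come first, as the answer recommends.
escapes_first = re.compile(r"""(["'])(?:\\"|\\'|[^"'])+\1""")

# Same alternatives in the wrong order: the character class grabs the backslash.
class_first = re.compile(r"""(["'])(?:[^"']|\\"|\\')+\1""")

def whole_matches(pattern, text):
    """Collect full matches (findall would return only the captured quote)."""
    return [m.group(0) for m in pattern.finditer(text)]

text = r'say "Hello\" World" now'
good = whole_matches(escapes_first, text)   # the full quoted string
bad = whole_matches(class_first, text)      # cut off after the escaped quote
mismatched = whole_matches(escapes_first, '"abc\'')  # different delimiters

print(good, bad, mismatched)
```

With the character class first, the backslash is consumed as an ordinary character and the following quote is then taken as the closing delimiter.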

The drawback is that this also forces you to escape the quote character that is not delimiting the string, which you shouldn't have to do. Therefore, the other possibility is to use lookarounds (since backreferences cannot be used inside character classes):

$pattern = '/(["\'])(?:(?!\1).|(?<=\\\\)\1)+\1/';

Note that four consecutive backslashes are always necessary to match a single literal backslash. That is because in the actual string $pattern they end up as \\ and then the regex engine "uses" the first one to escape the second one.

This pattern matches either an arbitrary character that is not the starting quote, or the starting quote itself if the previous character was a backslash.
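The same lookaround pattern can be tried in Python's `re`, which supports both the backreference inside the lookahead and the fixed-width lookbehind. This is only a sketch; `lookaround` and the sample text are made up.

```python
import re

# (?!\1).      any character that is not the opening quote
# (?<=\\)\1    the opening quote itself, but only right after a backslash
lookaround = re.compile(r"""(["'])(?:(?!\1).|(?<=\\)\1)+\1""")

text = r'''say "it's \"ok\"" now'''
match = lookaround.search(text)
print(match.group(0))
```

The unescaped apostrophe inside the double-quoted string is accepted here, which is exactly what the plain backreference version could not do.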

Working demo.

This by the way is equivalent to:

$pattern = '/(["\'])(?:\\\\\1|(?!\1).)+\1/';

But here you have to write the alternation in this order again.

Working demo.

One final note. You can avoid the backreference by providing the two possible strings separately (single and double quoted strings):

$pattern = '/"(?:\\\\"|[^"])+"|\'(?:\\\\\'|[^\'])+\'/';

But you said you were looking for something short and elegant ;) (although, this last one might be more efficient... but you'd have to profile that).
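For comparison, a Python sketch of the two-branch version, written with `re.VERBOSE` so each branch sits on its own line. The name `either_quote` and the sample input are assumptions for illustration.

```python
import re

# One branch per delimiter, so no backreference is needed.
either_quote = re.compile(r'''
    "(?:\\"|[^"])+"    # double-quoted form
  | '(?:\\'|[^'])+'    # single-quoted form
''', re.VERBOSE)

text = r'''run "it's" 'say "hi"' "a \" b"'''
found = either_quote.findall(text)
print(found)
```

Since the pattern has no capturing groups, `findall` returns the whole matches directly.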

Note that all my regexes leave one case unconsidered: escaped quotes outside of quoted strings. For example, Hello \" World "Hello" World will give you " World ". You can avoid this using a negative lookbehind (using as an example the second regex for which I provided a working demo; it would work the same for all the others):

$pattern = '/(?<!\\\\)(["\'])(?:\\\\\1|(?!\1).)+\1/';
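A quick Python check of the difference the lookbehind makes, using the answer's backreference pattern with and without the guard. The names `unguarded`, `guarded` and `whole_matches` are hypothetical.

```python
import re

# The answer's backreference pattern, without and with the lookbehind guard.
unguarded = re.compile(r"""(["'])(?:\\\1|(?!\1).)+\1""")
guarded = re.compile(r"""(?<!\\)(["'])(?:\\\1|(?!\1).)+\1""")

def whole_matches(pattern, text):
    """Collect full matches instead of the captured opening quote."""
    return [m.group(0) for m in pattern.finditer(text)]

text = r'Hello \" World "Hello" World'
before = whole_matches(unguarded, text)   # starts at the escaped quote
after = whole_matches(guarded, text)      # skips it and finds the real string

print(before, after)
```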
Martin Ender
  • Can't see here the downvoting guy I was talking 'bout in the other question. If you're hinting I'm that guy I must say I'm not. In fact, I upvoted your answer. – Carlos Nov 21 '12 at 12:57
  • @jackflash No I didn't imply this. I just saw your answer, and I've been downvoted a lot recently, and never without an explanation. So I just wanted to express my sympathy with you and ridgerunner. – Martin Ender Nov 21 '12 at 12:59
  • Oh, ok! There's one guy who I know downvotes me systematically just for an argument we had one day. – Carlos Nov 21 '12 at 14:21
1

Try this regex:

['"]([^'"]+((\\(\"|'))*[^'"])+)['"]

Given the following string:

"Hello" World 'match 2' "wqwqwqwq wwqwqqwqw" no match here oopop "Hello \" World"

It will match

"Hello"
'match 2'
"wqwqwqwq wwqwqqwqw"
"Hello \" World"
Rui Jarimba
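The matches listed in this answer can be reproduced with a short Python sketch (the answer was tested on .NET; Python's `re` accepts the same pattern). `answer_pattern` and `mixed` are names chosen for the example.

```python
import re

answer_pattern = re.compile(r"""['"]([^'"]+((\\(\"|'))*[^'"])+)['"]""")

sample = r'''"Hello" World 'match 2' "wqwqwqwq wwqwqqwqw" no match here oopop "Hello \" World"'''

# finditer + group(0) yields whole matches; findall would return the groups.
found = [m.group(0) for m in answer_pattern.finditer(sample)]
for item in found:
    print(item)

# Opening and closing quotes are not tied together, so this also matches.
mixed = answer_pattern.search("\"abc'")
print(mixed.group(0))
```

The last check illustrates the point raised in the comments: `'` and `"` are treated interchangeably as delimiters.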
  • Fixed the regex. PS: I'm testing using .NET regexes, but it should work with PHP. – Rui Jarimba Nov 21 '12 at 12:28
  • your escaping is inconsistent (you only escape the double quote once). otherwise it should work (apart from not distinguishing between the two delimiter possibilities) – Martin Ender Nov 21 '12 at 12:39
  • That's fixed too. Now it's handling more than 1 escaped quote – Rui Jarimba Nov 21 '12 at 12:43
  • yup, I already noticed (that's why I edited my comment). now only the escaping and the fact that `'` and `"` are treated interchangeably are left. – Martin Ender Nov 21 '12 at 12:46
  • I noticed another problem about both our answers. Escaped quotes in front of a quoted string will cause that string to start early. I already fixed mine, so have a look at my last example to see what I mean. – Martin Ender Nov 21 '12 at 12:52
  • I should point out that the syntax of the nested string(s) is important in the functionality. This is not something you have to worry about since if the syntax is wrong then it doesn't matter if the regex works – Flosculus Nov 21 '12 at 13:59
  • @Flosculus then, I believe my answer should work fine for you? – Martin Ender Nov 21 '12 at 14:24
  • Thanks @m.buettner for pointing me that issue. One cannot be too careful regarding regular expressions :) – Rui Jarimba Nov 21 '12 at 14:27
  • @m.buettner Partially yes, however the example starts with the strings in an array. My case will have multiple nested strings, so separating the strings is an important part of this. – Flosculus Nov 21 '12 at 14:30
  • @Flosculus what do you mean by nested? something like the last example in your question? I only used an array in my demo, to show you multiple possible input strings. as you can see, some of them contain multiple strings themselves. and they work fine too. simply assign your input to `$str` and use the code from within the `foreach` only. that array and loop was only for demonstration purposes. – Martin Ender Nov 21 '12 at 14:33
  • @m.buettner Nested as in a conceptual string inside a string. For example when rendering JavaScript function calls in PHP (not that it's a good idea) like `echo('myfunction(\''.$value.'\');');`. That's what I'm referring to. – Flosculus Nov 21 '12 at 14:41
  • @Flosculus please simply try out my answer (and look at the individual strings inside the array in the demos). this is exactly what my regex takes care of. – Martin Ender Nov 21 '12 at 14:42
  • This solution is as close as I require to do what I need. The finer touches I can alter with PHP directly. Thanks for your help everyone. And +1 to m.buettner, your examples do work, I will incorporate some of the concepts. – Flosculus Nov 21 '12 at 21:00