$str = "'ei-1395529080',0,0,1,1,'Name','email@domain.com','Sentence with \'escaped apostrophes\', which \'should\' be on one line!','no','','','yes','6.50',NULL";

preg_match_all("/(')?(.*?)(?(1)(?!\\\\)'),/s", $str.',', $values);
print_r($values);

I'm trying to write a regex with these goals:

  1. Return an array of comma-separated values (note that I append a comma to $str in the preg_match_all call)
  2. If the array item starts with an ', match the closing '
  3. But, if it is escaped like \', keep capturing the value until an ' with no preceding \ is found

If you try out those lines, the pattern misbehaves when it encounters \'.

Can anyone please explain what is happening and how to fix it? Thanks.

daviestar

2 Answers


This is how I would go about solving this:

('(?>\\.|.)*?'|[^\,]+)

Regex101

Explanation:

(              Start capture group
    '          Match an apostrophe
    (?>        Atomically match the following
        \\.    Match \ literally and then any single character
        |.     Or match just any single character
    )          Close atomic group
    *?'        Match previous group 0 or more times until the first '
    |[^\,]     OR match any character that is not a comma (,)
    +          Match the previous regex [^\,] one or more times
)              Close capture group

A note about how the atomic group works:

Say I had this string 'a \' b'

The atomic group (?>\\.|.) will match this string in the following way at each step:

  1. '
  2. a
  3. \'
  4. b
  5. '

If a later part of the pattern fails, the engine will not backtrack into the atomic group and retry \' as \ followed by '. The first alternative is always tried first, and once it has matched inside the group that choice is kept.


If you need help escaping the regex, here's the escaped version: ('(?>\\\\.|.)*?'|[^\\,]+)
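As a rough check, the escaped version can be dropped into the question's own test case. This is only a sketch: the `/…/s` delimiters and flag are taken from the question, and the variable names `$pattern`, `$matches` and `$values` are placeholders chosen for illustration.

```php
<?php
// Test row from the question. In a double-quoted PHP string, \' is not an
// escape sequence, so the backslash and the apostrophe are both kept.
$str = "'ei-1395529080',0,0,1,1,'Name','email@domain.com','Sentence with \'escaped apostrophes\', which \'should\' be on one line!','no','','','yes','6.50',NULL";

// The escaped version of the pattern, wrapped in delimiters.
// PHP turns \\\\ into \\ and \\, into \, before PCRE sees the pattern.
$pattern = "/('(?>\\\\.|.)*?'|[^\\,]+)/s";

preg_match_all($pattern, $str, $matches);
$values = $matches[1];

echo count($values), "\n"; // 14
echo $values[7], "\n";     // the whole sentence, still wrapped in its outer apostrophes
```

Quoted values keep their surrounding apostrophes with this pattern, so any stripping has to happen afterwards.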


although i spent about 10 hours writing regex yesterday, i'm not too experienced with it. i've researched escaping backslashes but was confused by what i read. what's your reason for not escaping in your original answer? does it depend on different languages/platforms? ~OP

Section on why you have to escape regex in programming languages.

When you write the following string:

"This is on one line.\nThis is on another line."

Your program will interpret the \n as an escape sequence for a line break, so it sees the string the following way:

"This is on one line.
 This is on another line."

In a regular expression, this can cause a problem. Say you wanted to match all characters that were not line breaks. This is how you would do that:

"[^\n]*"

However, when the pattern is written as a string in a programming language, the \n is converted to a real line break before the regex engine ever sees it, so the pattern will be seen the following way:

"[^
 ]*"

Which, I'm sure you can tell, is not what we wrote. So to fix this we escape the backslash. By placing a backslash in front of the first backslash we can tell the programming language to look at \n differently (the same goes for any other escape sequence: \r, \t, \\, etc.). On a basic level, escaping trades the original escape sequence \n for a different escape sequence followed by a plain character: \\ and then n. This is how escaping affects the regex above:

"[^\\n]*"

The way the programming language will see this is the following:

"[^\n]*"

This is because \\ is an escape sequence that means "When you see \\ interpret it literally as \". Because \\ has already been consumed and interpreted, the next character to read is n and therefore is no longer part of the escape sequence.
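A small sketch of the difference, assuming PHP double-quoted strings (which process \n and \\ as described here). The variable names `$unescaped` and `$escaped` are just for illustration.

```php
<?php
// "\n" in a double-quoted PHP string becomes one line-break character.
$unescaped = "[^\n]*";

// "\\n" becomes a backslash followed by the letter n.
$escaped = "[^\\n]*";

var_dump(strlen($unescaped)); // int(5): [ ^ <line break> ] *
var_dump(strlen($escaped));   // int(6): [ ^ \ n ] *

// Single-quoted PHP strings do not process \n, so this is the same 6 characters.
var_dump($escaped === '[^\n]*'); // bool(true)
```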

So why do I have 4 backslashes in my escaped version? Let's take a look:

(?>\\.|.)

So this is the original regex we wrote. We have two consecutive backslashes. This section (\\.) of the regular expression means "Whenever you see a backslash and then any character, match". To preserve this interpretation for the regex engine, we have to escape each individual backslash in the string literal:

\\ \\ .

So all together it looks like this:

(?>\\\\.|.)
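To see the two layers separately, here is a minimal sketch in PHP. The `/` delimiters and the sample subject are assumptions added for the demonstration.

```php
<?php
// Four backslashes in the PHP source...
$pattern = "/(?>\\\\.|.)/";

// ...leave two backslashes in the string that is handed to the regex engine.
var_dump(substr_count($pattern, "\\")); // int(2)

// The regex engine reads those two as "one literal backslash", so \\. matches
// a backslash plus the character after it. The subject here is: \'x
preg_match($pattern, "\\'x", $m);
var_dump($m[0] === "\\'"); // bool(true): the match is the two characters \'
```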
d0nut
  • This currently isn't working for me on the test case. I'm using it like `/('(?>\.|.)*?'|[^\,]+)/s` – daviestar Dec 11 '15 at 03:55
  • @daviestar you have to escape backslashes when using regular expressions in programming languages... You have already shown that you know how to do it in your attempt `(?!\\\\)` – d0nut Dec 11 '15 at 04:21
  • @daviestar I have added in an escaped version as an edit in my answer. – d0nut Dec 11 '15 at 04:23
  • although i spent about 10 hours writing regex yesterday, i'm not too experienced with it. i've researched escaping backslashes but was confused by [what i read](http://stackoverflow.com/questions/11044136/right-way-to-escape-backslash-in-php-regex#15369828). what's your reason for not escaping in your original answer? does it depend on different languages/platforms? – daviestar Dec 11 '15 at 14:05
  • @daviestar no problem. The reason why I didn't escape originally is because escaping is not something that is specific to regular expressions being placed into programming languages, it's specific to strings. I'll edit my answer with a small section explaining this. – d0nut Dec 11 '15 at 14:23
  • @daviestar to explain why i didn't escape it originally: I wrote the regular expression in its purest form: without escaping. This is because you can take my regular expression and plug it into tons of different tools or websites and see it work just fine. The fact that you have to escape regexes in programming languages is just a consequence of the programming language itself and is not specific to regular expressions. – d0nut Dec 11 '15 at 14:42
  • thanks for all your advice. i've been evaluating which technique i prefer to mark as best answer, and so far i prefer yours as i end up with 1 array of values and not 2.. however both answers capture the apostrophes at the start and end of the strings unlike mine.. is it possible to not capture those? – daviestar Dec 11 '15 at 15:00
  • @daviestar that's not too hard at all. Let me make one more edit to not include `'` – d0nut Dec 11 '15 at 15:28
  • @daviestar hmm, it seems harder than I thought, the way I wrote it, to exclude the apostrophes in a clean way. Personally, i would just use `trim($string, "'")` on each item in the resulting list to remove the starting and ending apostrophes – d0nut Dec 11 '15 at 15:52
  • that's a nice way of doing it, i've been using `substr($value, 1, -1)` (i have other code which already knows if it's an int or a string). i like your way better! didn't know about 2nd arg on `trim()` – daviestar Dec 11 '15 at 15:56
  • @daviestar yea, it's an optional parameter which allows you to trim a whole set of characters. In our case, however, we only need to trim `"'"` :) – d0nut Dec 11 '15 at 15:57
  • Using `trim($string, "'")` is a bad idea, it would incorrectly convert the string `'\''` to a single backslash when it should be `\'` (an escaped apostrophe). – Dean Taylor Dec 13 '15 at 03:13
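The difference raised in the last comment can be checked with a short sketch. The variable names are placeholders, and the final `str_replace` step assumes that \' is the only escape that needs undoing.

```php
<?php
// The quoted value '\'' as it appears in the data: apostrophe, backslash, apostrophe, apostrophe.
$token = "'\\''";

// trim() strips every leading and trailing apostrophe, including the escaped one.
$trimmed = trim($token, "'");

// substr() removes exactly one character from each end.
$sliced = substr($token, 1, -1);

var_dump($trimmed); // string(1): a lone backslash
var_dump($sliced);  // string(2): backslash + apostrophe

// Unescaping afterwards gives the intended value, a single apostrophe.
$unescaped = str_replace("\\'", "'", $sliced);
var_dump($unescaped); // string(1) "'"
```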

Something like this: (?:'([^'\\]*(?:\\.[^'\\]*)*)'|([^,]+))

Regular expression visualization

# (?:'([^'\\]*(?:\\.[^'\\]*)*)'|([^,]+))
# 
# Options: Case sensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Greedy quantifiers
# 
# Match the regular expression below «(?:'([^'\\]*(?:\\.[^'\\]*)*)'|([^,]+))»
#    Match this alternative (attempting the next alternative only if this one fails) «'([^'\\]*(?:\\.[^'\\]*)*)'»
#       Match the character “'” literally «'»
#       Match the regex below and capture its match into backreference number 1 «([^'\\]*(?:\\.[^'\\]*)*)»
#          Match any single character NOT present in the list below «[^'\\]*»
#             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
#             The literal character “'” «'»
#             The backslash character «\\»
#          Match the regular expression below «(?:\\.[^'\\]*)*»
#             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
#             Match the backslash character «\\»
#             Match any single character that is NOT a line break character (line feed) «.»
#             Match any single character NOT present in the list below «[^'\\]*»
#                Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
#                The literal character “'” «'»
#                The backslash character «\\»
#       Match the character “'” literally «'»
#    Or match this alternative (the entire group fails if this one fails to match) «([^,]+)»
#       Match the regex below and capture its match into backreference number 2 «([^,]+)»
#          Match any character that is NOT a “,” «[^,]+»
#             Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»

https://regex101.com/r/pO0cQ0/1

preg_match_all('/(?:\'([^\'\\\\]*(?:\\\\.[^\'\\\\]*)*)\'|([^,]+))/', $subject, $result, PREG_SET_ORDER);
for ($matchi = 0; $matchi < count($result); $matchi++) {
    // @todo here use $result[$matchi][1] to match quoted strings (to then process escaped quotes)
    // @todo here use $result[$matchi][2] to match unquoted strings
}
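One way the two @todo branches might be filled in, as a sketch rather than a finished parser. The `$subject` row is a made-up sample, the `$values` array is a placeholder name, and mapping an unquoted NULL to PHP `null` is an assumption about what the caller wants. It relies on `PREG_SET_ORDER` leaving group 2 unset when the quoted alternative matched.

```php
<?php
// Hypothetical sample row: 'ei-1395529080',0,'It\'s','',NULL,'6.50'
$subject = "'ei-1395529080',0,'It\\'s','',NULL,'6.50'";

preg_match_all('/(?:\'([^\'\\\\]*(?:\\\\.[^\'\\\\]*)*)\'|([^,]+))/', $subject, $result, PREG_SET_ORDER);

$values = [];
foreach ($result as $match) {
    if (isset($match[2])) {
        // Unquoted value: keep the raw text, but treat a bare NULL as null (assumption).
        $values[] = ($match[2] === 'NULL') ? null : $match[2];
    } else {
        // Quoted value: group 1 already excludes the outer apostrophes,
        // so only the inner \' sequences need to be unescaped.
        $values[] = str_replace("\\'", "'", $match[1]);
    }
}

print_r($values);
```

This yields a single array in column order, with quoted empty strings kept as `""` and distinguishable from `NULL`.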
Dean Taylor
  • What tool are you using that creates that awesome flow diagram? – d0nut Dec 11 '15 at 03:50
  • This works with the 4 backslashes, like your 2nd example :) Thanks! – daviestar Dec 11 '15 at 04:01
  • @DeanTaylor thanks for your help, but in the end I chose the other regex as it returns all the values in one array, which makes a bit more sense in my project. – daviestar Dec 11 '15 at 16:06
  • @daviestar It probably doesn't matter to you - but the other Regular Expression may well be many times slower to even match the first part of your example string `'ei-1395529080'`. The other regex takes 56 steps mine only takes 7 steps. – Dean Taylor Dec 11 '15 at 16:55
  • @daviestar And for your full example string 487 steps vs. 98 steps. – Dean Taylor Dec 11 '15 at 17:02
  • That's interesting, I originally intended to choose the best answer based on performance (a 40MB SQL dump converted to separate JSON files for each table), however it was too fiddly to make your regex work in the larger project. I am blown away by the performance of @iismathwizard's answer - 1.8 seconds – daviestar Dec 11 '15 at 17:15
  • Is it viable to remove the apostrophes from the returned strings with your regex? If so, I'll spend some time refactoring my script and post benchmarks – daviestar Dec 11 '15 at 17:24
  • @daviestar according to the benchmarks on regex101, my regex is about 3 times as slow as Dean's. If your data has a lot of `'` wrapped content it'll be slower. I wrote the regular expression for simplicity, however. – d0nut Dec 11 '15 at 18:48
  • The outer quotes are already removed with my version, you just have to do a replace on the inner escaped quotes. – Dean Taylor Dec 11 '15 at 21:18
  • @DeanTaylor I've benchmarked both versions (average of 10) and yours comes in faster at around 1.4 seconds vs. 2.2 seconds. However side effects of your pattern are: 2 arrays of parsed values is annoying to deal with; and as strings automatically lose their outer apostrophes, NULL values end up as literally `""` in the JSON output - which of course makes sense but it's just another thing to work around, so I am sticking with my original correct answer which for my 'fast and loose' project, works perfectly. Thanks. – daviestar Dec 15 '15 at 11:51
  • @DeanTaylor to be accurate, the `NULL` values are handled correctly by your regex, it's just that the 2 resulting arrays are not easily aligned with the column types, so I can't use `if(colType === 'int') $result[2] else $result[1]`. – daviestar Dec 15 '15 at 12:24
  • You get performance gains from matching strings separately, and you also process the unescaping of quotes correctly this way. You can easily know the column type: in my example `$matchi` will always be the column index, lending to easy lookup of the column type. But inside the loop it's easier to think `if ( isset( $result[$matchi][2] ) ) { /* Not a string */ }`. Additionally, `$result[$matchi][0]` is the complete matched string including the quotes if you really want them. – Dean Taylor Dec 16 '15 at 01:52