Intro

In PHP, how would I split a line with this syntax:

<As's\\as'dsd> asqwedasd <sa sdasd> [a sadasd] [<asdsad> [as ddsd]] 'asdsad assd'

into this?

array(6) {
  [0]=>
  string(14) "<As's\\as'dsd>"
  [1]=>
  string(9) "asqwedasd"
  [2]=>
  string(10) "<sa sdasd>"
  [3]=>
  string(10) "[a sadasd]"
  [4]=>
  string(20) "[<asdsad> [as ddsd]]"
  [5]=>
  string(13) "'asdsad assd'"
}

More detailed explanation

I'm not the best at explaining, so I hope the example above makes my situation clear, but here is a fuller description anyway:

I want to split this string by every space except some specific ones:

  • If the space is inside angle brackets or square brackets, it should NOT split the line there. See items [2] and [3].
  • A bracket can potentially appear inside another bracket. The whole thing should be returned as one string. See item [4].
  • Some items are not wrapped in brackets. See item [1].
  • Items not wrapped in brackets will NOT contain spaces UNLESS they are quoted with apostrophes. See item [5].
  • The items can contain all UTF-8 characters EXCEPT for [ ] < >

Sources that could possibly help

Explode string except where surrounded by parentheses?


Thank you in advance! I know this is a humongous task but I have absolutely no idea how to do this myself.

  • Just need to learn how to use regexp in this case, there's nothing more to it. Here's a link to some [extremely useful information regarding lookahead and lookbehinds](http://www.regular-expressions.info/lookaround.html) – Ohgodwhy Jun 15 '14 at 19:56
  • @EugenRieck I've read on here that `preg_split()` doesn't know if something's in quotes. –  Jun 15 '14 at 19:58
  • @Ohgodwhy Could you come with an example? I've tried using regex but I came to the conclusion that it wasn't possible. I am terrible at it though, so it may just be my lack of skill. –  Jun 15 '14 at 19:59
  • preg_* doesn't have the concept of something "in quotes" ! – Eugen Rieck Jun 15 '14 at 20:00
  • Sorry, no, I can't. I'm awful at regexp, it would take me hours to put together a functional example. I'm sure there are experts here who could do it in a few minutes. – Ohgodwhy Jun 15 '14 at 20:00
  • @EugenRieck Well that's what I read on here. As I've said, I'm terrible at it. –  Jun 15 '14 at 20:01
  • @Ohgodwhy It's fine. Just for the record, this is from a guy I know who's good at regex: https://twitter.com/Darksonn/status/478265704987525120 "Something more sophisticated is needed" –  Jun 15 '14 at 20:01
  • You're in for the long haul here. Can I ask why you need to do this, anyway? perhaps there's a better solution that can be employed to your particular case. – Ohgodwhy Jun 15 '14 at 20:03
  • @Ohgodwhy This is code read from a private API that needs to be split into something more usable as this format is horrible. –  Jun 15 '14 at 20:04
  • Added working PHP demo to my answer. It will give you the "splits". :) – zx81 Jun 15 '14 at 20:28
  • You may really need a parser. https://www.google.ie/webhp?sourceid=chrome-instant&ion=1&espv=2&es_th=1&ie=UTF-8#q=writing%20a%20parser%20in%20php – Toby Allen Jun 15 '14 at 20:31
  • I like modular regex, [here](http://regex101.com/r/qR1oU2)'s a solution you might use with `preg_split()`. Sorry for the lack of explanation, too busy :) – HamZa Jun 15 '14 at 20:49
  • @HamZa That's a beautiful solution. :) – zx81 Jun 15 '14 at 22:10

2 Answers

With all the disclaimers about using regex to parse html... And only if you're ready for some recursive beauty...

Matching What you Want vs. Splitting on What You Don't Want

If you're going to use regex, in this case, to get your array, matching what you do want will be easier than splitting on what you don't want. Here is a starting place, which we can refine:

(\[(?:[^[\]]++|(?1))*\])|<[^>]*>|'[^']*'|[!-~]+

See demo.

How it works:

  • We match several possibilities, separated by the alternation operator |
  • The first match option (\[(?:[^[\]]++|(?1))*\]) recursively matches all [sets of [brackets]]
  • The <[^>]*> matches <angle-bracketed items> such as <sa sdasd>
  • The '[^']*' matches 'complete quotes'. If needed, it could be improved to account for potential escaped quotes \'
  • The [!-~]+ matches any non-space printable characters that remain. It is a guess, based on the lone word asqwedasd in your input, and that too could be refined. For instance, if you want to specify, for validation purposes, that the leftover strings have no <>[] characters, you can use this instead (suggested by @CasimiretHippolyte) \s*\K[^[<]+(?<!\s)
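
The escaped-quote refinement mentioned in the list above could look roughly like the sketch below. The quote branch '(?:[^'\\]|\\.)*' is an assumption about how to allow \' inside a quoted item, and the sample input is made up for illustration; everything else is the starting pattern unchanged.

```php
<?php
// Nowdoc strings keep backslashes literal, so the regex reads exactly as typed.
// Only the third alternative differs from the starting pattern: it now accepts
// either a non-quote, non-backslash character or a backslash-escaped character.
$regex = <<<'RE'
%(\[(?:[^[\]]++|(?1))*\])|<[^>]*>|'(?:[^'\\]|\\.)*'|[!-~]+%
RE;

// Hypothetical input containing an escaped apostrophe inside a quoted item.
$input = <<<'TXT'
'it\'s here' plain [a [b c]] <x y>
TXT;

// $m[0] holds the full matches, one per item.
preg_match_all($regex, $input, $m);
print_r($m[0]);
```

With the unmodified '[^']*' branch, the quoted item would end at the escaped apostrophe, so the input would be cut in the wrong place.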

Sample code

See the output of this demo. The array $m[0] contains the "splits" you wanted.

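// Match each item rather than splitting on the spaces between items
$regex = "%(\[(?:[^[\]]++|(?1))*\])|<[^>]*>|'[^']*'|[!-~]+%";
$string = "<As's\\as'dsd> asqwedasd <sa sdasd> [a sadasd] [<asdsad> [as ddsd]] 'asdsad assd'";
// $m[0] receives the full matches; $count is the number of items found
$count = preg_match_all($regex, $string, $m);
print_r($m[0]);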

Another Solution

@HamZa came up with another solution which I find quite beautiful. He didn't want to post it himself, but was happy for me to add it here for completeness.

How does it work? The idea is to match the right space characters, and to split on them. The base principle for this is explained in detail in this question about "regex-matching a pattern unless...". First, in a similar fashion to my regex (but with more checks and recursion), he defines all the groups we want to match, and matches them. Then, he uses (*SKIP)(*F) to make the regex fail if these groups are matched, after which the engine skips to the position in the string that follows the last character that was matched. On the other side of the alternation, he matches the space characters we will split on, and we know these are the right space characters because they were not matched by the expression on the left. At this stage, we can use preg_split.

A further refinement is the use of what I call the HRRT, which stands for the HamZa Regex Refactoring Technique. To make the regex digestible, he breaks it down into smaller named patterns: singlequotes, brackets and so on. This lets him define another name: skippable, for all these groups. After the definitions, the matching begins. If we can match the skippable pattern, the regex fails with (*SKIP)(*F) and the engine skips to the next position in the string.

That is the gist of it.

Here's the demo.

(?(DEFINE)
   (?P<signs>
      <
         (?:
            [^<>]
            |
            (?&signs)
         )*
      >
   )

   (?P<brackets>
      \[
         (?:
            [^][]
            |
            (?&brackets)
         )*
      \]
   )

   (?P<singlequotes>
      (?<!\\)'(?:[^\\]|\\.)*?'
   )

   (?P<doublequotes>
      (?<!\\)"(?:[^\\]|\\.)*?"
   )

   (?P<quotes>
      (?&singlequotes)|(?&doublequotes)
   )

   (?P<skippable>
      (?&brackets)|(?&signs)|(?&quotes)
   )
)

(?&skippable)(*SKIP)(*FAIL)
|
[ ]+
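
For reference, here is a minimal sketch of how a pattern like the one above might be passed to preg_split. It is a condensed variant, not HamZa's exact demo: the double-quote group is left out because the question only uses apostrophes, and the ~ delimiter, the x flag, and PREG_SPLIT_NO_EMPTY are choices made for this illustration.

```php
<?php
// The x flag lets the pattern span several lines; whitespace outside character
// classes is ignored, which is why the space to split on is written as [ ].
$pattern = <<<'RE'
~
(?(DEFINE)
   (?P<signs>        <  (?: [^<>] | (?&signs)    )* >  )
   (?P<brackets>     \[ (?: [^][] | (?&brackets) )* \] )
   (?P<singlequotes> (?<!\\)'(?:[^\\]|\\.)*?' )
   (?P<skippable>    (?&brackets) | (?&signs) | (?&singlequotes) )
)
(?&skippable)(*SKIP)(*FAIL)   # consume a protected group, then discard the match
|
[ ]+                          # any spaces left over are the real separators
~x
RE;

// The line from the question, kept literal by the nowdoc (two backslashes).
$input = <<<'TXT'
<As's\\as'dsd> asqwedasd <sa sdasd> [a sadasd] [<asdsad> [as ddsd]] 'asdsad assd'
TXT;

// PREG_SPLIT_NO_EMPTY drops empty pieces caused by leading or trailing spaces.
$parts = preg_split($pattern, $input, -1, PREG_SPLIT_NO_EMPTY);
var_dump($parts);
```

Because the skippable groups are consumed and then discarded, the only spaces that ever reach the right-hand side of the alternation are the ones outside brackets and quotes.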
zx81
  • @Downvoter, care to explain the downvote on this working answer? – zx81 Jun 15 '14 at 20:30
  • @Locercus `why wouldn't preg_split work?` Sometimes, to split it's easier to match, and sometimes to match it's easier to split. They're two ways to look at the same thing. Imagine this sequence: white, black, white, black.... You want all the whites. You can either split on black, or match the whites. – zx81 Jun 15 '14 at 20:32
  • I didn't downvote, but doesn't this guy need a parser, not a regex? – Toby Allen Jun 15 '14 at 20:32
  • @TobyAllen Yeah, I was going too fast and forgot the first line `With all the disclaimers about using regex to parse html...` :) Thanks! – zx81 Jun 15 '14 at 20:34
  • @Locercus If you have further questions, don't hesitate. :) – zx81 Jun 15 '14 at 20:38
  • This did exactly what I wanted it to. Thank you! –  Jun 15 '14 at 20:39
  • This was a bit tricky because of recursion and matching. If you enjoyed it and are interested in other regex tricks, I suggest you have a look at this one or save it for later, about [Matching or replacing patterns unless...](http://stackoverflow.com/questions/23589174/match-or-replace-a-pattern-except-in-situations-s1-s2-s3-etc/23589204#) I had a lot of fun writing it. – zx81 Jun 15 '14 at 20:44
  • @zx81: `[!-~]` doesn't fit the last requirement. You can use this instead `\s*\K[^[<]+(?<!\s)` – Casimir et Hippolyte Jun 15 '14 at 21:06
  • @CasimiretHippolyte For the last one, we don't know what the requirement is, which is why I said `It is a guess`. The various `[braces]` and `<tags>` will already have been matched by the alternations on the left, so for the last one, I didn't try to avoid them... What if he had `1+1<3` as a valid match? But your expression would definitely work too. :) – zx81 Jun 15 '14 at 21:17
  • @zx81: I will try with elements that contain substrings like `1+1<3` (or with `[`) to see how to deal with them, it could be interesting. But now I have a <°)))))))> to cook. – Casimir et Hippolyte Jun 15 '14 at 21:24
  • @CasimiretHippolyte We both know that with any unbalanced braces the whole thing explodes... Because the `1+1<3` could come before a `<tag>`... That's why I say don't worry, trust the input, match balanced braces, then vacuum whatever's left with the rightmost alternation. Or use your expression if we want to specify "no more braces", that's totally cool too. Will add that to the answer as an alternative. – zx81 Jun 15 '14 at 21:28
  • @CasimiretHippolyte Done. The sentence starting with "For instance, " – zx81 Jun 15 '14 at 21:31
  • @zx81: in this kind of situation, the only thing that can be done is to choose the default behaviour. Note that alpha bravo has tried to check the whole string syntax with `\G`, it's not a bad idea (and it is close to a split approach). – Casimir et Hippolyte Jun 15 '14 at 21:33
  • @CasimiretHippolyte Agreed. At the moment his [demo](http://regex101.com/r/sY5aB5) seems broken though (unbalanced `[]`). – zx81 Jun 15 '14 at 21:40

Updated: this pattern also worked for me:
(\[(?:[^\[\]]*?|(?R))*\])|(<.*?>)|\G\s([^<>\[\]]+)
Demo

alpha bravo
  • Thanks for adding a [demo](http://regex101.com/r/sY5aB5). `[a sadasd] [` is not balanced, am I misunderstanding something? – zx81 Jun 15 '14 at 21:39
  • you're right, updated my pattern, had an issue with the recursion, works now. – alpha bravo Jun 15 '14 at 21:52
  • Btw the lazy `?` does not seem to do anything in `[^\[\]]*?`, IMO you'd be better off with a `+` to make it atomic. – zx81 Jun 15 '14 at 22:02