With all the disclaimers about using regex to parse HTML... And only if you're ready for some recursive beauty...
Matching What You Want vs. Splitting on What You Don't Want
If you're going to use regex, in this case, to get your array, matching what you do want will be easier than splitting on what you don't want. Here is a starting place, which we can refine:
(\[(?:[^[\]]++|(?1))*\])|<[^>]*>|'[^']*'|[!-~]+
See demo.
How it works:
- We match several possibilities, separated by the alternation operator |
- The first match option (\[(?:[^[\]]++|(?1))*\]) recursively matches all [sets of [brackets]]
- The <[^>]*> matches <tags>
- The '[^']*' matches 'complete quotes'. If needed, it could be improved to account for potential escaped quotes \'
- The [!-~]+ matches any non-space printable characters that remain. It is a guess, based on the lone word asqwedasd in your input, and that too could be refined. For instance, if you want to specify, for validation purposes, that the leftover strings have no <>[] characters, you can use this instead (suggested by @CasimiretHippolyte): \s*\K[^[<]+(?<!\s)
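As a rough sketch of the escaped-quote refinement mentioned above, the quote branch can be swapped for '(?:[^'\\]|\\.)*', which steps over any backslash-escaped character instead of stopping at it. The input string below is made up for illustration, and the nowdoc wrapper is just my way of keeping the backslashes intact on their way to PCRE:

```php
<?php
// Same alternation as the starting regex, with only the quote branch changed
// so that a backslash-escaped character (such as \') does not end the token.
$regex = <<<'REGEX'
%(\[(?:[^[\]]++|(?1))*\])|<[^>]*>|'(?:[^'\\]|\\.)*'|[!-~]+%
REGEX;

// Hypothetical input; the PHP literal "\\'" produces the two characters \'
$string = "word 'it\\'s ok' [a [b]]";

preg_match_all($regex, $string, $m);
print_r($m[0]);
```

Here the quoted token comes back in one piece, escaped quote included, alongside the lone word and the nested brackets.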
Sample code
See the output of this demo. The array $m[0] contains the "splits" you wanted.
$regex = "%(\[(?:[^[\]]++|(?1))*\])|<[^>]*>|'[^']*'|[!-~]+%";
$string = "<As's\\as'dsd> asqwedasd <sa sdasd> [a sadasd] [<asdsad> [as ddsd]] 'asdsad assd'";
$count = preg_match_all($regex,$string,$m);
print_r($m[0]);
Another Solution
@HamZa came up with another solution which I find quite beautiful. He didn't want to post it himself, but was happy for me to add it here for completeness.
How does it work? The idea is to match the right space characters, and to split on them. The base principle for this is explained in detail in this question about "regex-matching a pattern unless...". First, in a similar fashion to my regex (but with more checks and recursion), he defines all the groups we want to match, and matches them. Then, he uses (*SKIP)(*F) to make the regex fail if these groups are matched, after which the engine skips to the position in the string that follows the last character that was matched. On the other side of the alternation, he matches the space characters we will split on, and we know these are the right space characters because they were not matched by the expression on the left. At this stage, we can use preg_split.
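To make the skip-then-split idea concrete, here is a minimal sketch that folds my simpler branches into a single (*SKIP)(*F) alternation. It is not HamZa's regex, only a cut-down illustration, and the input string is a shortened, hypothetical one:

```php
<?php
// Left of the alternation: the tokens that must stay in one piece.
// (*SKIP)(*F) discards them and resumes matching right after them.
// Right of the alternation: the spaces that are left over, which we split on.
$regex = "~(?:(\[(?:[^][]++|(?1))*\])|<[^<>]*>|'[^']*')(*SKIP)(*F)|[ ]+~";

// Hypothetical input, shorter than the one in the question
$string = "<a b> word [x [y z]] 'q r'";

$parts = preg_split($regex, $string);
print_r($parts);
```

The spaces inside <a b>, [x [y z]] and 'q r' never reach the right-hand branch, so only the three spaces between tokens act as split points.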
A further refinement is the use of what I call the HRRT, which stands for the HamZa Regex Refactoring Technique. To make the regex digestible, he breaks it down into smaller named patterns: singlequotes, brackets and so on. This lets him define another name, skippable, for all these groups. After the definitions, the matching begins. If we can match the skippable pattern, the regex fails with (*SKIP)(*F) and the engine skips to the next position in the string.
That is the gist of it.
Here's the demo.
(?(DEFINE)
(?P<signs>
<
(?:
[^<>]
|
(?&signs)
)*
>
)
(?P<brackets>
\[
(?:
[^][]
|
(?&brackets)
)*
\]
)
(?P<singlequotes>
(?<!\\)'(?:[^\\]|\\.)*?'
)
(?P<doublequotes>
(?<!\\)"(?:[^\\]|\\.)*?"
)
(?P<quotes>
(?&singlequotes)|(?&doublequotes)
)
(?P<skippable>
(?&brackets)|(?&signs)|(?&quotes)
)
)
(?&skippable)(*SKIP)(*FAIL)
|
[ ]+
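Because the pattern is laid out in free-spacing style, it needs the x modifier when handed to preg_split. Below is a sketch of one way to wire it up, with the definitions condensed to one per line. The ~ delimiter and the nowdoc wrapper are my own choices rather than part of HamZa's solution, and the input is the string from the sample code above:

```php
<?php
// The definitions are only declared inside (?(DEFINE)...); nothing is matched
// until the two branches at the bottom. The x flag ignores layout whitespace,
// but the space inside [ ] is still a literal space.
$regex = <<<'REGEX'
~
(?(DEFINE)
  (?P<signs>        < (?: [^<>] | (?&signs) )* > )
  (?P<brackets>     \[ (?: [^][] | (?&brackets) )* \] )
  (?P<singlequotes> (?<!\\)'(?:[^\\]|\\.)*?' )
  (?P<doublequotes> (?<!\\)"(?:[^\\]|\\.)*?" )
  (?P<quotes>       (?&singlequotes) | (?&doublequotes) )
  (?P<skippable>    (?&brackets) | (?&signs) | (?&quotes) )
)
(?&skippable)(*SKIP)(*FAIL)
|
[ ]+
~x
REGEX;

$string = "<As's\\as'dsd> asqwedasd <sa sdasd> [a sadasd] [<asdsad> [as ddsd]] 'asdsad assd'";

$parts = preg_split($regex, $string);
print_r($parts);
```

On this input the split yields the same six pieces as the preg_match_all approach.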