I have a string like this:

$string = "<strong>Blabla1</strong> Blaabla2<br /> Blaabla3 <strong>Blaabla4</strong> Blaabla5 Blaabla6<br /><br /> Blaabla7 <span style='color:#B22222;'>Blaabla8</span> Blaabla9";

I'm trying to split it into words at every space or <br />, using preg_split.

My conditions:

Each word (BlablaX) must keep its own tags, such as <strong>, <span>, <em>..., and the split must come after a <br /> or after a run of several <br /> tags.

I tried this, based on another post on Stack Overflow:

preg_split('/<br(\s\/)?>\K|\s/',$string,null,PREG_SPLIT_NO_EMPTY);

OUTPUT:

array (size=12)
  0 => string '<strong>Blabla1</strong>' (length=24)
  1 => string 'Blaabla2<br />' (length=14)
  2 => string 'Blaabla3' (length=8)
  3 => string '<strong>Blaabla4</strong>' (length=25)
  4 => string 'Blaabla5' (length=8)
  5 => string 'Blaabla6<br />' (length=14)
  6 => string '<br' (length=3)
  7 => string '/>' (length=2)
  8 => string 'Blaabla7' (length=8)
  9 => string '<span' (length=5)
  10 => string 'style='color:#B22222;'>Blaabla8</span>' (length=38)
  11 => string 'Blaabla9' (length=8)

Everything works except for indexes 6 and 7, and indexes 9 and 10 (see OUTPUT above).

What I expect:

array (size=9)
  0 => string '<strong>Blabla1</strong>' (length=24)
  1 => string 'Blaabla2<br />' (length=14)
  2 => string 'Blaabla3' (length=8)
  3 => string '<strong>Blaabla4</strong>' (length=25)
  4 => string 'Blaabla5' (length=8)
  5 => string 'Blaabla6<br /><br />' (length=20)
  6 => string 'Blaabla7' (length=8)
  7 => string '<span style='color:#B22222;'>Blaabla8</span>' (length=44)
  8 => string 'Blaabla9' (length=8)

Note indexes 5 and 7.

My regex works when there is a single <br />, but it fails when there are several in a row. The same happens with a tag that has attributes, such as <span style...>.

Thanks!

Zagloo
  • try `preg_split('/
    \K|\s/g',$string,null,PREG_SPLIT_NO_EMPTY); `
    – Noman Apr 16 '15 at 10:28
  • Why is there an extra `<br />` at index 1? It is not present in the input.
    – nhahtdh Apr 16 '15 at 10:37
  • @nhahtdh My bad, I edited my post: there is no extra `<br />` at index 1.
    – Zagloo Apr 16 '15 at 10:39
  • Looking at your requirement for index 7, I think you are better off using a PHP parser. http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php – nhahtdh Apr 16 '15 at 10:43

3 Answers

$string = "<strong>Blabla1</strong> Blaabla2<br /> Blaabla3 <strong>Blaabla4</strong> Blaabla5 Blaabla6<br /><br /> Blaabla7 <span style='color:#B22222;'>Blaabla8</span> Blaabla9";

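// Split right after a run of <br ...> / <span ...> matches (\K keeps them in
// the preceding piece), or on any single whitespace character.
// Note: with the s flag, the greedy <span.*> runs to the last ">" in the
// string; it works here because </span> is the last tag in the input.
$matches = preg_split('/(<br.*?>|<span.*>)+\K|\s/sim', $string, -1, PREG_SPLIT_NO_EMPTY);

var_dump($matches);
/*
array(9) {
  [0]=>
  string(24) "<strong>Blabla1</strong>"
  [1]=>
  string(14) "Blaabla2<br />"
  [2]=>
  string(8) "Blaabla3"
  [3]=>
  string(25) "<strong>Blaabla4</strong>"
  [4]=>
  string(8) "Blaabla5"
  [5]=>
  string(20) "Blaabla6<br /><br />"
  [6]=>
  string(8) "Blaabla7"
  [7]=>
  string(44) "<span style='color:#B22222;'>Blaabla8</span>"
  [8]=>
  string(8) "Blaabla9"
}
*/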

DEMO

Pedro Lobito

Looking at your expected array at index 5 and index 7, you probably want this regex:

preg_split('~(?:</?[a-zA-Z0-9][^>]*+>|\S)++\K|\s~', $string, -1, PREG_SPLIT_NO_EMPTY);

Demo on ideone

Output:

array(9) {
  [0]=>
  string(24) "<strong>Blabla1</strong>"
  [1]=>
  string(14) "Blaabla2<br />"
  [2]=>
  string(8) "Blaabla3"
  [3]=>
  string(25) "<strong>Blaabla4</strong>"
  [4]=>
  string(8) "Blaabla5"
  [5]=>
  string(20) "Blaabla6<br /><br />"
  [6]=>
  string(8) "Blaabla7"
  [7]=>
  string(44) "<span style='color:#B22222;'>Blaabla8</span>"
  [8]=>
  string(8) "Blaabla9"
}

The regex first tries to match a full tag; if no full tag can be consumed at the current position, it consumes one non-space character instead, then repeats. This prevents tags from being split, which gives the expected output for indexes 5 and 7.
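
As a rough illustration of that matching order, here is a minimal sketch that wraps the same pattern in a helper and runs it on a shorter input. The function name `splitKeepingTags` and the sample string are made up for this example, and the limit is passed as -1 (no limit).

```php
<?php
// Hypothetical helper name; the pattern is the one from this answer.
function splitKeepingTags(string $html): array
{
    // First alternative: repeatedly consume either a whole tag or a single
    // non-space character; \K then drops the consumed text from the match,
    // leaving an empty match right after the chunk.
    // Second alternative: a single whitespace character acts as separator.
    $pattern = '~(?:</?[a-zA-Z0-9][^>]*+>|\S)++\K|\s~';

    return preg_split($pattern, $html, -1, PREG_SPLIT_NO_EMPTY);
}

// Sample input (an assumption for this sketch, not from the question).
$parts = splitKeepingTags("a<br /><br /> <em class='x'>b</em> c");
print_r($parts);
```

Only the tags themselves are protected from splitting: whitespace in the text between an opening and a closing tag, as in `<strong>two words</strong>`, still acts as a separator.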

I wouldn't recommend doing this with regex, though. I didn't consult the HTML specs when writing the regex, so the regex is very brittle and may break on input in the wild. You might want to learn how to parse HTML properly with one of the libraries listed in this question: How do you parse and process HTML/XML in PHP?

nhahtdh

Here is the regex:

((?:<br\s*\/?>)+)|(?<!<br)\s+(?!\/?>)

Use it with preg_replace, with $1\n as the replacement string; you can then split the result on newlines to get the array (removing empty entries).

See demo.
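
The replace-then-split idea could be sketched roughly as follows. The function name `splitViaNewlines` is invented for this example, and the sketch assumes the input has no tags with attributes: whitespace inside something like `<span style=...>` is not followed by `>` or `/>`, so it still matches the second alternative and the tag gets split.

```php
<?php
// Hypothetical helper name; the pattern is the one given in this answer.
function splitViaNewlines(string $html): array
{
    $pattern = '/((?:<br\s*\/?>)+)|(?<!<br)\s+(?!\/?>)/';

    // A run of <br> tags is kept (group 1) and followed by a newline;
    // any other whitespace run is replaced by a newline alone, because
    // group 1 is empty when the second alternative matches.
    $marked = preg_replace($pattern, '$1' . "\n", $html);

    // Split on the newlines and drop the empty pieces.
    return preg_split('/\n/', $marked, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(splitViaNewlines("Blaabla6<br /><br /> Blaabla7 Blaabla9"));
```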

Wiktor Stribiżew