How to use PHP preg_split with an HTML string

Question

I am trying to parse a badly formed html table:

A couple of lines of this are:

  Food:</b> Yes<b><br>
  Pool: </b>Beach<b></b><b><br>
  Centre:</b> Yes<b><br>

After spending a lot of time on this with XPath, I think it is probably better to split the above text into lines using preg_split and parse from there.

The pattern I think would work uses:

<\b><\br>*: <\b>

my code is as follows:

$pattern='</b></br>*:</b>';           
$pattern=preg_quote($pattern,'#');
$chars = preg_split($pattern, $output);
print_r($chars);

I am getting the following error:

Delimiter must not be alphanumeric or backslash

What am I doing wrong?

Regular expressions cannot properly handle HTML in general, and while in some cases you can make assumptions that will allow regex to handle a specific HTML string, it is [strongly recommended against](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). — KRyan, Sep 17 '12 at 18:00
Just note that in the "pattern I think would work" you are using a backslash, which must be escaped with another backslash when used in a regular expression. — Alex W, Sep 17 '12 at 18:25

Cal · Accepted Answer · 2012-09-18T16:50:27.317

Try this:

$pattern='</b></br>*:</b>';           
$pattern=preg_quote($pattern,'#');
$chars = preg_split('#'.$pattern.'#', $output);
print_r($chars);

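The preg_quote function only escapes the characters that are special in a regular expression; it doesn't add the delimiters for you. Because it escapes the leading <, your quoted pattern starts with a backslash, and preg_split treats that first character as the delimiter, which is what triggers the error.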
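As a minimal sketch of that point, the snippet below quotes a literal separator and then wraps it in # delimiters by hand. The separator and the sample string are assumptions chosen for illustration (they are not the pattern from the question), so adjust them to the real markup.

```php
<?php
// Hypothetical literal separator and sample input, for illustration only.
$separator = '<b><br>';
$output = 'Food:</b> Yes<b><br>Centre:</b> Yes<b><br>';

// preg_quote() escapes regex metacharacters, plus '#' because it is
// passed as the delimiter. It does not wrap the pattern in delimiters.
$quoted = preg_quote($separator, '#');

// The delimiters are added manually around the quoted text.
// PREG_SPLIT_NO_EMPTY drops the empty piece after the final separator.
$parts = preg_split('#' . $quoted . '#', $output, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

Passing $quoted to preg_split without the surrounding # characters would reproduce the delimiter error from the question.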

As other people will surely point out, using regular expressions is not a good way to parse HTML :)

Your regular expression is also not going to match what you hope. Here's a version that will probably work for your input:

$in = " Pool: </b>Beach<b></b><b><br>";
$out = explode(':', strip_tags($in));
$key = trim($out[0]);
$value = trim($out[1]);
echo "$key = $value\n";

This removes all the HTML, then splits on the colon, and then removes any surrounding whitespace.
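The same idea can be stretched over several rows. This is only a sketch: it assumes every row ends with a <br> tag and that the first colon in a row separates the key from the value, which may not hold for a more irregular table. The $html string is the three sample lines from the question joined together, and $rows is a name introduced here.

```php
<?php
// Sample rows from the question, joined into one string (assumed layout).
$html = "Food:</b> Yes<b><br>\nPool: </b>Beach<b></b><b><br>\nCentre:</b> Yes<b><br>";

$rows = [];
// Split on <br> tags, case-insensitively, allowing an optional "/".
foreach (preg_split('#<br\s*/?>#i', $html) as $line) {
    $text = trim(strip_tags($line));
    // Skip empty fragments and anything without a key/value separator.
    if ($text === '' || strpos($text, ':') === false) {
        continue;
    }
    // A limit of 2 keeps any later colons inside the value.
    list($key, $value) = explode(':', $text, 2);
    $rows[trim($key)] = trim($value);
}

print_r($rows);
```

Here the regular expression only locates the row breaks; strip_tags and explode still do the actual extraction.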

edited Sep 18 '12 at 16:50

answered Sep 17 '12 at 18:01

Cal


Just modified the explode to use `':'` instead of `'!:!'` - the latter was left over cruft in the translation from `preg_quote()` – Cal Sep 18 '12 at 16:51
Thanks Cal, Also just wanted to let you know that I ended up using a combination of xpath and http://www.bitrepository.com/web-programming/php/extracting-content-between-two-delimiters.html to solve this. Thank you - Bill – user1592380 Sep 18 '12 at 18:03

score 0 · Answer 2 · answered Sep 17 '12 at 18:02

Your pattern needs to start and end with a delimiter; looks like you're using # if I'm reading this correctly, so you should have $pattern = '#</b></br>.*:</b>#';.

Also, you're mixing things up; * is not a simple wildcard in regex. If you mean "any number of any characters," the pattern you need is .*. I've included this above.
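A small sketch of the difference, using made-up test strings rather than the HTML from the question: on its own, * repeats only the element immediately before it, while .* matches any run of characters (excluding newlines by default).

```php
<?php
// '*' applies to the preceding 'b' only: "a", zero or more "b", then "c".
$starOnB = preg_match('#ab*c#', 'abbbc');      // matches
$starNoWildcard = preg_match('#ab*c#', 'aXYZc'); // no match: XYZ are not b's

// '.*' means "any number of any characters" between "a" and "c".
$dotStar = preg_match('#a.*c#', 'aXYZc');      // matches

var_dump($starOnB, $starNoWildcard, $dotStar);
```

Note that .* is greedy, so on a long string it will run to the last possible match; .*? is the non-greedy form.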

answered Sep 17 '12 at 18:02

KRyan


Guys, thanks for your help. Cal, you were right that the regex would not work correctly. Unfortunately your approach doesn't work either because of the irregular nature of the table's text. I'm going back to xpath to try a different tack. BTW what does '!:!' in explode above do? I used ':' instead - Bill – user1592380 Sep 17 '12 at 19:03
@user61629: You commented on my answer, not Cal's. You need to comment on his so he'll get a little note saying that you did. – KRyan Sep 18 '12 at 06:11
Sorry, I thought you put all comments at the end, I've done it now. – user1592380 Sep 18 '12 at 18:11
