I am trying to parse a badly formed HTML table. A couple of lines of it are:

  Food:</b> Yes<b><br>
  Pool: </b>Beach<b></b><b><br>
  Centre:</b> Yes<b><br>

After spending a lot of time on this with XPath, I think it is probably better to split the above text into lines using preg_split and parse from there.

The pattern I think would work uses:

<\b><\br>*: <\b>

my code is as follows:

$pattern='</b></br>*:</b>';           
$pattern=preg_quote($pattern,'#');
$chars = preg_split($pattern, $output);
print_r($chars);

I am getting the following error:

Delimiter must not be alphanumeric or backslash

What am I doing wrong?

Richard JP Le Guen
user1592380
  • Regular expressions cannot properly handle HTML in general, and while in some cases you can make assumptions that will allow regex to handle a specific HTML string, it is [strongly recommended against](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – KRyan Sep 17 '12 at 18:00
  • Just note that in the "pattern I think would work" you are using a backslash, which must be escaped with another backslash when used in a regular expression. – Alex W Sep 17 '12 at 18:25

2 Answers

1

Try this:

$pattern='</b></br>*:</b>';           
$pattern=preg_quote($pattern,'#');
$chars = preg_split('#'.$pattern.'#', $output);
print_r($chars);

The preg_quote function only escapes the pattern so that it is safe to embed in a regex; it doesn't add the delimiters for you.
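To see what `preg_quote()` actually returns, here is a minimal sketch (the variable names `$raw`, `$quoted` and `$regex` are only illustrative). Note that it also escapes the `*`, so the quoted pattern matches a literal asterisk rather than acting as a quantifier:

```php
<?php
// Illustrative only: inspect what preg_quote() produces for the pattern above.
$raw = '</b></br>*:</b>';

// The second argument asks preg_quote() to escape '#' as well, so that '#'
// can safely be used as the delimiter afterwards.
$quoted = preg_quote($raw, '#');
echo $quoted, "\n";

// The delimiters still have to be added by hand.
$regex = '#' . $quoted . '#';
echo $regex, "\n";

// Because '*' was escaped, the regex now only matches the literal text.
var_dump(preg_match($regex, $raw));
```

So the quoted version fixes the delimiter error, but it can only ever find the exact characters of the original string.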

As other people will surely point out, using regular expressions is not a good way to parse HTML :)

Your regular expression is also not going to match what you hope. Here's a version that will probably work for your input:

$in = " Pool: </b>Beach<b></b><b><br>";
$out = explode(':', strip_tags($in));
$key = trim($out[0]);
$value = trim($out[1]);
echo "$key = $value\n";

This removes all the HTML, then splits on the colon, and then removes any surrounding whitespace.
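The same idea can be applied to several lines at once. The following is only a sketch, assuming every line has the `Key: value` shape shown in the question and that lines are separated by `<br>`; `$html` and `$rows` are illustrative names, and an irregular table may need more handling than this:

```php
<?php
// Sample input copied from the question; $html is an illustrative name.
$html = "Food:</b> Yes<b><br>\nPool: </b>Beach<b></b><b><br>\nCentre:</b> Yes<b><br>";

$rows = [];
// Split on <br> (also tolerating <br/> and <br />), one chunk per table line.
foreach (preg_split('#<br\s*/?>#i', $html) as $chunk) {
    $text = trim(strip_tags($chunk));
    // Skip empty chunks and anything without a "key: value" shape.
    if ($text === '' || strpos($text, ':') === false) {
        continue;
    }
    // A limit of 2 keeps any further colons inside the value.
    list($key, $value) = explode(':', $text, 2);
    $rows[trim($key)] = trim($value);
}
print_r($rows);
```

Passing a limit of 2 to `explode()` is a small design choice: a value such as a time (`10:30`) would otherwise be cut at its own colon.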

Cal
  • Just modified the explode to use `':'` instead of `'!:!'` - the latter was left over cruft in the translation from `preg_quote()` – Cal Sep 18 '12 at 16:51
  • Thanks Cal. Also, I just wanted to let you know that I ended up using a combination of XPath and http://www.bitrepository.com/web-programming/php/extracting-content-between-two-delimiters.html to solve this. Thank you - Bill – user1592380 Sep 18 '12 at 18:03
0

Your pattern needs to start and end with a delimiter; looks like you're using # if I'm reading this correctly, so you should have $pattern = '#</b></br>.*:</b>#';.

Also, you're mixing things up; * is not a simple wildcard in regex. If you mean "any number of any characters," the pattern you need is .*. I've included this above.
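A small sketch of the difference, using made-up test strings (`$withStar` and `$withDotStar` are illustrative names): `*` on its own repeats whatever token precedes it, while `.*` matches any run of characters other than newlines.

```php
<?php
$subject = '<b>Pool:';

// '>*' means "zero or more '>' characters", so after '<b' and '>' the
// regex expects ':' immediately; 'Pool' in between makes it fail.
$withStar = preg_match('#<b>*:#', $subject);

// '.*' means "zero or more of any character except newline", so it can
// consume 'Pool' before the ':'.
$withDotStar = preg_match('#<b>.*:#', $subject);

var_dump($withStar, $withDotStar);
```

Keep in mind that `.*` is greedy; if a line can contain more than one colon, the lazy form `.*?` stops at the first one instead of the last.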

KRyan
  • Guys, thanks for your help. Cal, you were right that the regex would not work correctly. Unfortunately your approach doesn't work either because of the irregular nature of the table's text. I'm going back to XPath to try a different tack. BTW, what does '!:!' in explode above do? I used ':' instead - Bill – user1592380 Sep 17 '12 at 19:03
  • @user61629: You commented on my answer, not Cal's. You need to comment on his so he'll get a little note saying that you did. – KRyan Sep 18 '12 at 06:11
  • Sorry, I thought you put all comments at the end, I've done it now. – user1592380 Sep 18 '12 at 18:11