
I'm going to be working with regular expressions a lot in a new project. I don't have much experience with them, and I was wondering whether there is a good way of converting HTML to a regular expression.

Anybody know of any good tutorials, or perhaps a generator?

At the moment I need to convert this:

<span class="code" id="code" title="DOESNT MATTER">IMPORTANT<img class="scissors" src="DOESNT MATTER" alt="DOESNT MATTER" /></span>

Thanks!

3 Answers

$text = '<span class="code" id="code" title="DOESNT MATTER">IMPORTANT<img class="scissors" src="DOESNT MATTER" alt="DOESNT MATTER" /></span>';
preg_match('|<span class="code" id="code" title="DOESNT MATTER">IMPORTANT<img class="scissors" src="DOESNT MATTER" alt="DOESNT MATTER" /></span>|', $text, $match);

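There's nothing to be "converted" if none of the markup varies (if you're not looking for a particular title, for example): the literal HTML already works as the pattern.

To pick out that IMPORTANT text, wrap it in a capture group: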

$text = '<span class="code" id="code" title="DOESNT MATTER">IMPORTANT<img class="scissors" src="DOESNT MATTER" alt="DOESNT MATTER" /></span>';
preg_match('|<span class="code" id="code" title="DOESNT MATTER">(.*?)<img class="scissors" src="DOESNT MATTER" alt="DOESNT MATTER" /></span>|', $text, $match);
echo $match[1]; //IMPORTANT
genesis

If you just want to get rid of all the HTML around some values, you can use strip_tags().
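
As a minimal sketch of that suggestion, here is strip_tags() applied to the sample markup from the question. The variable names are only for illustration. Note that on a whole page it would return all of the page's text, not just the text inside the matching spans.

```php
<?php
// Sample markup from the question.
$html = '<span class="code" id="code" title="DOESNT MATTER">IMPORTANT<img class="scissors" src="DOESNT MATTER" alt="DOESNT MATTER" /></span>';

// strip_tags() removes every tag and keeps only the text between them.
$text = strip_tags($html);

echo $text; // IMPORTANT
```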

Edit: moved the code from my comment into the answer because it didn't copy/paste well.

<?php
$html = '<span class="code" id="code" title="DOESNT MATTER">IMPORTANT<img class="scissors" src="DOESNT MATTER" alt="DOESNT MATTER" /></span>';
preg_match_all("/<span\s.*?class=\"code\"[^>]+>(.*?)<img\s.*?class=\"scissors\"[^>]+>/i", $html, $matches);
var_dump($matches);
?>

Also, please note that, as mentioned in the comments above, using a regex to parse HTML is considered bad practice. You should be able to load the HTML into an instance of DOMDocument and use the getElementsByTagName method to get all the spans. Then you can loop through those and check the attributes and text inside each one.
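
A rough sketch of that DOMDocument approach might look like the following. The sample markup, the `$values` array, and the choice to filter on the `code` class are assumptions made for illustration; adjust them to the real page.

```php
<?php
// Hypothetical sample: two spans we want and one we do not.
$html = '<div>'
      . '<span class="code" id="code" title="a">FIRST<img class="scissors" src="x.png" alt="x" /></span>'
      . '<span class="other">SKIP</span>'
      . '<span class="code" title="b">SECOND<img class="scissors" src="y.png" alt="y" /></span>'
      . '</div>';

$doc = new DOMDocument();
// Real-world HTML is often malformed; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$values = [];
foreach ($doc->getElementsByTagName('span') as $span) {
    // Split the class attribute so spans with several classes still match.
    $classes = preg_split('/\s+/', trim($span->getAttribute('class')));
    if (in_array('code', $classes, true)) {
        // textContent is the text inside the span; the <img> contributes none.
        $values[] = trim($span->textContent);
    }
}

print_r($values);
```

Unlike the regex, this does not depend on attribute order or spacing inside the tags.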

Jonathan Kuhn
  • Yes, but I have a big file of HTML and I'm looking for multiple of the basically same lines. I need the IMPORTANT and nothing else from each. –  Jul 15 '11 at 14:09
  • So what exactly are you looking for? If you just wanted to match 'IMPORTANT' then `/IMPORTANT/` would do it. What exactly makes 'IMPORTANT' important? Is it a span tag with the class code, followed by some text that you want to capture, followed by an image tag with the class scissors? `preg_match_all("/<span\s.*?class=\"code\"[^>]+>(.*?)<img\s.*?class=\"scissors\"[^>]+>/i", $html, $matches);var_dump($matches);` – Jonathan Kuhn Jul 15 '11 at 14:15
  • Didn't seem to find anything? :/ –  Jul 15 '11 at 14:19
  • .. and yes I'm looking to output the text in the span tag. –  Jul 15 '11 at 14:20
  • This is what I'm using at the moment: `preg_match_all("/<span\s.*?class=\"code\"[^>]+>(.*?)<img\s.*?class=\"scissors\"[^>]+>/i", $printable, $matches); foreach($matches as $match) { echo("$match[1]\n"); }` –  Jul 15 '11 at 14:22
  • It didn't copy paste well. I moved it into the answer. – Jonathan Kuhn Jul 15 '11 at 14:23
  • Seems to be getting there; the only problem is that it's not picking up all the similar spans in the HTML. –  Jul 15 '11 at 14:29
  • It's perfect, I just had to add a "PREG_SET_ORDER". Thank you so much! :) –  Jul 15 '11 at 14:33
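
For anyone reading the comments later, the PREG_SET_ORDER fix can be sketched as below. By default preg_match_all() groups results by capture group ($matches[0] holds every full match, $matches[1] every first capture), so looping over $matches and reading $match[1] does not visit each span. With PREG_SET_ORDER, each element of $matches is one complete match. The two-span sample input and the `$found` array are assumptions for illustration; the pattern is the one from the answer.

```php
<?php
// Hypothetical input with two similar spans.
$printable = '<span class="code" id="a" title="x">FIRST<img class="scissors" src="x" alt="x" /></span>' . "\n"
           . '<span class="code" id="b" title="y">SECOND<img class="scissors" src="y" alt="y" /></span>';

// Pattern from the answer, single-quoted so the double quotes need no escaping.
$pattern = '/<span\s.*?class="code"[^>]+>(.*?)<img\s.*?class="scissors"[^>]+>/i';

// PREG_SET_ORDER: $matches[0] is the first match, $matches[1] the second, and so on.
preg_match_all($pattern, $printable, $matches, PREG_SET_ORDER);

$found = [];
foreach ($matches as $match) {
    // $match[0] is the full match; $match[1] is the captured text.
    $found[] = $match[1];
}

print_r($found);
```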

It's worth noting that Regular Expressions are not a great solution for parsing HTML. I think they are fine if you have a small chunk of HTML with a guaranteed format, though.

Please see the following great Stack Overflow thread:

RegEx match open tags except XHTML self-contained tags

Charles Burns