2

How would I use PHP's preg_replace() to return only the value inside the <h1> in the following string (it's HTML text loaded in a variable called $html):

<h1>I'm Header</h1>

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque tincidunt porttitor magna, quis molestie augue sagittis quis.</p>

<p>Pellentesque tincidunt porttitor magna, quis molestie augue sagittis quis. Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>

I've tried this: preg_replace('#<h1>([.*])</h1>.*#', '$1', $html), but to no avail. Am I regex-ing this correctly? And is there a better PHP function that I should be using instead of preg_replace?

Sam
  • 20
    Umm...just a bit of sidebar topic here: as I was typing this post (most of the way through), a weird unicorn graphic showed up on the right side of the page and, like MS Clippy, asked me if I wanted help parsing XML, and then sent me here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 When I came back to my post to take a screenshot of the unicorn, it was gone. Somebody please tell me that wasn't a hallucination. Somebody? Anybody? Hello? – Sam Apr 02 '12 at 02:11
  • 6
    that wasn't a hallucination, and generally you shouldn't parse HTML with regexes – zerkms Apr 02 '12 at 02:12
  • No hallucination. That's good. So... what was it??? – Sam Apr 02 '12 at 02:18
  • it was a link to a thread that explains that generally you shouldn't parse HTML using regular expressions – zerkms Apr 02 '12 at 02:22
  • Haha! Okay... but why the unicorn??? – Sam Apr 02 '12 at 02:25
  • 3
    @Sam You must be new here... :-3 – deceze Apr 02 '12 at 03:45
  • 1
    Maybe check what date you saw the unicorn? ;) – Peter Apr 02 '12 at 07:58
  • 1
    [I want this unicorn as a regular feature](http://meta.stackexchange.com/questions/127823/please-keep-the-aprils-1st-unicorn-for-parse-html-using-regex-questions) – stema Apr 02 '12 at 09:04
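
On the question of a better function than preg_replace: the comments above suggest not parsing HTML with regexes at all. A minimal sketch of that alternative, assuming PHP's bundled DOM extension is available, might look like the following. The helper name `first_h1_text` and the shortened sample string are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical helper: returns the text of the first <h1>, or null if there is none.
function first_h1_text(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them for imperfect HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // item(0) is the first <h1> in document order, or null when none exists.
    $h1 = $doc->getElementsByTagName('h1')->item(0);
    return $h1 === null ? null : $h1->textContent;
}

// Shortened stand-in for the $html variable from the question.
$html = "<h1>I'm Header</h1>\n\n<p>Lorem ipsum dolor sit amet.</p>";
echo first_h1_text($html), "\n";
```

Unlike a pattern, this keeps working if the `<h1>` gains attributes or contains nested tags, at the cost of loading the whole document.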

2 Answers

4

([.*]) is a character class: it matches a single character that is either a literal dot or a literal asterisk

What you need is (.*?), which matches any number of any characters, non-greedily

or

([^<]*), which matches any number of characters other than <
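
To see the difference between the three patterns, here is a small sketch run against a shortened stand-in for the question's `$html` (the sample string and the variable names `$lazy` and `$negated` are assumptions for illustration):

```php
<?php
// Shortened stand-in for the $html variable from the question.
$html = "<h1>I'm Header</h1>\n\n<p>Lorem ipsum dolor sit amet.</p>";

// [.*] matches exactly ONE character that is a literal '.' or '*', so this finds nothing.
$classResult = preg_match('#<h1>([.*])</h1>#', $html);
var_dump($classResult); // int(0)

// (.*?) is lazy: it stops at the first </h1> it can reach.
preg_match('#<h1>(.*?)</h1>#', $html, $lazy);
echo $lazy[1], "\n"; // I'm Header

// ([^<]*) matches a run of characters that are not '<', so it cannot cross a tag.
preg_match('#<h1>([^<]*)</h1>#', $html, $negated);
echo $negated[1], "\n"; // I'm Header
```

Note that the `([^<]*)` form stops working as soon as the header contains a nested tag such as `<em>`, while `(.*?)` still matches in that case.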

zerkms
4

Here is how you do it using preg_replace:

$header = preg_replace('/<h1>(.*)<\/h1>.*/iU', '$1', $html);

You can also use preg_match:

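$matches = array();
// The closing tag's slash must be escaped because / is the pattern delimiter.
// After a successful match, $matches[1] holds the text between the tags.
preg_match('/<h1>(.*)<\/h1>.*/iU', $html, $matches);
print_r($matches);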
Tamik Soziev
  • `.*` in the end is harmful - it would cut all the other text off – zerkms Apr 02 '12 at 02:13
  • but he wants to get only the h1 tag contents...he does not care about the rest. – Tamik Soziev Apr 02 '12 at 02:15
  • Oh, yes. It wouldn't destroy the data, but still pointless ;-) – zerkms Apr 02 '12 at 02:21
  • @TamikSoziev This is close. When I echo($header) I get the contents of the <h1> successfully, but the rest of the HTML is there too. I just want to extract the guts of the <h1>. – Sam Apr 02 '12 at 02:22
  • WooHoo!!! preg_match() did it! In an effort to make me a better regexer, can you explain the "iU" part? Many thanks! – Sam Apr 02 '12 at 22:21
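
On the "iU" part asked about above: these are PCRE pattern modifiers. `i` makes the match case-insensitive, and `U` inverts greediness so that `.*` becomes lazy by default. A minimal sketch of their effect, using a hypothetical two-header input chosen to make the difference visible:

```php
<?php
// Hypothetical input with two headers on one line, so greediness is observable.
$html = '<H1>First</H1><h1>Second</h1>';

// No modifiers: case-sensitive, so <H1> is skipped and only the lowercase pair matches.
preg_match('/<h1>(.*)<\/h1>/', $html, $plain);
echo $plain[1], "\n"; // Second

// i only: case-insensitive but still greedy, so .* runs to the LAST closing tag.
preg_match('/<h1>(.*)<\/h1>/i', $html, $greedy);
echo $greedy[1], "\n"; // First</H1><h1>Second

// iU: U makes .* lazy, so it stops at the FIRST closing tag.
preg_match('/<h1>(.*)<\/h1>/iU', $html, $ungreedy);
echo $ungreedy[1], "\n"; // First
```

Writing `(.*?)` without `U` has the same effect as `(.*)` with `U`. Combining the two flips the quantifier back to greedy, so it is usually clearer to pick one style.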