How to use Regex for Static HTML code (PHP)

Question

I am new to regular expressions, and I am just not getting the hang of them yet.

I have grabbed HTML content from a given webpage using cURL and PHP. This webpage never changes its structure. The results on the page depend on a search function, but the HTML tags are always the same. I need to grab the resulting data from the page depending on what search terms were entered.

The data I need is:

<h1 class="location_only">(555) 555-5555 is a Landline</h1>

So I need to grab whatever is in between

<h1 class="location_only"> and </h1>

If I have $data, which is the resulting HTML, how do I put that into a regular expression and echo the data I find as $result?

Can you provide an example or snippet of the html code you're trying to extract from? — Darragh Enright, Apr 23 '12 at 16:10
answered millions times where ... parse the html as xml and take it from there ... don't use regex — scibuff, Apr 23 '12 at 16:14
possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — gpojd, Apr 23 '12 at 16:23
Why did it delete my code? I put an example in the original question???? — user1351759, Apr 23 '12 at 16:27
@user1351759: There's a particular syntax for code. An edit should turn up soon. — Andrew Leach, Apr 23 '12 at 16:31
If you're dealing with a HTML document, use a HTML parser. If you just have a string with few `<` and `>` then RegEx is fine. — Salman A, Apr 23 '12 at 17:16

score 2 · Answer 1 · answered Apr 23 '12 at 16:16


Please do not use regular expressions to parse HTML.

Please use an HTML parser, such as Simple HTML DOM Parser. Your problem may seem localized, but it is not. Even if it were, problems of this type tend to grow in scope later, and that will cause you a massive headache even if you do get it working with regular expressions.


Jeff Lambert


Well I have Simple HTML DOM Parser, but I really don't know how to use it in this application. If I did, I would. Can you direct me to a tutorial on that or should I start a new question? – user1351759 Apr 23 '12 at 16:25

score 1 · Answer 2 · answered Apr 23 '12 at 16:20

You can select text between tags with this search pattern:

<span id="result1">(.*?)</span>

The capture group returns "(555) 555-5555 is a Landline" if your code is: <span id="result1">(555) 555-5555 is a Landline</span>.

See preg_match() for how to retrieve the captured text and echo the result.
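For illustration, here is a minimal sketch of how that pattern could be passed to preg_match(). The sample string in $data is an assumption standing in for the HTML fetched with cURL, and the `~` delimiter and `s` modifier are my own choices, not part of the pattern above. For the markup in the question, the span tags would be swapped for the h1 tags.

```php
<?php
// Assumed sample input; in practice $data would hold the HTML fetched with cURL.
$data = '<span id="result1">(555) 555-5555 is a Landline</span>';

// Non-greedy capture of everything between the opening and closing tag.
// '~' is the delimiter, so the '/' in '</span>' needs no escaping.
// The 's' modifier lets '.' match newlines in case the content spans lines.
if (preg_match('~<span id="result1">(.*?)</span>~s', $data, $matches)) {
    $result = $matches[1]; // index 0 is the whole match, index 1 the first group
    echo $result, "\n";    // prints: (555) 555-5555 is a Landline
} else {
    echo "No match found\n";
}
```

Note that this only works while the opening tag appears exactly as written; an extra attribute or different quoting would make the match fail.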

Also look into an HTML DOM parser, as suggested by others. Maybe I shouldn't have answered at all...

edited Apr 23 '12 at 16:26

answered Apr 23 '12 at 16:20

ZZ-bb


score 0 · Answer 3 · answered Apr 23 '12 at 16:15

You can't reliably extract information from HTML with a regex. You can, however, use an HTML parser, like DOMDocument::loadHTML. This will take your HTML from a string, and then you can use methods like getElementById or getElementsByTagName to find your values. There are other HTML parsers out there as well.
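As a rough sketch of that approach, assuming the markup from the question: the code below loads a string with DOMDocument::loadHTML, walks the h1 elements with getElementsByTagName, and picks the one whose class is location_only. The sample string in $data is an assumption; the real input would be the cURL response.

```php
<?php
// Assumed sample input standing in for the page fetched with cURL.
$data = '<html><body><h1 class="location_only">(555) 555-5555 is a Landline</h1></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc->loadHTML($data);
libxml_clear_errors();

$result = null;
// getElementsByTagName() returns a DOMNodeList that can be iterated directly.
foreach ($doc->getElementsByTagName('h1') as $h1) {
    if ($h1->getAttribute('class') === 'location_only') {
        $result = trim($h1->textContent);
        break; // stop at the first matching heading
    }
}

echo $result ?? 'Not found', "\n"; // prints: (555) 555-5555 is a Landline
```

The exact string comparison on the class attribute is a simplification: it would miss an element carrying several classes.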

score 0 · Answer 4 · answered Apr 23 '12 at 16:17

Both answers telling you not to use regex and to use a DOM parser instead are correct. However, if the structure of the page doesn't change, a quick and dirty regex will do the trick just fine, provided you have well-defined start and end points to anchor on.
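One way to sketch the quick-and-dirty route is a small helper that takes the start and end markers as literal strings. The function name extract_between() is hypothetical, introduced here only for illustration, and the sample $data is an assumption. preg_quote() escapes any regex metacharacters in the markers so they are matched literally.

```php
<?php
// Hypothetical helper: returns the text between two literal markers, or null.
function extract_between(string $haystack, string $start, string $end): ?string
{
    // preg_quote() escapes regex metacharacters and the '~' delimiter in each marker.
    $pattern = '~' . preg_quote($start, '~') . '(.*?)' . preg_quote($end, '~') . '~s';

    return preg_match($pattern, $haystack, $m) ? $m[1] : null;
}

// Assumed sample input standing in for the page fetched with cURL.
$data = '<body><h1 class="location_only">(555) 555-5555 is a Landline</h1></body>';

$result = extract_between($data, '<h1 class="location_only">', '</h1>');
echo $result ?? 'Not found', "\n"; // prints: (555) 555-5555 is a Landline
```

This relies on the page emitting the opening tag byte for byte as given, which is exactly the fragility the other answers warn about.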

score 0 · Answer 5 · answered Apr 23 '12 at 17:08

You've been cautioned enough not to use regex to parse HTML, so here is DOM-parser-based code to extract your value:

$html = <<< EOF
<html>
<head>
<title>Some Title</title>
</head>
<body>
<H1 class="location_only">(555) 555-5555 is a Landline</H1>
</body>
</html>
EOF;
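$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc->loadHTML($html); // parses the HTML string; tag names are normalized to lowercase
$xpath = new DOMXPath($doc);
// string(...) yields the text of the first matching node, or "" if nothing matches
$value = $xpath->evaluate("string(//h1[@class='location_only']/text())");
echo "Your H1 Value=[$value]\n"; // prints text between <h1> and </h1>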

OUTPUT:

Your H1 Value=[(555) 555-5555 is a Landline]
