What's the easiest way to extract a piece of data from HTML in PHP?

Question

I'm working with a small subset of mostly invalid HTML, and I need to extract a small piece of data. Given that most of the markup isn't valid, I don't think loading everything into a DOM is a good option. Moreover, it seems like a lot of overhead for this simple case.

Here's an example of the markup that I have:

(a bunch of invalid markup here with unclosed tags, etc.)
<TD><span>Something (random text here)</span></TD>
(a bunch more invalid markup here with more unclosed tags.)

The <TD><span>Something (random text here)</span></TD> portion does not repeat itself anywhere in the document, so I believe a simple regex would do the trick.

However, I'm terrible with regular expressions.

Should I use a regular expression? Is there a simpler way to do this? If possible, I'd just like to extract the text after Something, i.e. the (random text here) portion.

Thanks in advance!

Edit -

Exact example of the HTML (I've omitted the stuff prior, which is the invalid markup that the vendor uses. It's irrelevant for this example, I believe):

<div class="FormTable">
        <TABLE>
        <TR>
                <TD colspan="2">In order to proceed with login operation please 
                answer on the security question below</TD>
        </TR>
        <TR>
                <TD colspan="2">&nbsp;</TD>
        </TR>
        <TR>
                <TD><label class="FormLabel">Security Question</label></TD>
                <TD><span>What is your city of birth?</span></TD>
        </TR>
        <TR>
                <TD><label class="FormLabel">Answer</label></TD>
                <TD><INPUT name="securityAnswer" class="input" type="password" value=""></TD>
        </TR>
        </TABLE>
</div>

possible duplicate of [Best methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html) - which is the [first question in PHP FAQ btw](http://stackoverflow.com/questions/tagged/php?sort=faq&pagesize=50) — Gordon, Feb 08 '11 at 15:05
I guess the biggest question would be - are there nested `` tags? If not, I think regex should be fine, if that is all you're looking for. Could you give us an example of the actual HTML? — BlueRaja - Danny Pflughoeft, Feb 08 '11 at 15:05

Mike Sherrill 'Cat Recall' · Accepted Answer · 2011-02-08T15:15:28.680

If you're sure the opening and closing span tags are on a single line . . .

$ cat test.php
<?php
  $subject = "(a bunch of invalid markup here with unclosed tags, etc.)
              <TD><span>Something (random text here)</span></TD>
              (a bunch more invalid markup here with more unclosed tags.)";

  $pattern = '/<span>.*<\/span>/';

  preg_match($pattern, $subject, $matches);
  print_r($matches);

?>


$ php -f test.php
Array
(
    [0] => <span>Something (random text here)</span>
)

If you're not confident that the span tags are on the same line, you can treat the HTML as a text file and grep for the span tags.

$ grep '[</]span>' yourfile.html
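
To pull out just the text rather than the whole tag, a capture group can be added to the pattern. This is a minimal sketch, assuming the target `<span>` carries no attributes and is the first such span in the input; the `extractSpanText` helper and the `$html` sample are illustrative names, not part of the original answer.

```php
<?php
// Hypothetical helper: returns the text inside the first attribute-less
// <span>, or null when no such span is found.
function extractSpanText(string $html): ?string
{
    // (.*?) is non-greedy, so it stops at the first </span>.
    // The s flag lets . match newlines; the i flag ignores tag case.
    if (preg_match('/<span>(.*?)<\/span>/is', $html, $matches) === 1) {
        return trim($matches[1]);
    }
    return null;
}

// Sample taken from the markup in the question.
$html = "<TD><label class=\"FormLabel\">Security Question</label></TD>\n"
      . "<TD><span>What is your city of birth?</span></TD>";

echo extractSpanText($html), "\n";
```

The non-greedy quantifier matters if a second span ever appears on the same line: a greedy `.*` would run through to the last `</span>`.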

Wow, didn't realize it was so simple. Works perfect for this case. Thanks a bunch. — Ian P, Feb 08 '11 at 15:11

score 1 · Answer 2 · edited May 23 '17 at 12:14

You might read through this answer and the other two it cites. Tackling invalid HTML a bit at a time is actually something you're apt to have better luck with using regexes than using full parsers.

answered Feb 08 '11 at 15:02

tchrist

score 1 · Answer 3 · answered Feb 08 '11 at 17:38

A DOM parser is not optimal in your situation. I strongly believe that you need a SAX parser: it just extracts parts of your document and sends the appropriate events to your handlers. This approach makes it easy to parse broken documents.

Examples: http://pear.php.net/package/XML_HTMLSax3 http://www.php.net/manual/en/example.xml-structure.php
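
As a rough illustration of the event-driven style, here is a sketch using PHP's built-in `xml_parser_*` functions (the API behind the second link). It assumes the XML extension is enabled and, importantly, that the fragment is well-formed: the built-in parser stops at the first well-formedness error, so for genuinely broken markup a more tolerant SAX-style parser would be needed. The variable names are illustrative.

```php
<?php
// A well-formed fragment based on the question's markup (assumption:
// the real vendor page would need cleaning before this parser accepts it).
$fragment = '<TR><TD><label class="FormLabel">Security Question</label></TD>'
          . '<TD><span>What is your city of birth?</span></TD></TR>';

$inSpan = false; // true while the parser is inside a <span>
$text   = '';    // accumulated character data from inside the span

$parser = xml_parser_create();

// Case folding is on by default, so element names arrive uppercased.
xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$inSpan) {
        if ($name === 'SPAN') {
            $inSpan = true;
        }
    },
    function ($p, $name) use (&$inSpan) {
        if ($name === 'SPAN') {
            $inSpan = false;
        }
    }
);

// Character data may be delivered in several chunks, so append each one.
xml_set_character_data_handler(
    $parser,
    function ($p, $data) use (&$inSpan, &$text) {
        if ($inSpan) {
            $text .= $data;
        }
    }
);

$ok = xml_parse($parser, $fragment, true);

echo $ok === 1 ? $text : 'parse error', "\n";
```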

score 0 · Answer 4 · edited Feb 15 '12 at 09:17

Try using the DOMDocument::loadHTML() method; it should suppress any validation errors associated with HTML.

answered Feb 08 '11 at 15:05

Matt

While I second using DOM for this, the answer is incorrect. `loadHTML` will not suppress validation errors. If you want to suppress parsing errors, you have to use [`libxml_use_internal_errors()`](http://de3.php.net/manual/en/function.libxml-use-internal-errors.php). – Gordon Feb 08 '11 at 15:23
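
Putting the comment's correction into practice, the following sketch loads invalid markup with `DOMDocument::loadHTML()` while `libxml_use_internal_errors()` keeps parse errors from surfacing as PHP warnings. It assumes the DOM extension is enabled; the sample markup (including the deliberately unclosed `<p>`) and the XPath expression are illustrative and would need adjusting to the real page.

```php
<?php
// Sample based on the question, with an unclosed <p> added to mimic
// the vendor's invalid markup.
$html = '<div class="FormTable"><p>unclosed paragraph<TABLE><TR>'
      . '<TD><label class="FormLabel">Security Question</label></TD>'
      . '<TD><span>What is your city of birth?</span></TD>'
      . '</TR></TABLE></div>';

// Collect libxml parse errors internally instead of emitting warnings,
// remembering the previous setting so it can be restored.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Discard the collected errors and restore the earlier error mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The HTML parser lowercases element names, so the query uses td/span.
// It selects the span in the cell that follows the "Security Question" label.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//td[label="Security Question"]/following-sibling::td/span');

$question = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;

echo $question, "\n";
```

Anchoring the query on the label text rather than on position makes the lookup less sensitive to how the parser repairs the surrounding broken markup.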
