2

I'm working with a small chunk of mostly invalid HTML, and I need to extract a small piece of data from it. Since most of the "markup" isn't valid, I don't think that loading everything into a DOM is a good option. It also seems like a lot of overhead for such a simple case.

Here's an example of the markup that I have:

(a bunch of invalid markup here with unclosed tags, etc.)
<TD><span>Something (random text here)</span></TD>
(a bunch more invalid markup here with more unclosed tags.)

The <TD><span>Something (random text here)</span></TD> portion does not repeat itself anywhere in the document, so I believe a simple regex would do the trick.

However, I'm terrible with regular expressions.

Should I use a regular expression? Is there a simpler way to do this? If possible, I'd just like to extract the text after Something, i.e. the (random text here) portion.

Thanks in advance!

Edit -

Exact example of the HTML (I've omitted the stuff prior, which is the invalid markup that the vendor uses. It's irrelevant for this example, I believe):

<div class="FormTable">
        <TABLE>
        <TR>
                <TD colspan="2">In order to proceed with login operation please 
                answer on the security question below</TD>
        </TR>
        <TR>
                <TD colspan="2">&nbsp;</TD>
        </TR>
        <TR>
                <TD><label class="FormLabel">Security Question</label></TD>
                <TD><span>What is your city of birth?</span></TD>
        </TR>
        <TR>
                <TD><label class="FormLabel">Answer</label></TD>
                <TD><INPUT name="securityAnswer" class="input" type="password" value=""></TD>
        </TR>
        </TABLE>
</div>  
Ian P
  • possible duplicate of [Best methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html) - which is the [first question in PHP FAQ btw](http://stackoverflow.com/questions/tagged/php?sort=faq&pagesize=50) – Gordon Feb 08 '11 at 15:05
  • I guess the biggest question would be - are there nested `` tags? If not, I think regex should be fine, if that is all you're looking for. Could you give us an example of the actual HTML? – BlueRaja - Danny Pflughoeft Feb 08 '11 at 15:05

4 Answers

2

If you're sure the opening and closing span tags are on a single line . . .

$ cat test.php
<?php
  $subject = "(a bunch of invalid markup here with unclosed tags, etc.)
              <TD><span>Something (random text here)</span></TD>
              (a bunch more invalid markup here with more unclosed tags.)";

  $pattern = '/<span>.*<\/span>/';

  preg_match($pattern, $subject, $matches);
  print_r($matches);

?>


$ php -f test.php
Array
(
    [0] => <span>Something (random text here)</span>
)

If you're not confident that the span tags are on the same line, you can treat the HTML as a text file and grep for the span tags.

$ grep '[</]span>' yourfile.html
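
To pull out only the text inside the span rather than the whole tag, a capture group can be added to the pattern. The following is a minimal sketch, not a definitive solution: it assumes the target is the `<TD><span>...</span></TD>` cell from the question's edit, and the variable names `$html` and `$question` are chosen here for illustration. The lazy quantifier stops at the first closing span, and the `s` modifier lets the match cross line breaks.

```php
<?php
// Sample input copied from the question's edit (trimmed to one row).
$html = <<<'HTML'
<TR>
        <TD><label class="FormLabel">Security Question</label></TD>
        <TD><span>What is your city of birth?</span></TD>
</TR>
HTML;

// Group 1 captures the span's text.
// (.*?) is lazy, so it stops at the first </span>.
// "s" lets "." match newlines; "i" makes the tag names case-insensitive.
$pattern = '~<TD>\s*<span>(.*?)</span>\s*</TD>~is';

$question = null;
if (preg_match($pattern, $html, $matches)) {
    $question = trim($matches[1]);
}

echo $question, PHP_EOL; // prints: What is your city of birth?
```

Because the pattern is anchored on the surrounding `<TD>` tags, it only works as long as that cell really is unique in the document, as the question states.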
Mike Sherrill 'Cat Recall'
1

You might read through this answer and the other two it cites. Tackling invalid HTML a bit at a time is actually something you're apt to have better luck with using regexes than using full parsers.

Community
tchrist
1

A DOM parser is not optimal in your situation. I strongly believe that you need a SAX parser: it just extracts parts of your document and sends the appropriate events to your handlers. This approach makes it easy to parse broken documents.

Examples: http://pear.php.net/package/XML_HTMLSax3 http://www.php.net/manual/en/example.xml-structure.php

Yuri Subach
0

Try using the DOMDocument::loadHTML() method; it should suppress any validation errors associated with the HTML.
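
As the comment on this answer points out, `loadHTML()` does not silence parse errors by itself. A rough sketch of the DOM route, assuming the markup from the question's edit and using `libxml_use_internal_errors()` to keep the warnings quiet, might look like this. The XPath expression and the variable names are illustrative choices, not something taken from the vendor's page.

```php
<?php
// Markup copied from the question's edit (trimmed to the relevant rows).
$html = <<<'HTML'
<div class="FormTable">
        <TABLE>
        <TR>
                <TD><label class="FormLabel">Security Question</label></TD>
                <TD><span>What is your city of birth?</span></TD>
        </TR>
        <TR>
                <TD><label class="FormLabel">Answer</label></TD>
                <TD><INPUT name="securityAnswer" class="input" type="password" value=""></TD>
        </TR>
        </TABLE>
</div>
HTML;

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Discard the collected errors and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The HTML parser lowercases element names, so the query uses td/label/span.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//td[label="Security Question"]/following-sibling::td[1]/span');

$question = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
echo $question, PHP_EOL;
```

Keying the query on the "Security Question" label rather than on position means it should still find the right cell if the vendor adds rows above it, though that depends on the label text staying the same.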

Curtis
Matt
  • While I second using DOM for this, the answer is incorrect. `loadHTML` will not suppress validation errors. If you want to suppress parsing errors, you have to use [`libxml_use_internal_errors()`](http://de3.php.net/manual/en/function.libxml-use-internal-errors.php). – Gordon Feb 08 '11 at 15:23