0

I have a large document. I need to parse it and extract only this part: schule.php?schulnr=80287&lschb=

How do I parse this?

<td>
    <A HREF="schule.php?schulnr=80287&lschb=" target="_blank">
        <center><img border=0 height=16 width=15 src="sh_info.gif"></center>
    </A>
</td>

Love to hear from you

Tae-Sung Shin
zero
  • 3
    Use a regular expression and bow to the dark lord. http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html – Stephen Dec 01 '10 at 23:30
  • I was about to say "what kind of a dolt posts a blog post about how to do this Bad Thing..." then I noticed it was Coding Horror :) [ for the uninitiated, the Coding Horror blog owner is one of the 2 co-founders of StackOverflow and definitely a better programmer than myself :) ] – DVK Dec 02 '10 at 04:35

4 Answers

5

You ought to use a DOM parser such as PHP Simple HTML DOM Parser:

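// Load the library (simple_html_dom.php must be on your include path)
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all links and print each href
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}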
Chris
  • Hi Rfvgyhn, many thanks. I will do this! I'll come back and let you know how it went. Best regards – zero Dec 01 '10 at 23:36
5

In Perl, the quickest and best way I know to scan HTML is HTML::PullParser. It is built on a robust HTML parser, not a simple finite-state automaton like Perl regexes (without recursion).

It works more like a SAX filter than a DOM.

use 5.010;
use constant NOT_FOUND => -1;
use strict;
use warnings;

use English qw<$OS_ERROR>;
use HTML::PullParser ();

my $pp 
    = HTML::PullParser->new(
      # your file or even a handle
      file        => 'my.html'
      # specifies that you want a tuple of tagname, attribute hash
    , start       => 'tag, attr' 
      # you only want to look at tags with tagname = 'a'
    , report_tags => [ 'a' ],
    ) 
    or die "$OS_ERROR"
    ;

my $anchor_url;
while ( defined( my $t = $pp->get_token )) { 
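    # skip anything that is not an 'a' start-tag token (this shouldn't happen, really)
    next unless ref $t and $t->[0] eq 'a';
    my $href = $t->[1]->{href};
    next unless defined $href;    # anchors without an href, e.g. <a name="...">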
    if ( index( $href, 'schule.php?' ) > NOT_FOUND ) { 
        $anchor_url = $href;
        last;
    }
}
Axeman
4

What Rfvgyhn said, but in Perl flavor since that was one of the tags: use HTML::TreeBuilder.
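A minimal sketch of the HTML::TreeBuilder approach, assuming the module is installed from CPAN (it is not part of core Perl). The markup is the snippet from the question, inlined as sample data; the `schule.php?` filter is an assumption about which links you want.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module, not part of core Perl

# Sample markup copied from the question; in practice parse your own file.
my $html = <<'HTML';
<td>
    <A HREF="schule.php?schulnr=80287&lschb=" target="_blank">
        <center><img border=0 height=16 width=15 src="sh_info.gif"></center>
    </A>
</td>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down returns every element that satisfies all criteria.
# The parser lowercases tag and attribute names, so <A HREF=...> matches.
my @hrefs = map { $_->attr('href') }
    $tree->look_down( _tag => 'a', href => qr/schule\.php\?/ );

print "$_\n" for @hrefs;

$tree->delete;    # release the tree explicitly
```

For a real document, `HTML::TreeBuilder->new_from_file('my.html')` can replace the inline string, where `my.html` is a placeholder file name.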

Plus, for reasons as to why RegEx is almost never a good idea to parse XML/HTML (sometimes it's Good Enough With Major Caveats), read the obligatory and infamous StackOverflow post:

RegEx match open tags except XHTML self-contained tags

Mind you, if the full extent of your task is literally "parse out HREF links", AND you don't have "<link>" tags AND the links (e.g. HREF="something" substrings) are guaranteed not to be used in any other context (e.g. in comments, or as text, or have "HREF=" be part of the link itself), it just might fall into the "Good Enough" category above for regex usage:

my @lines = <>; # Replace with proper method of reading in your file
my @hrefs = map { $_ =~ /href="([^"]+)"/gi; } @lines;
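As a self-contained sketch of the same "Good Enough" idea, using only core Perl: this variant matches against the whole document as one string rather than line by line, so a tag that wraps across lines is still found. The inline markup is the snippet from the question, and the `schule.php?` filter is an assumption about which links are wanted.

```perl
use strict;
use warnings;

# Sample markup from the question; replace with the contents of your file.
my $html = <<'HTML';
<td>
    <A HREF="schule.php?schulnr=80287&lschb=" target="_blank">
        <center><img border=0 height=16 width=15 src="sh_info.gif"></center>
    </A>
</td>
HTML

# In list context, /g returns every captured group.
# /i makes the match case-insensitive, so HREF= is found as well.
my @hrefs = $html =~ /href\s*=\s*"([^"]+)"/gi;

# Keep only the links that point at schule.php.
my @schule = grep { index( $_, 'schule.php?' ) >= 0 } @hrefs;

print "$_\n" for @schule;
```

The same caveats apply: this still picks up `href="..."` text inside comments or scripts, and it misses unquoted or single-quoted attribute values.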
DVK
3

You could also do it this way (it's not Perl, but it is more "visual"):

  • Load the document into your browser, if possible
  • Install Firebug extension/add-on
  • Install FirePath extension
  • Copy and paste this XPath expression into the text field labeled "XPath:"

    //a[contains(@href, "schule")]/@href

  • Click "Eval" button.

There are also tools to do this on the command line, e.g. "xmllint" (on Unix):

xmllint --html --xpath '//a[contains(@href, "schule")]/@href' myfile.php.or.html

You could do further processing from there.

knb