
I'm constructing an application to do some text mining based on keywords in a Linux desktop environment. My goal is to download a web page from a list of WordPress sites using wget, save the page to disk, then separate each article out for further processing. The idea is that I can rank individual articles down the line based on the frequency of certain words. Articles in WordPress blogs tend to follow the convention:

 <article></article> 

with the actual write-up in between. So far I've come up with something like this Perl code:

  $site = "somepage.somedomain"; #can be fed from a database later
  $outfile = "out1.txt"; #can be incremented as we go along
  $wgcommand = "wget --output-document $outfile $site";
  system($wgcommand);

  open SITEIN, '<', $outfile;
  @sitebodyarr = <SITEIN>;
  close SITEIN;

  $pagescaler = join('', @sitebodyarr); #let us parse the page.

  #this is where I have trouble. The thought is to look for a matched pair of tags.
  #WordPress documents are stored between <article> and </article>

  $article =~ m/<article>*<\/article>/$pagescaler/g;

  #I put the /g flag there, but it doesn't seem to get me
  #what I want from the string - *ALL* of the articles one-by-one.

any thoughts on making this match all sets of article tag pairs returned from the html document?

If a regular expression isn't possible, my next thought is to sequentially process the whole array: catch the pattern

   $line =~m/<article>/

and then start a string variable to hold the article contents, concatenating each line onto it until I catch the pattern

   $line =~m/<\/article>/

then store the string, which now contains the article, to my database or disk, and repeat until the end of @sitebodyarr. But I'd really like a one-liner regex if that's possible. If it is, can someone please show me what it would look like?
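For reference, the sequential fallback described above can be sketched like this. It assumes the tags sit on their own lines, as in typical WordPress markup, and @sitebodyarr is stubbed with sample lines for illustration:

```perl
use strict;
use warnings;

# Stand-in for the slurped page; in the real script this comes from wget's output file.
my @sitebodyarr = (
    "<html><body>\n",
    "<article>\n", "first post\n", "</article>\n",
    "<article>\n", "second post\n", "</article>\n",
    "</body></html>\n",
);

my @articles;   # one string per completed article
my $current;    # undef until we are inside an article
for my $line (@sitebodyarr) {
    $current = '' if $line =~ m/<article>/;     # opening tag: start collecting
    $current .= $line if defined $current;      # accumulate while inside
    if (defined $current && $line =~ m/<\/article>/) {
        push @articles, $current;               # closing tag: article complete
        undef $current;
    }
}
print scalar(@articles), " articles found\n";
```

This avoids a multi-line regex entirely at the cost of a small amount of state.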

Micah
  • You should use an XML parser. `XML::Simple` would probably be enough to do what you want. http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – Cfreak Oct 27 '13 at 03:33
  • Are you sure you want to [parse HTML with RegEx](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html)? – brasofilo Oct 27 '13 at 06:21
  • You know somewhere in the back of my mind that codinghorror article did bubble up. But I didn't heed Mr. Atwood's warning because, well, it's just a finite domain of pulling paragraphs out of wordpress when it's all in the same format - not so much so full parsing as seeking out one pair of tags in a limited scenario. +1 for the link though! – Micah Oct 27 '13 at 07:23

2 Answers


Check out the Mojo suite, which includes gorgeous modules like Mojo::DOM – web scraping made fun and easy.

  use strict;
  use warnings;
  use feature 'say';
  use Mojo;

  my $ua = Mojo::UserAgent->new;
  my $request = $ua->get('http://example.com/');
  if (my $resp = $request->success) {   # undef if the request failed
    my $dom = $resp->dom;
    for my $article ($dom->find('article')->each) {
      say "$article";
    }
  }

  # short version:
  say for Mojo::UserAgent->new->get('http://example.com/')->res->dom('article')->each;

You can use CSS selectors to navigate the DOM.
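For example, Mojo::DOM can also parse a string directly, so the page you already saved with wget could be fed to it offline. A sketch, with illustrative markup standing in for a real WordPress page:

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Parse a saved page from a string and navigate it with CSS selectors.
my $html = '<article><h1>Title</h1><p>Body text</p></article>';
my $dom  = Mojo::DOM->new($html);

say $dom->at('article h1')->text;                  # first <h1> inside an <article>
say $_->text for $dom->find('article p')->each;    # every paragraph in an article
```

`at` returns the first element matching a selector, while `find` returns a collection of all matches.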

amon
  • Probably a better choice than trying to come at this problem with RegEx. And it does what I need it to without having to write something akin to a stack to hold a flag for start and end of an article tag in the string. Thanks for this. – Micah Oct 30 '13 at 01:48

> any thoughts on making this match all sets of article tag pairs returned from the html document?

The code below will count how many times each article body appears in the HTML page.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $html_file = 'out1.txt';   # the page saved by wget
    my %counter_hash;
    open my $html_file_handle, '<', $html_file or die "Can't open $html_file: $!";
    while (my $line = <$html_file_handle>) {
        if ($line =~ /<article>(.+?)<\/article>/) {   # matches only when both tags are on one line
            $counter_hash{$1}++;
        }
    }
    foreach my $article (keys %counter_hash) {
        print "$article  ==> $counter_hash{$article}\n";
    }
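One caveat: real WordPress articles usually span many lines, so a strictly per-line match may find nothing. A sketch of a slurp-mode variant, where the whole file is held in one string and the /s modifier lets . match newlines ($page here is a stand-in for the full file contents):

```perl
use strict;
use warnings;

# Slurp-mode counting: /s makes . match newlines, /g walks every article.
my $page = "<article>\nfirst\n</article>\n<article>\nsecond\n</article>\n";

my %counter_hash;
while ($page =~ m/<article>(.+?)<\/article>/sg) {
    $counter_hash{$1}++;
}
print scalar(keys %counter_hash), " distinct articles\n";
```

The non-greedy (.+?) is what keeps each match from swallowing everything up to the last closing tag.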
Gaurav Pant