I'm building an application to do some text mining based on keywords in a Linux desktop environment. My goal is to download a web page from a list of WordPress sites using wget, save the page to disk, then separate each article out for further processing. The idea is that I can rank individual articles later based on the frequency of certain words. Articles in WordPress blogs tend to follow the convention:
<article></article>
with the actual write-up in between. So far I've come up with something like this Perl code:
$site = "somepage.somedomain"; #can be fed from a database later
$outfile = "out1.txt"; #can be incremented as we go along
$wgcommand = "wget --output-document $outfile $site";
system($wgcommand);
open SITEIN, '<', $outfile;
@sitebodyarr = <SITEIN>;
close SITEIN;
$pagescaler = join('', @sitebodyarr); #let us parse the page.
#this is where I have trouble. The thought is to look for a matched pair of tags.
#WordPress articles are stored between <article> and </article>
my @articles = ($pagescaler =~ m/<article>.*<\/article>/g);
#I put the /g flag there, but it doesn't seem to get me
#what I want from the string - *ALL* of the articles one-by-one.
Any thoughts on making this match all of the <article> tag pairs in the HTML document?
If a regular expression isn't possible, my next thought is to process the whole array sequentially: catch the pattern

$line =~ m/<article>/

then start a string variable to hold the article contents, and keep concatenating lines onto it until I catch the pattern

$line =~ m/<\/article>/

then store the string - now containing one article - to my database or disk, and repeat until the end of @sitebodyarr. But I'd really like a one-liner regex if that's possible. If it is, can someone please show me what it would look like?
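In case it helps, here's a rough sketch of that fallback approach - just my own sketch, assuming the <article> and </article> tags each land on their own lines (real pages may not be that tidy):

my @articles;
my $current; # undef while we're outside an article
for my $line (@sitebodyarr) {
    if ($line =~ m/<article>/) {
        $current = ''; # opening tag seen: start collecting a new article
    }
    elsif ($line =~ m/<\/article>/) {
        push @articles, $current if defined $current; # closing tag: article done
        undef $current;
    }
    elsif (defined $current) {
        $current .= $line; # accumulate lines between the tags
    }
}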