I would like to match the contents of a paragraph tag using a Perl regex one-liner. The paragraph looks something like this:

<p style="font-family: Calibri,Helvetica,serif;">Text I want to extract</p>

So far I have been using something like this:

perl -nle 'm/<p>($.)<\/p>/ig; print $1' file.html

Any ideas appreciated

thanks

John
  • Why regular expressions? They aren't generally well suited to HTML parsing. `perl -MHTML::TreeBuilder -e'print HTML::TreeBuilder->new_from_file("filename.html")->find("p")->as_text'` – Quentin Feb 17 '11 at 11:58
  • @David: I use HTML::TreeBuilder quite a bit in programs, but I confess it never occurred to me to use it in a one-liner! – tchrist Feb 17 '11 at 12:32
  • This works perfectly... any idea how you would pass *.html instead of filename.html? I would like to get all paragraphs from the files in a directory. – John Feb 17 '11 at 16:10
  • print map {HTML::TreeBuilder->new_from_file($_)->find("p")->as_text} grep {/\.html$/} File::Util->list_dir('/some/dir'); or the like? – Oesor Feb 17 '11 at 18:39

2 Answers

Mandatory link to what happens when you try to parse HTML with regular expressions.

David Dorward's suggestion in the comments to use HTML::TreeBuilder is a good one. Another good way to do this is with HTML::DOM:

perl -MHTML::DOM -e 'my $dom = HTML::DOM->new(); $dom->parse_file("file.html"); my @p = $dom->getElementsByTagName("p"); print $p[0]->innerText();'
mscha
  • Mandatory links [here](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491) and [here](http://stackoverflow.com/questions/4284176/doubt-in-parsing-data-in-perl-where-am-i-going-wrong/4286326#4286326) showing what happens when *I* – but maybe not *you* ☺ – try to parse HTML with regexes. [Here’s one more explaining](http://stackoverflow.com/questions/4933611/can-extended-regex-implementations-parse-html/4934590#4934590) that just because you can doesn’t mean you should. – tchrist Feb 17 '11 at 12:24

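In the pattern, $ means 'end of string', so it is not what you want there. You also need to allow for attributes inside the opening p tag and capture the contents in a non-greedy way, printing only when the line actually matches:

perl -nle 'print $1 if m/<p.*?>(.+?)<\/p>/i' test.html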

w.k
  • This does the trick more or less, although comments on the futility of parsing HTML with regular expressions are also valid. I have to use regular expressions because I don't have the root password to the box I'm trying to do this on, which precludes the installation of any nice HTML::TreeBuilder libraries. – John Feb 17 '11 at 13:23
  • @John, you don't need root to install modules. http://stackoverflow.com/questions/540640/how-can-i-install-a-cpan-module-into-a-local-directory – tadmc Feb 17 '11 at 13:36
  • @tadmc, thanks for that. Unfortunately, I have never had much success with installing Perl modules locally. – John Feb 17 '11 at 16:06