Extract contents of paragraph tag using a Perl one-liner

Question

I would like to match the contents of a paragraph tag using a Perl regex one-liner. The paragraph is something like this:

<p style="font-family: Calibri,Helvetica,serif;">Text I want to extract</p>

So far I have been using something like this:

perl -nle 'm/<p>($.)<\/p>/ig; print $1' file.html

Any ideas appreciated

thanks

Why regular expressions? They aren't generally well suited to HTML parsing. `perl -MHTML::TreeBuilder -e'print HTML::TreeBuilder->new_from_file("filename.html")->find("p")->as_text'` – Quentin, Feb 17 '11 at 11:58
@David: I use HTML::TreeBuilder quite a bit in programs, but I confess it never occurred to me to use it in a one-liner! – tchrist, Feb 17 '11 at 12:32
This works perfectly... any idea how you would pass *.html instead of filename.html? I would like to get all paragraphs from the files in a directory. – John, Feb 17 '11 at 16:10
`print map {HTML::TreeBuilder->new_from_file($_)->find("p")->as_text} grep {/.*\.html/} File::Util->list_dir('/some/dir');` or the like? – Oesor, Feb 17 '11 at 18:39

score 5 · Answer 1 · edited May 23 '17 at 10:27

Mandatory link to what happens when you try to parse HTML with regular expressions.

David Dorward's suggestion in the comments to use HTML::TreeBuilder is a good one. Another good way to do this is with HTML::DOM:

perl -MHTML::DOM -e 'my $dom = HTML::DOM->new(); $dom->parse_file("file.html"); my @p = $dom->getElementsByTagName("p"); print $p[0]->innerText();'

edited May 23 '17 at 10:27

Community

answered Feb 17 '11 at 12:12

mscha

Mandatory links [here](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491) and [here](http://stackoverflow.com/questions/4284176/doubt-in-parsing-data-in-perl-where-am-i-going-wrong/4286326#4286326) showing what happens when *I* – but maybe not *you* ☺ – try to parse HTML with regexes. [Here’s one more explaining](http://stackoverflow.com/questions/4933611/can-extended-regex-implementations-parse-html/4934590#4934590) that just because you can doesn’t mean you should. – tchrist Feb 17 '11 at 12:24

score 1 · Accepted Answer · answered Feb 17 '11 at 11:53

`$` in a pattern means end-of-string, and you also need to match everything inside the opening p tag (its attributes) in a non-greedy way:

perl -nle 'm/<p.*?>(.+)<\/p/ig; print $1' test.html

answered Feb 17 '11 at 11:53

w.k

This does the trick more or less, although comments on the futility of parsing HTML with regular expressions are also valid. I have to use regular expressions because I don't have the root password to the box I'm trying to do this on, which precludes the installation of any nice HTML::TreeBuilder libraries. – John Feb 17 '11 at 13:23
@John, you don't need root to install modules. http://stackoverflow.com/questions/540640/how-can-i-install-a-cpan-module-into-a-local-directory – tadmc Feb 17 '11 at 13:36
@tadmc, thanks for that, unfortunately I have never had much success with installing Perl modules locally. – John Feb 17 '11 at 16:06
