Extract section contents from HTML using perl

Question

I want to get all contents of a section tag in an HTML string using perl. I'm using the following line of code, but it doesn't seem to work:

$article_content =~ s/^.*?<section>(.*)<\/section>.*?$/$1/;

Obligatory http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — jordanm, Dec 23 '12 at 05:28
@jordanm Obligatory response: http://stackoverflow.com/a/4234491/211627 — JDB, Dec 23 '12 at 05:32
@Cyborgx37 I believe that particular post by tchrist contains a certain amount of irony. — TLP, Dec 23 '12 at 06:09
@TLP - Perhaps, but in my experience it is more effective to say "That way is possible, but extremely difficult. This way is much simpler." than to say "Don't do it that way. Do it this way. This way is better for reasons you can't understand." — JDB, Dec 23 '12 at 20:26
@Cyborgx37 Either way, it seems clear that the general recommendation is to not write your own regexes. — TLP, Dec 23 '12 at 20:43

score 1 · Answer 1 · answered Dec 23 '12 at 05:27

Change (.*) to (.*?) and see if that helps.

DWright

Note, however, that this will not work perfectly in all scenarios. E.g. `abcxyz` – JDB Dec 23 '12 at 05:37
And yes, my approach will not work given @Cyborgx37's example. But since our company is also generating the content, this shouldn't happen (I hope...) – Rod Michael Coronel Dec 23 '12 at 05:41
So what ends up in the capture group for the two variants? i.e., (.*) and (.*?) – DWright Dec 23 '12 at 05:42
For this case, after I added the /s option, it's the same. If I understand correctly, (.*?) is the non-greedy way, and since there is only one <section> tag in our HTML files, (.*) and (.*?) have the same result... – Rod Michael Coronel Dec 23 '12 at 05:46
True, now that I know that you only have one <section>. However, for the kind of match you are trying to do (I'm not talking about whether you are working with HTML or not), you should develop judicious use of (.*?) as a habit. When you know something before and after the group you are trying to match, you almost always want non-greedy. – DWright Dec 23 '12 at 05:52
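
To make the greedy versus non-greedy difference discussed in this thread concrete, here is a minimal sketch. The input string and the variable names are hypothetical, chosen only so that there are two sections and the two captures differ; with a single section both patterns would capture the same text.

```perl
use strict;
use warnings;

# Hypothetical input with two sections, so the difference is visible.
my $html = '<section>first</section><p>gap</p><section>second</section>';

# Greedy: (.*) runs on to the LAST closing tag in the string.
my ($greedy) = $html =~ /<section>(.*)<\/section>/s;

# Non-greedy: (.*?) stops at the FIRST closing tag.
my ($lazy) = $html =~ /<section>(.*?)<\/section>/s;

print "greedy: $greedy\n";   # greedy: first</section><p>gap</p><section>second
print "lazy:   $lazy\n";     # lazy:   first
```

Neither variant copes with nested sections, which is the limitation raised in the first comment above.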

score 1 · Answer 2 · answered Dec 23 '12 at 05:28

Don't use regular expressions to parse HTML. You cannot reliably parse HTML with regular expressions. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/perl.html for examples of how to properly parse HTML with Perl modules.
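
As a rough illustration of the parser-based approach, here is a sketch using Mojo::DOM. The choice of module is an assumption on my part (the answer only points to the linked page, and this requires Mojolicious to be installed from CPAN), and the input string is hypothetical.

```perl
use strict;
use warnings;
use Mojo::DOM;    # assumption: Mojolicious is installed from CPAN

# Hypothetical input; note the attribute on the opening tag.
my $html = '<html><body><section class="main"><p>Hello</p></section></body></html>';

my $dom     = Mojo::DOM->new($html);
my $section = $dom->at('section');    # first <section> element, or undef if none

# content() returns the element's inner markup as a string.
my $inner = defined $section ? $section->content : '';

print "$inner\n";
```

Because the lookup is by element name, attributes or extra whitespace inside the opening tag should not affect the result.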

Andy Lester

This is not exactly right. See http://stackoverflow.com/a/4234491/211627. The correct response should be: "Parsing HTML with regexes is the hard way. Consider something easier, like XYZ." – JDB Dec 23 '12 at 05:36
It's close enough for beginners. Tom Christiansen knows the rules so it's OK for him to break them. OP does not. – Andy Lester Dec 23 '12 at 08:02
As soon as the HTML changes from your expectations, it will likely break your code no matter how you extract data from the HTML. – ikegami Dec 23 '12 at 13:17
What I'm talking about is that OP's code that looks for `<section>` will break if that ever becomes `…` or `…`, which are perfectly valid and should not change the behavior of his program. – Andy Lester Dec 23 '12 at 16:58
Engineers are fascinated with "unsolvable" problems. Tell me that something can't be done, and I'll instinctively try to prove you wrong. But show me that something is extremely difficult with little reward, and I'll follow you anywhere for a simpler solution. – JDB Dec 23 '12 at 20:30

score 1 · Answer 3 · answered Dec 23 '12 at 13:16

The first problem is that you assume . matches any character, but that's only the case when using the /s modifier. Without it, . does not match a newline.
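
A small sketch of that behavior, assuming a section that spans several lines. The input string and variable names are hypothetical, and the capture is done with a match in list context rather than the substitution from the question.

```perl
use strict;
use warnings;

# Hypothetical input: the section spans several lines.
my $html = "<html>\n<section>\nline one\nline two\n</section>\n</html>\n";

# Without /s, "." cannot cross a newline, so the pattern never reaches </section>.
my $without_s = $html =~ /<section>(.*?)<\/section>/ ? 1 : 0;

# With /s, "." also matches newlines and the capture succeeds.
my ($inner) = $html =~ /<section>(.*?)<\/section>/s;

print "matched without /s: $without_s\n";   # matched without /s: 0
print "captured with /s: [$inner]\n";
```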

ikegami
