-1

I want to get all contents of a section tag in an HTML string using Perl. I'm using the following line of code, but it doesn't seem to work:

$article_content =~ s/^.*?<section>(.*)<\/section>.*?$/$1/;
Andy Lester
  • 2
    Obligatory http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – jordanm Dec 23 '12 at 05:28
  • @jordanm Obligatory response: http://stackoverflow.com/a/4234491/211627 – JDB Dec 23 '12 at 05:32
  • Thanks for your comments. Very educational. – Rod Michael Coronel Dec 23 '12 at 05:39
  • @Cyborgx37 I believe that particular post by tchrist contains a certain amount of irony. – TLP Dec 23 '12 at 06:09
  • @TLP - Perhaps, but in my experience it is more effective to say "That way is possible, but extremely difficult. This way is much simpler." than to say "Don't do it that way. Do it this way. This way is better for reasons you can't understand." – JDB Dec 23 '12 at 20:26
  • @Cyborgx37 Either way, it seems clear that the general recommendation is to not write your own regexes. – TLP Dec 23 '12 at 20:43

3 Answers

1

Change (.*) to (.*?) and see if that helps.
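
To illustrate the difference, here is a minimal sketch. The input string is hypothetical (two sibling sections on one line) and is not taken from the question; it only shows what each variant captures.

```perl
use strict;
use warnings;

# Hypothetical input with two sibling <section> elements.
my $html = '<p>intro</p><section>first</section><p>gap</p><section>second</section>';

# Greedy: (.*) runs on to the LAST </section>, swallowing everything in between.
my ($greedy) = $html =~ m{<section>(.*)</section>}s;

# Non-greedy: (.*?) stops at the FIRST </section> it can reach.
my ($lazy) = $html =~ m{<section>(.*?)</section>}s;

print "greedy: $greedy\n";   # greedy: first</section><p>gap</p><section>second
print "lazy:   $lazy\n";     # lazy:   first
```

With a single section in the input, both variants capture the same text.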

DWright
  • 1
    Note, however, that this will not work perfectly in all scenarios. E.g. `abcxyz` – JDB Dec 23 '12 at 05:37
  • And yes, my approach will not work given @Cyborgx37's example. But since our company is also generating the content, this shouldn't happen (I hope...) – Rod Michael Coronel Dec 23 '12 at 05:41
  • So what ends up in the capture group for the two variants? i.e., (.*) and (.*?) – DWright Dec 23 '12 at 05:42
  • For this case, after I added the /s option, it's the same. If I understand correctly, (.*?) is the non-greedy way, and since there is only one <section> tag in our HTML files, (.*) and (.*?) give the same result... – Rod Michael Coronel Dec 23 '12 at 05:46
  • True, now that I know that you only have one <section>. However, for the kind of match you are trying to do (I'm not talking about whether you are working with HTML or not), you should develop judicious use of (.*?) as a habit. When you know something before and after the group you are trying to match, you almost always want non-greedy. – DWright Dec 23 '12 at 05:52
1

Don't use regular expressions to parse HTML. You cannot reliably parse HTML with regular expressions. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/perl.html for examples of how to properly parse HTML with Perl modules.
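
As a rough sketch of the parser route, the following uses Mojo::DOM. It assumes the Mojolicious distribution is installed (it is not part of core Perl), and the input string is made up for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;    # assumption: Mojolicious is installed from CPAN

# Hypothetical input; note the attribute on the opening tag,
# which a pattern looking for a literal "<section>" would miss.
my $article_content =
    '<html><body><section class="main"><p>Hello</p></section></body></html>';

my $dom     = Mojo::DOM->new($article_content);
my $section = $dom->at('section');    # first <section> element, or undef if none

# content() returns the markup inside the element, without the tag itself.
my $inner = defined $section ? $section->content : '';
print "$inner\n";
```

The selector matches the element regardless of attributes or whitespace inside the tag.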

Andy Lester
  • 1
    This is not exactly right. See http://stackoverflow.com/a/4234491/211627. The correct response should be: "Parsing HTML with regexes is the hard way. Consider something easier, like XYZ." – JDB Dec 23 '12 at 05:36
  • It's close enough for beginners. Tom Christiansen knows the rules so it's OK for him to break them. OP does not. – Andy Lester Dec 23 '12 at 08:02
  • As soon as the HTML changes from your expectations, it will likely break your code no matter how you extract data from the HTML. – ikegami Dec 23 '12 at 13:17
  • What I'm talking about is that OP's code that looks for `<section>` will break if that tag is ever written in another, equally valid form, which should not change the behavior of his program. – Andy Lester Dec 23 '12 at 16:58
  • Engineers are fascinated with "unsolvable" problems. Tell me that something can't be done, and I'll instinctively try to prove you wrong. But show me that something is extremely difficult with little reward, and I'll follow you anywhere for a simpler solution. – JDB Dec 23 '12 at 20:30
1

The first problem is that you assume . matches any character, but that's only the case when using /s; without it, . does not match a newline.
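
A small sketch of the effect, using a hypothetical multi-line string in place of the real $article_content:

```perl
use strict;
use warnings;

# Hypothetical multi-line input, as HTML documents usually are.
my $article_content = "<html>\n<section>\nline one\nline two\n</section>\n</html>\n";

# Without /s, "." cannot cross the newlines, so the match fails.
my ($without_s) = $article_content =~ m{<section>(.*?)</section>};

# With /s, "." also matches "\n", so the pattern can span lines.
my ($with_s) = $article_content =~ m{<section>(.*?)</section>}s;

print defined $without_s ? "without /s: matched\n" : "without /s: no match\n";
print "with /s:$with_s";
```

Here the first match returns nothing, while the second captures everything between the tags, newlines included.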

ikegami