0

I need a regex that matches everything that is not inside a <div> tag. For example:

foobar<p>lol</p><div>something</div>blahblah

It should match foobar<p>lol</p> and blahblah.

Mat
  • May I have the honour? [Thou shalt not try to parse HTML with a regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). Use an XML parser instead, like [lxml](http://lxml.de/) for Python. – Manuel Leuenberger Jun 25 '11 at 11:29
  • [HTML::TreeBuilder::XPath](http://search.cpan.org/perldoc?HTML::TreeBuilder::XPath) is a good choice in Perl land. – Quentin Jun 25 '11 at 12:34

3 Answers

3

As Mat and maenu already pointed out, using regexps to parse HTML is, to say the least, error-prone. Since you tagged your question with the perl tag, I'll give you a small example using HTML::TokeParser::Simple, which I think is a good choice for this kind of manipulation.

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new( *DATA );

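my $div_depth = 0;    # number of <div> elements we are currently nested in
while ( my $token = $parser->get_token ) {
    if ( $token->is_start_tag( 'div' ) ) {
        $div_depth++;
        next;
    }
    if ( $token->is_end_tag( 'div' ) ) {
        $div_depth-- if $div_depth > 0;    # ignore a stray </div>
        next;
    }
    # Print tokens verbatim only while we are outside every <div>
    print $token->as_is if not $div_depth;
}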

__DATA__
foobar<p>lol</p><div>something</div>blahblah
foobar<p>lol</p><div>more stuff<div>something</div></div>blahblah
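
If you want the pieces back as a list (as in your "should match foobar<p>lol</p> and blahblah" example) rather than printed in one stream, the same depth-counting idea can be sketched as below. The helper name fragments_outside_divs is my own invention for illustration, and I'm assuming a version of HTML::TokeParser::Simple recent enough to accept the string => $html constructor form; treat it as a sketch, not a drop-in solution.

```perl
use strict;
use warnings;

use HTML::TokeParser::Simple;

# Hypothetical helper (not part of any module): returns the pieces of
# $html that lie outside every <div>, one element per contiguous run.
sub fragments_outside_divs {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new( string => $html );

    my @fragments;
    my $current = '';
    my $depth   = 0;

    while ( my $token = $parser->get_token ) {
        if ( $token->is_start_tag('div') ) {
            # Entering an outermost <div> closes the current fragment.
            if ( $depth == 0 && length $current ) {
                push @fragments, $current;
                $current = '';
            }
            $depth++;
            next;
        }
        if ( $token->is_end_tag('div') ) {
            $depth-- if $depth > 0;    # ignore a stray </div>
            next;
        }
        # Keep the original markup of anything outside all <div>s.
        $current .= $token->as_is if $depth == 0;
    }
    push @fragments, $current if length $current;
    return @fragments;
}

print "$_\n" for fragments_outside_divs(
    'foobar<p>lol</p><div>more stuff<div>something</div></div>blahblah'
);
```

Because the counter tracks nesting depth rather than a simple in/out flag, the nested-div input is handled the same way as the flat one.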
cjm
larsen
0

I'm not sure what you're trying to accomplish, and there is a big caveat that this won't work on all HTML (see here), but the following might do the trick:

#!/opt/perl/bin/perl

use strict;
use warnings;
use 5.010;

my $html = 'foobar<p>lol</p><div>something</div>blahblah';

my @fragments = split(m{<div\b[^>]*>.*?</div>}is, $html);
say foreach @fragments;

See perldoc -f split and perldoc perlre for more info.
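
To make the caveat concrete, here is a small sketch of where the split approach breaks down: the non-greedy .*? stops at the first </div> it finds, so a nested <div> leaves a stray closing tag in the output. The wrapper name split_outside_divs is made up for this illustration; the pattern is the one from the code above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the split shown above.
sub split_outside_divs {
    my ($html) = @_;
    return split m{<div\b[^>]*>.*?</div>}is, $html;
}

# Flat input: works as intended.
my @flat = split_outside_divs(
    'foobar<p>lol</p><div>something</div>blahblah'
);
print "$_\n" for @flat;
# prints:
#   foobar<p>lol</p>
#   blahblah

# Nested input: the match ends at the inner </div>,
# so the outer </div> leaks into the second fragment.
my @nested = split_outside_divs(
    'foobar<p>lol</p><div>more stuff<div>something</div></div>blahblah'
);
print "$_\n" for @nested;
# prints:
#   foobar<p>lol</p>
#   </div>blahblah
```

So this is fine for flat, well-behaved input, but anything with nested divs needs a real parser or at least a depth counter.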

Community
mscha
-1

Select *:not(div).

daxim
  • Regular expressions don't use CSS selectors. – Quentin Jun 25 '11 at 12:33
  • Even if they did, that would find everything that **is not** a div, and the question is looking for everything that **is not inside** a div. – Quentin Jun 25 '11 at 12:35
  • So: **1** that isn't what the question is asking for, **2** a selector is useless without a selector engine and you don't even bother to suggest one that could be used, and **3** that selector would be insufficient anyway. – Quentin Jun 25 '11 at 12:36