Regex to keep all but DIV contents

Question

I am using jEdit, and I have a bunch of badly coded HTML files from which I want to grab the main content and not the surrounding HTML.

I need everything in between <div class="main-text"> and the next </div>.

There must be a regex way of doing this; jEdit allows me to find and replace with regular expressions.

I am not proficient with regex and it would take me a long time to work it out. Can anyone help quickly, please?

score 1 · Accepted Answer · edited May 23 '17 at 10:34

Taking your question literally, you can replace:

/.*<div class="main-text">(.*?)<\/div>.*/

with \1 (or $1 depending on what your editor uses).
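As a rough sketch of that replacement outside the editor, here it is in Python rather than jEdit. The function name `keep_main_text` and the sample markup are made up for illustration. One assumption worth flagging: in many regex engines `.` does not match a newline by default, so on a multi-line file the pattern may find nothing unless a dot-matches-all option is enabled (`re.DOTALL` here).

```python
import re

# The pattern from the answer, without the /.../ delimiters.
# re.DOTALL lets "." match newlines so the pattern can span lines.
PATTERN = re.compile(r'.*<div class="main-text">(.*?)</div>.*', re.DOTALL)

def keep_main_text(html: str) -> str:
    """Replace the whole document with the first capture group."""
    return PATTERN.sub(r"\1", html, count=1)

sample = (
    "<html><body>\n"
    '<div class="main-text">\nHello\n</div>\n'
    "<p>footer</p></body></html>"
)
print(repr(keep_main_text(sample)))
```

If the pattern does not match at all, `sub` returns the input unchanged, which is a quick way to spot files that need a manual look.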

However, The Pony He Comes to bite you, because what if your "main-text" element contains another <div>? If you're sure this will not happen, then you're fine. Otherwise, you're in trouble. It may be easier to replace /.*<div class="main-text">/ with the empty string, then manually look for the end and delete everything after.
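To make the nested <div> problem concrete, here is a small Python sketch. The sample markup is invented for illustration; the point is only that the lazy group stops at the first </div> it meets, not at the one that closes "main-text".

```python
import re

PATTERN = re.compile(r'<div class="main-text">(.*?)</div>', re.DOTALL)

# Hypothetical input: the target div contains another div.
nested = '<div class="main-text">intro <div class="note">aside</div> outro</div>'

captured = PATTERN.search(nested).group(1)
print(captured)  # the capture ends at the inner </div>, so " outro" is lost
```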

For that matter, this task may be easiest to do manually, so you don't have to double-check after your code has run.

edited May 23 '17 at 10:34 by Community
answered Jan 22 '13 at 14:35 by Niet the Dark Absol

LOL what the HECK is that pony he comes stuff? – Chud37 Jan 22 '13 at 14:36
It is the final throes of a valiant warrior who has been asked to parse HTML with Regex one too many times. RIP, valiant warrior, your message shall live on forever. – Niet the Dark Absol Jan 22 '13 at 14:37
Heh :) that was pretty cool. Unfortunately jEdit returned no results from your regex. – Chud37 Jan 22 '13 at 14:40
And you bloody well should be. Use a proper parser. ;) – fgysin Jan 22 '13 at 15:53
One fundamental distinction should be made between "using regex to parse html" and "using regex to MATCH something inside some html file(s)". Depending on scope, the Pony might or might not come. – heltonbiker Jan 22 '13 at 16:38
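For the "use a proper parser" route mentioned in these comments, one possible sketch uses Python's standard-library `html.parser`. The class and function names are hypothetical, and it assumes only the first <div> whose class list contains "main-text" is wanted. It counts <div> depth, so nested divs are kept whole.

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Collect the markup inside the first <div class="main-text">."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.depth = 0      # number of open <div>s inside the target (0 = outside)
        self.done = False   # True once the target div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
            self.parts.append(self.get_starttag_text())
        elif not self.done and tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if "main-text" in classes:
                self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")

def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

page = (
    '<body><div class="main-text">intro <div class="note">aside</div>'
    " outro</div><p>footer</p></body>"
)
print(extract_main_text(page))
```

This is more code than a one-line regex, but it does not depend on the target div being free of nested divs.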

Minko Gechev · Answer 2 · 2013-01-22T14:40:06.437

This regex should solve your problem: /<\s*div\s+class="main-text"[^>]*>(.*?)<\/div>/gi

Here is an example in Perl:

use strict;
use warnings;
my $str = '<div class="main-text"> and the next </div>';
# Print the captured contents only if the pattern actually matched.
print $1 if $str =~ /<\s*div\s+class="main-text"[^>]*>(.*?)<\/div>/gi;

The example is in Perl, but the regular expression itself is not specific to Perl.

Here is an explanation of the regex:

/       - start of the regex
   <\s*    - matches < followed by optional whitespace
      div     - matches "div"
         \s+     - matches one or more whitespace characters after <div
         class="main-text"    - matches class="main-text" (so <div class="main-text" up to here)
         [^>]*       - matches any characters except >, because the div may have more attributes
         >          - matches >, so <div class="main-text"> up to here
      (.*?)        - matches as little as possible until </div> and saves it in $1
   <\/div>        - matches </div>, so the whole match is <div class="main-text">( and the next )</div>
/gi       - end of the regex; g matches globally and i makes the regex case insensitive
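The same breakdown can be written as a commented pattern. Below is a sketch in Python using `re.VERBOSE`. The name `extract_main_text` is made up, and `re.DOTALL` is an addition that is not in the original regex, included on the assumption that the div contents may span several lines.

```python
import re

MAIN_TEXT = re.compile(
    r"""
    <\s*                 # "<", optionally followed by whitespace
    div                  # the tag name
    \s+                  # one or more whitespace characters
    class="main-text"    # the class attribute, matched literally
    [^>]*                # any further attributes, up to the closing ">"
    >                    # end of the opening tag
    (.*?)                # group 1: as little as possible before "</div>"
    </div>               # the closing tag
    """,
    re.VERBOSE | re.IGNORECASE | re.DOTALL,
)

def extract_main_text(html: str) -> list[str]:
    """Return the contents of every matching div (the /g behaviour)."""
    return MAIN_TEXT.findall(html)

print(extract_main_text('<div class="main-text"> and the next </div>'))
```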

Yes, but nothing stops us from insuring ourselves. Not all HTML is valid. – Minko Gechev Jan 22 '13 at 14:40
What if the text content includes something like `profit < dividend`, what then? – Niet the Dark Absol Jan 22 '13 at 14:41

score 0 · Answer 3 · answered Jan 22 '13 at 16:30

This regex captures the text between HTML tags:

<(?<tag>div).*?>(?<text>.*)</\k<tag>>

Breakdown:

<(?<tag>div).*?> : the opening div tag; this group is named "tag"
(?<text>.*) : the text captured between the tags; this group is named "text"
</\k<tag>> : the closing div tag, a back reference to the group named "tag"

Finally, the capture yields two groups, "tag" and "text"; the content you want is in "text".
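A sketch of the same idea in Python, which spells named groups as (?P<name>...) and named back references as (?P=name) instead of (?<name>...) and \k<name>. The sample strings are invented and `re.DOTALL` is an added assumption for multi-line content. Note that the greedy .* in the "text" group runs to the last closing tag in the input, not the nearest one.

```python
import re

# Python syntax for the named group and the named back reference.
TAG_TEXT = re.compile(r"<(?P<tag>div).*?>(?P<text>.*)</(?P=tag)>", re.DOTALL)

single = TAG_TEXT.search('<div class="main-text">Hello</div>')
print(single.group("tag"), single.group("text"))

# With two divs, the greedy "text" group swallows everything up to the last </div>.
double = TAG_TEXT.search('<div class="main-text">Hello</div><div>Other</div>')
print(double.group("text"))
```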
