0

I am using jEdit, and I have a bunch of badly coded HTML files from which I want to extract the main content, without the surrounding HTML.

I need everything in between <div class="main-text"> and the next </div>.

There must be a way to do this with a regex; jEdit lets me find and replace with regular expressions.

I am not proficient with regex and it would take me a long time to work it out. Can anyone help quickly, please?

Chud37

3 Answers

1

Taking your question literally, you can replace:

/.*<div class="main-text">(.*?)<\/div>.*/

with \1 (or $1 depending on what your editor uses).
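As a rough illustration of that replacement, here is a minimal sketch in Python (standing in for the editor's regex engine, which may behave differently). The sample page is made up. Note the `re.DOTALL` flag: in many engines `.` does not match line breaks by default, so the pattern as written only works when the whole element sits on one line. The thread does not say why jEdit found nothing, but this is one possible cause; some engines accept an inline `(?s)` prefix for the same effect.

```python
import re

# Hypothetical sample page; the real files are not shown in the question.
page = """<html><body>
<div id="nav">menu</div>
<div class="main-text">
Hello, <b>world</b>.
</div>
<div id="footer">footer</div>
</body></html>"""

# re.DOTALL lets "." cross line breaks, so the leading and trailing ".*"
# can swallow the whole surrounding document.
pattern = re.compile(r'.*<div class="main-text">(.*?)</div>.*', re.DOTALL)

# Replace the entire match with capture group 1.
main_text = pattern.sub(r"\1", page)
print(main_text.strip())

# Without DOTALL the pattern cannot span lines, so nothing is replaced.
unchanged = re.sub(r'.*<div class="main-text">(.*?)</div>.*', r"\1", page)
```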

However, The Pony He Comes to bite you, because what if your "main-text" element contains another <div>? If you're sure this will not happen, then you're fine. Otherwise, you're in trouble. It may be easier to replace /.*<div class="main-text">/ with the empty string, then manually look for the end and delete everything after it.
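To make the nested-div problem concrete, here is a small Python sketch with a made-up input. The lazy group stops at the first `</div>` it meets, which belongs to the inner element, so the tail of the main text is lost. The second half shows the fallback described above: strip everything up to the opening tag and leave the end for manual cleanup.

```python
import re

# Hypothetical input with a <div> nested inside the main-text element.
nested = '<div class="main-text">intro <div class="note">aside</div> outro</div>'

# The lazy capture ends at the FIRST closing </div>, i.e. the inner one.
match = re.search(r'<div class="main-text">(.*?)</div>', nested, re.DOTALL)
captured = match.group(1)
print(captured)

# Fallback: remove everything up to and including the opening tag only.
# The closing </div> and anything after it still has to be removed by hand.
head_stripped = re.sub(r'.*<div class="main-text">', "", nested,
                       count=1, flags=re.DOTALL)
print(head_stripped)
```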

For that matter, this task may be easiest to do manually, so you don't have to double-check after your code has run.

Niet the Dark Absol
  • LOL what the HECK is that pony he comes stuff? – Chud37 Jan 22 '13 at 14:36
  • It is the final throes of a valiant warrior who has been asked to parse HTML with Regex one too many times. RIP, valiant warrior, your message shall live on forever. – Niet the Dark Absol Jan 22 '13 at 14:37
  • Heh :) that was pretty cool. Unfortunately jEdit returned no results from your regex. – Chud37 Jan 22 '13 at 14:40
  • And you bloody well should be. Use a proper parser. ;) – fgysin Jan 22 '13 at 15:53
  • One fundamental distinction should be made between "using regex to parse html" and "using regex to MATCH something inside some html file(s)". Depending on scope, the Pony might or might not come. – heltonbiker Jan 22 '13 at 16:38
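
For readers who want to try the "proper parser" route mentioned in the comments, the following is one possible sketch using Python's standard-library `html.parser`. It is not taken from any answer here; the class name `MainTextExtractor` and the helper `extract_main_text` are invented for illustration. It tracks div nesting depth, so an inner `<div>` does not end the capture early. Character references are decoded by the parser's default settings, so the output is not guaranteed to be byte-identical to the source markup.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collects the markup inside the first <div class="main-text">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # div nesting depth; 0 means outside the target
        self.done = False   # True once the target div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
            self.parts.append(self.get_starttag_text())
        elif not self.done and tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if "main-text" in classes:
                self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Hypothetical input with a nested div inside the target element.
sample = ('<div id="nav">x</div>'
          '<div class="main-text">intro <div class="note">aside</div> outro</div>'
          '<div>footer</div>')
result = extract_main_text(sample)
print(result)
```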
0

This regex should solve your problem: /<\s*div\s+class="main-text"[^>]*>(.*?)<\/div>/gi

Here is an example in Perl:

my $str = '<div class="main-text"> and the next </div>';
# Print the captured group only if the pattern actually matched.
print $1 if $str =~ /<\s*div\s+class="main-text"[^>]*>(.*?)<\/div>/gi;

The example is in Perl, but the regular expression itself can be used in most other languages and tools with little or no change.

Here is an explanation of the regex:

/       -start of the regex
   <\s*    -matches < and any whitespace after it
      div     -matches "div"
         \s+     -matches one or more whitespace characters after <div
         class="main-text"    -matches class="main-text" (so <div class="main-text" up to here)
         [^>]*       -matches everything except >, because the div may have more attributes
         >          -matches >, so <div class="main-text"> up to here
      (.*?)        -matches as little as possible up to the next </div> and saves it in $1
   <\/div>        -matches </div>, so the full match is <div class="main-text">( and the next )</div>
/gi       -end of the regex; g makes the match global, i makes it case insensitive
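
The same pattern can be tried outside Perl. Below is a minimal Python sketch with made-up sample strings: `re.IGNORECASE` plays the role of `i`, and `findall` covers what `g` is typically used for. The `re.DOTALL` flag is an addition that is not in the original pattern; it is there so that `.` also matches line breaks.

```python
import re

# Same pattern as above; DOTALL is an added assumption for multi-line content.
MAIN_TEXT = re.compile(
    r'<\s*div\s+class="main-text"[^>]*>(.*?)</div>',
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical inputs: the string from the Perl example, plus an upper-case
# tag with an extra attribute and a line break in the content.
samples = [
    '<div class="main-text"> and the next </div>',
    '<DIV class="main-text" id="content">line one\nline two</DIV>',
]
results = [MAIN_TEXT.search(s).group(1) for s in samples]
print(results)

# findall returns every capture in the input, similar to a global match.
all_matches = MAIN_TEXT.findall("".join(samples))
```

Note that the pattern expects `class` to be the first attribute and its value to be exactly `main-text`; a tag such as `<div id="x" class="main-text">` would not match.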
Minko Gechev
0

This regex captures the text between an opening and a closing HTML tag:

<(?<tag>div).*?>(?<text>.*)</\k<tag>>

Breakdown:

  1. <(?<tag>div).*?> : the opening div tag; the group that captures "div" is named "tag"
  2. (?<text>.*) : the text captured between the tags
  3. </\k<tag>> : the closing div tag, using a back reference to the group named "tag"

Finally, the match yields two groups, "tag" and "text"; the content you want is in "text".
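
Named-group syntax varies between engines. As a hedged sketch, the same idea in Python is written with `(?P<name>...)` and `(?P=name)` instead of `(?<name>...)` and `\k<name>`. The inputs below are invented. The second example shows a side effect of the greedy `.*` in the "text" group: when a page contains several divs, the capture runs to the last closing tag; a lazy `.*?` stops at the first one instead.

```python
import re

# Python spelling of the named groups and the back reference.
greedy = re.compile(r'<(?P<tag>div).*?>(?P<text>.*)</(?P=tag)>', re.DOTALL)
lazy = re.compile(r'<(?P<tag>div).*?>(?P<text>.*?)</(?P=tag)>', re.DOTALL)

# Hypothetical single-div input: both groups are available by name.
single = '<div class="main-text">Hello</div>'
m = greedy.search(single)
tag, text = m.group("tag"), m.group("text")
print(tag, text)

# Hypothetical input with two divs: greedy runs to the LAST </div>.
several = '<div>a</div><p>x</p><div>b</div>'
greedy_text = greedy.search(several).group("text")
lazy_text = lazy.search(several).group("text")
print(greedy_text)
print(lazy_text)
```

Also note that this pattern matches any div, not only the one with class "main-text".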

mdelpeix