
I want to write a regex in Java that returns everything between the `<body>` tags, but excludes the tags themselves. I have this code:

Pattern pattern = Pattern.compile("(?s)<body(\\s|\\S)*>(\\s|\\S)*</body>");
Matcher matcher = pattern.matcher(str);
matcher.find();
System.out.println(matcher.group(0));

with this in my `str` variable:

<html>
  <head>

    <meta http-equiv="content-type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p>dpioaushd iuashdiu ashd</p>
    <p> has</p>
    <p>ud ashuod sh</p>
    <p>odu sad ha</p>
    <p>suod sh</p>
    <p>od uashod uahd<br>
    </p>
    <div class="moz-signature">-- <br>
      <img src="cid:part1.8C289150.C3F89C42@wssim.com.br" border="0"></div>
  </body>
</html>

This is what I get back:

   <body text="#000000" bgcolor="#FFFFFF">
        <p>dpioaushd iuashdiu ashd</p>
        <p> has</p>
        <p>ud ashuod sh</p>
        <p>odu sad ha</p>
        <p>suod sh</p>
        <p>od uashod uahd<br>
        </p>
        <div class="moz-signature">-- <br>
          <img src="cid:part1.8C289150.C3F89C42@wssim.com.br" border="0"></div>
      </body>

But I want this output:

    <p>dpioaushd iuashdiu ashd</p>
    <p> has</p>
    <p>ud ashuod sh</p>
    <p>odu sad ha</p>
    <p>suod sh</p>
    <p>od uashod uahd<br>
    </p>
    <div class="moz-signature">-- <br>
      <img src="cid:part1.8C289150.C3F89C42@wssim.com.br" border="0"></div>

How can I exclude the `<body>` tags from what my matcher returns?

gFontaniva
  • 2
    [This](https://stackoverflow.com/a/1732454/17300) – Stephen P Jul 13 '17 at 17:27
  • 3
    Don’t use regular expressions for that. See https://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg . – VGR Jul 13 '17 at 17:30
  • 1
    Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Aleksandr M Jul 13 '17 at 17:31
  • You picked a regex solution that doesn't work.. `-1`. You shouldn't use regex for this anyway, but if you have to, better use one that at least works. –  Jul 13 '17 at 20:10

2 Answers


This worked for me with your input string. But again, it will not work for things like a CDATA block inside the HTML, as suggested by dcsohl, nor for many other inputs, because HTML is wickedly difficult to parse with regex.

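// Group 1 = opening tag, group 2 = body content, group 4 = closing tag.
Pattern pattern = Pattern.compile("(<body .*>)((\\s|\\S)*)(</body>)");
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {  // group() throws IllegalStateException if nothing matched
    System.out.println(matcher.group(2));
}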

Output:

<p>dpioaushd iuashdiu ashd</p>
<p> has</p>
<p>ud ashuod sh</p>
<p>odu sad ha</p>
<p>suod sh</p>
<p>od uashod uahd<br>
</p>
<div class="moz-signature">-- <br>
  <img src="cid:part1.8C289150.C3F89C42@wssim.com.br" border="0"></div>
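
For what it's worth, here is a slightly tighter take on the same capture-group idea, as a minimal sketch rather than a robust solution. It assumes the opening tag has no `>` inside an attribute value and that `</body>` does not appear inside a script, comment, or CDATA block. The class and method names (`BodyExtractor`, `extractBody`) are made up for illustration. It uses a lazy `.*?` with the `(?s)` flag instead of `(\\s|\\S)*`, since a repeated alternation group like that is known to risk a `StackOverflowError` on large inputs in `java.util.regex`.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BodyExtractor {
    // (?i) ignores case, (?s) lets '.' match line breaks.
    // [^>]* skips the attributes of the opening tag;
    // (.*?) lazily captures everything up to the first closing body tag.
    private static final Pattern BODY =
            Pattern.compile("(?is)<body\\b[^>]*>(.*?)</body\\s*>");

    /** Returns the text between the body tags, or empty if no body element is found. */
    public static Optional<String> extractBody(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><body text=\"#000000\">\n  <p>hello</p>\n</body></html>";
        System.out.println(extractBody(html).orElse("(no body found)"));
    }
}
```

Returning an `Optional` makes the "no match" case explicit instead of letting `group()` throw.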
Piyush

If you're worried about `body` appearing inside other content such as scripts, comments, or CDATA (@dcsohl), this works.
The body content is in capture group 2.

https://regex101.com/r/eafiRS/6

"(?><(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\\s+(?>\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])?)+)?\\s*>)[\\S\\s]*?</\\1\\s*(?=>))|(?:/?(?!body)[\\w:]+\\s*/?)|(?:(?!body)[\\w:]+\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>|[\\S\\s])*?<body(?:\\s+(?>\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])?)+)?\\s*>((?:<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\\s+(?>\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])?)+)?\\s*>)[\\S\\s]*?</\\3\\s*(?=>))|(?:/?(?!body)[\\w:]+\\s*/?)|(?:(?!body)[\\w:]+\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>|[\\S\\s])*)</body\\s*>"

Expanded

 (?>                           # Before body
      <
      (?:
           (?:
                (?:
                                                   # Invisible content; end tag req'd
                     (                             # (1 start)
                          script
                       |  style
                       |  object
                       |  embed
                       |  applet
                       |  noframes
                       |  noscript
                       |  noembed 
                     )                             # (1 end)
                     (?:
                          \s+ 
                          (?>
                               " [\S\s]*? "
                            |  ' [\S\s]*? '
                            |  (?:
                                    (?! /> )
                                    [^>] 
                               )?
                          )+
                     )?
                     \s* >
                )

                [\S\s]*? </ \1 \s* 
                (?= > )
           )

        |  (?:
                /? 
                (?! body )
                [\w:]+ \s* /? 
           )
        |  (?:
                (?! body )
                [\w:]+ 
                \s+ 
                (?:
                     " [\S\s]*? " 
                  |  ' [\S\s]*? ' 
                  |  [^>]? 
                )+
                \s* /?
           )
        |  \? [\S\s]*? \?
        |  (?:
                !
                (?:
                     (?: DOCTYPE [\S\s]*? )
                  |  (?: \[CDATA\[ [\S\s]*? \]\] )
                  |  (?: -- [\S\s]*? -- )
                  |  (?: ATTLIST [\S\s]*? )
                  |  (?: ENTITY [\S\s]*? )
                  |  (?: ELEMENT [\S\s]*? )
                )
           )
      )
      >
   |  
      [\S\s] 

 )*?


 < body                        # Open body
 (?:
      \s+ 
      (?>
           " [\S\s]*? "
        |  ' [\S\s]*? '
        |  (?:
                (?! /> )
                [^>] 
           )?
      )+
 )?
 \s* >

 (                             # (2 start), Body content
      (?:
           <
           (?:
                (?:
                     (?:
                          # Invisible content; end tag req'd
                          (                             # (3 start)
                               script
                            |  style
                            |  object
                            |  embed
                            |  applet
                            |  noframes
                            |  noscript
                            |  noembed 
                          )                             # (3 end)
                          (?:
                               \s+ 
                               (?>
                                    " [\S\s]*? "
                                 |  ' [\S\s]*? '
                                 |  (?:
                                         (?! /> )
                                         [^>] 
                                    )?
                               )+
                          )?
                          \s* >
                     )

                     [\S\s]*? </ \3 \s* 
                     (?= > )
                )

             |  (?:
                     /? 
                     (?! body )
                     [\w:]+ \s* /? 
                )
             |  (?:
                     (?! body )
                     [\w:]+ 
                     \s+ 
                     (?:
                          " [\S\s]*? " 
                       |  ' [\S\s]*? ' 
                       |  [^>]? 
                     )+
                     \s* /?
                )
             |  \? [\S\s]*? \?
             |  (?:
                     !
                     (?:
                          (?: DOCTYPE [\S\s]*? )
                       |  (?: \[CDATA\[ [\S\s]*? \]\] )
                       |  (?: -- [\S\s]*? -- )
                       |  (?: ATTLIST [\S\s]*? )
                       |  (?: ENTITY [\S\s]*? )
                       |  (?: ELEMENT [\S\s]*? )
                     )
                )
           )
           >
        |  
           [\S\s] 

      )*
 )                             # (2 end)
 < / body \s* >                # Close body
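
The core idea of the expanded pattern is to consume "invisible" elements (script, style, and so on) and comments as whole units, using a backreference to the opening tag name, so that a `<body>` or `</body>` buried inside them is never mistaken for the real one. Below is a much-reduced sketch of that idea. It is not the pattern above: it swaps the single large regex for a small tokenizing pattern driven by a `find()` loop, handles only `script`, `style`, and comments, and assumes attribute values contain no `>`. The names `BodyTokenizer` and `extractBody` are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BodyTokenizer {
    // Alternatives, tried in order at each position:
    //   1. a whole script/style element (group 1 = tag name, reused via \1)
    //   2. a whole comment
    //   3. an opening body tag (group 2)
    //   4. a closing body tag  (group 3)
    private static final Pattern TOKEN = Pattern.compile(
            "(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>"
            + "|<!--.*?-->"
            + "|(<body\\b[^>]*>)"
            + "|(</body\\s*>)");

    /** Returns the content between the real body tags, skipping opaque elements. */
    public static Optional<String> extractBody(String html) {
        int start = -1;                       // index just past the opening body tag
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(2) != null && start < 0) {
                start = m.end();              // first real opening tag
            } else if (m.group(3) != null && start >= 0) {
                return Optional.of(html.substring(start, m.start()));
            }
            // Otherwise a script, style, or comment was consumed: ignore it.
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><head><script>var s = '<body>x</body>';</script></head>"
                + "<body a=\"1\"><p>hi</p><!-- </body> --><p>bye</p></body></html>";
        System.out.println(extractBody(html).orElse("(no body found)"));
    }
}
```

Because the loop advances one token at a time, it avoids the deep backtracking a single nested pattern can run into, at the cost of covering far fewer constructs than the full regex.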
  • Adding a comment (immediately after ``) caused Group1 to capture the wrong thing. https://regex101.com/r/eafiRS/5 – Stephen P Jul 13 '17 at 18:56
  • @StephenP - Ahh, yes .. Well, it's longer then https://regex101.com/r/eafiRS/6. Must get from beginning as well. –  Jul 13 '17 at 19:11