0

I have the following HTML code:

<div id="page126-div" style="position:relative;width:918px;height:1188px;">
</div>

<div id="page127-div" style="position:relative;width:918px;height:1188px;">
sometext for example
</div>

<div id="page128-div" style="position:relative;width:918px;height:1188px;">
</div>

My task is to match empty divs. "Empty" in this context means that they have no content at all (no characters between the opening tag's > and the closing tag's <), or contain only a newline or a space, or contain fewer than 5 characters. So emptiness is fairly fuzzy.

If I wanted to match all divs, not only the empty ones, I would use the following regex:

\<div id="page.*?"\>.*?\<\/div\>

Naturally, I would use it with the dotall modifier.

But to match only the empty divs, I tried this expression:

\<div id="page.*?"\>.{0,5}?\<\/div\>

I expect to get the first and the last (third) divs, because each consists of an opening div tag with attributes, then content of 0 to 5 characters, then a closing div tag. The first match is right, but the second match is the second and third divs stacked together instead of the third div only. I do not understand why.
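
The behaviour described above can be reproduced with a short sketch. It assumes Python's `re` module with `re.DOTALL` as the dotall modifier (the question does not name a language), and the names `html`, `pattern` and `matches` are only illustrative. The backslashes before `<`, `>` and `/` are dropped because they are not needed in this flavour.

```python
import re

# The three divs from the question, copied verbatim.
html = '''<div id="page126-div" style="position:relative;width:918px;height:1188px;">
</div>

<div id="page127-div" style="position:relative;width:918px;height:1188px;">
sometext for example
</div>

<div id="page128-div" style="position:relative;width:918px;height:1188px;">
</div>
'''

# Second expression from the question, with dotall enabled.
pattern = re.compile(r'<div id="page.*?">.{0,5}?</div>', re.DOTALL)

matches = [m.group(0) for m in pattern.finditer(html)]
for number, text in enumerate(matches, start=1):
    print(number, repr(text))
```

In this sketch the second match begins at the opening tag of `page127-div`. The sequence `">` only occurs at the end of each style attribute, so once `.{0,5}?` fails on the long content of the second div, the lazy `.*?` keeps expanding (dotall lets it cross newlines and tags) until it reaches the `">` that closes the third div's opening tag.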

Sergey Kravchenko
  • 1
    Use a parser? [H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) Also, you didn't specify a language. – ctwheels Feb 22 '18 at 18:34
  • Between here '
    ' is a style attribute in the source. So, it really never matches.
    –  Feb 22 '18 at 19:12
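
Following up on the parser suggestion in the comments, here is a minimal sketch that uses Python's standard `html.parser` module instead of a regex. The class name `EmptyDivFinder`, the `max_chars` threshold and the decision to strip surrounding whitespace before counting are all assumptions made for illustration; the thread itself does not specify a language or a parser.

```python
from html.parser import HTMLParser


class EmptyDivFinder(HTMLParser):
    """Collect ids of <div id="page..."> elements whose text is (nearly) empty."""

    def __init__(self, max_chars=5):
        super().__init__()
        self.max_chars = max_chars  # assumed threshold, mirrors {0,5} in the question
        self.stack = []             # one [id, text_parts] entry per currently open div
        self.empty_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.stack.append([dict(attrs).get("id"), []])

    def handle_data(self, data):
        # Text is attributed to the innermost open div, if any.
        if self.stack:
            self.stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if tag != "div" or not self.stack:
            return
        div_id, parts = self.stack.pop()
        text = "".join(parts)
        if self.stack:
            # Text of a nested div also counts as content of its parent.
            self.stack[-1][1].append(text)
        if div_id and div_id.startswith("page") and len(text.strip()) <= self.max_chars:
            self.empty_ids.append(div_id)


html = '''<div id="page126-div" style="position:relative;width:918px;height:1188px;">
</div>

<div id="page127-div" style="position:relative;width:918px;height:1188px;">
sometext for example
</div>

<div id="page128-div" style="position:relative;width:918px;height:1188px;">
</div>
'''

finder = EmptyDivFinder()
finder.feed(html)
finder.close()
print(finder.empty_ids)
```

Note that this sketch only looks at text: a div that contains nothing but child elements without text (an image, for instance) would also be reported as empty.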

2 Answers

1

This regex is pretty straightforward:

<div id=\"[^"]+?\" style=[^>]+?>(\s|\n|[^\n]{0,5})<\/div>

Note that it does not require any particular id or style values.
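
As a quick check, the pattern above can be run against the markup from the question. This is only a sketch: it assumes Python's `re` module (the thread does not name a language), writes the quantifier with an explicit lower bound as `{0,5}`, and omits the backslashes before `"` and `/`, which are unnecessary in a Python raw string. The variable names are illustrative.

```python
import re

html = '''<div id="page126-div" style="position:relative;width:918px;height:1188px;">
</div>

<div id="page127-div" style="position:relative;width:918px;height:1188px;">
sometext for example
</div>

<div id="page128-div" style="position:relative;width:918px;height:1188px;">
</div>
'''

# [^"] and [^>] keep the opening tag from running past its own closing bracket,
# so a failed attempt on one div cannot spill over into the next one.
pattern = re.compile(r'<div id="[^"]+?" style=[^>]+?>(\s|\n|[^\n]{0,5})</div>')

# The id is the text between the first pair of double quotes of each match.
ids = [m.group(0).split('"')[1] for m in pattern.finditer(html)]
print(ids)

# Content that mixes newlines with other characters is not matched by this pattern.
mixed = '<div id="pageX" style="a">\nab\n</div>'
print(pattern.search(mixed))
```

The last line illustrates a limitation: content such as a newline, two letters and another newline is short, but none of the three alternatives in the group covers it, so whether the pattern is "fuzzy" enough depends on the real input.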

GalAbra
0

You can give this a try.

Scraper Series

/(?><div(?=(?:[^>"']|"[^"]*"|'[^']*')*?\sid\s*=\s*(?:(['"])\s*page(?:(?!\1)[\S\s])*\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+>)\s*[\S\s]{0,5}\s*<\/div\s*>/

https://regex101.com/r/x8jf8D/1

Formatted

 (?>
      < div                  # div tag

      (?=                    # Assertion (a pseudo atomic group)
           (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
           \s id \s* = \s* 
           (?:
                ( ['"] )               # (1), Quote
                \s* page               # With id = "page XXX"
                (?:
                     (?! \1 )
                     [\S\s] 
                )*
                \1 
           )
      )
      \s+      
      (?:
           " [\S\s]*? "
        |  ' [\S\s]*? '
        |  (?:
                (?! /> )
                [^>] 
           )?
      )+
      >
 )

 \s*                    # Optional whitespaces (remove if necessary)
 [\S\s]{0,5}            # Optional 0-5 of anything (including wsp)
 \s*                    # Optional whitespaces  (remove if necessary)

 </div \s* >
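
For readers who want to try the idea in code, below is a simplified adaptation of the formatted pattern, not the answer's exact regex. It assumes Python's `re` module in `re.VERBOSE` mode, drops the atomic group `(?> ... )` (Python's `re` only supports it from version 3.11), simplifies the attribute alternation, and adds a second capture group around the id value purely so the result can be printed. The names `EMPTY_PAGE_DIV` and `ids` are hypothetical.

```python
import re

EMPTY_PAGE_DIV = re.compile(r"""
    <div
    (?=                                    # lookahead: the tag has an id starting with "page"
        (?: [^>"'] | "[^"]*" | '[^']*' )*?
        \s id \s* = \s*
        (['"])                             # (1) opening quote
        \s* ( page (?: (?!\1) [\S\s] )* )  # (2) id value, added for this sketch
        \1
    )
    \s+
    (?: "[^"]*" | '[^']*' | [^>"'] )+      # attributes, quoted or unquoted
    >
    \s* [\S\s]{0,5} \s*                    # whitespace around at most 5 characters
    </div \s* >
""", re.VERBOSE)

html = '''<div id="page126-div" style="position:relative;width:918px;height:1188px;">
</div>

<div id="page127-div" style="position:relative;width:918px;height:1188px;">
sometext for example
</div>

<div id="page128-div" style="position:relative;width:918px;height:1188px;">
</div>
'''

ids = [m.group(2) for m in EMPTY_PAGE_DIV.finditer(html)]
print(ids)
```

Because every alternative for the tag body excludes an unquoted `>`, a failed attempt on the second div cannot extend into the third one, which is the same safeguard the original pattern provides.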