4

In other words, there can be no other occurrence of the pattern between the end of the match and the second pattern. This needs to be implemented in a single regular expression.

In my specific case I have a page of HTML and need to extract all the content between

<w-block-content><span><div>

and

</div></span></w-block-content>

where

  • the elements might have attributes
  • the HTML might be formatted or not - there might be extra white space and newlines
  • there may be other content between any of the above tags, including inner div elements within the above outer div. But you can assume each <w-block-content> element
    • contains ONLY ONE direct <span> child (i.e. it may contain other non-span children)
      • which contains ONLY ONE direct <div> child
        • which wraps the content that must be extracted
  • the match must extend all the way to the last </div> within the <span> within the <w-block-content>, even if it is unmatched with an opening <div>.
  • the solution must be pure ECMAScript-spec Regex. No JavaScript code can be used

Thus the problem stated in the question at the top.

The following regex successfully matches as long as there are NO internal </div> tags:

(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)

❌ But if there are additional </div> tags, the match ends prematurely, not including the entirety of the block.

I use [\s\S]*? to match against arbitrary content, including extra whitespace and newlines.
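
To make the failure concrete, here is a minimal sketch that runs the pattern above against a shortened input. The `shortHtml` string and the variable names are assumptions made up for illustration, not the real page. The lazy capture group stops at the first </div> it meets, because the [\s\S]*? that follows is free to swallow everything up to </span>.

```javascript
// The pattern from the question, unchanged.
const questionPattern = /(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)/;

// Hypothetical, shortened input with one inner <div> inside the wrapper <div>.
const shortHtml = [
  '<w-block-content data-x="1"><span class="tip">',
  '  <div>',
  'before<div><b>inner</b></div>after',
  '  </div>',
  '</span></w-block-content>',
].join('\n');

// Group 1 is the content the question wants to extract.
const prematureCapture = questionPattern.exec(shortHtml)[1];

// The capture ends at the FIRST </div>, so the text "after" is lost.
console.log(JSON.stringify(prematureCapture));
```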

Here is sample test data:

</tr>
          <td>
            <div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
              <w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"><span class="source-block-tooltip">
                  <div>

Další master<br><div><b>Master č. 2</b>                  </div><br>

                  </div>
                </span></w-block-content>
            </div>
          </td>
        </tr>
</tr>
          <td>
            <div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
              <w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"><span class="source-block-tooltip">
                  <div>

Další master<br><b>Master č. 2</b><br>
                  
                   </div>
                </span></w-block-content>
            </div>
          </td>
        </tr>

which I've been testing here: https://regex101.com/r/jekZhr/3

The first extracted chunk should be:


Další master<br><div><b>Master č. 2</b>                  </div><br>

                  

I know that regex is not the best tool for handling XML/HTML, but I need to know whether such a regex is possible or whether I need to change the structure of the data.

Inigo
Radek
  • pls simplify your working regex match. – SL5net Dec 14 '22 at 10:51
  • @Wiktor Stribiżew why did you remove the tags? Aren't they relevant? – Radek Dec 14 '22 at 11:39
  • Not relevant at all, `regex` is what you are asking about, not these specific entities. – Wiktor Stribiżew Dec 14 '22 at 11:41
  • you know that regex is a suboptimal tool for parsing html, don't you? you might get it working well for specific scenarios; but be prepared because slight changes in the html, valid html, will break your poor regex. – PA. Dec 14 '22 at 12:06
  • JavaScript, Python, PHP, Java...? – zer00ne Dec 14 '22 at 12:43
  • @zer00ne javascript – Radek Dec 14 '22 at 13:05
  • @PA. I know. It is mentioned in the question. But if my answer got a solution then it would always work. I think – Radek Dec 14 '22 at 13:06
  • get a real html parser and give it try. With something like cheerio, that's a piece of cake. – PA. Dec 14 '22 at 14:40
  • @PA I am using no code platform that got limited functionality. Either I find working regex or I have to update operations and data structure. – Radek Dec 14 '22 at 14:42
  • yep, real world constraints are a such a pain in the ... but, you mentioned javascript, right? – PA. Dec 14 '22 at 14:49
  • there are function available that use regex. On server side it would be java and on client it would be javascript syntax of regex. Meaning that I am not able to use javascript. – Radek Dec 14 '22 at 14:58
  • @Radek, You say " element, which contains ONLY ONE direct child". So can we assume that there is only white space between the target </div> and the </span>? – Nikkorian Dec 17 '22 at 01:41
  • @Nikkorian That text was written by me to clarify Radek's question. I inferred it from his own Regex pattern. I just now edited my edit to better reflect what I think Radek assumes. If I'm wrong, I expect Radek to correct it soon. – Inigo Dec 17 '22 at 03:43
  • @Nikkorian yes, you are right. I was thinking the same. It would work. But I found this approach after I posted this question and I wanted to know more generic answer. – Radek Dec 17 '22 at 09:20
  • I would use two steps [like this JS demo (onecompiler.com)](https://onecompiler.com/javascript/3ys67tcxu). The [first step (regex101)](https://regex101.com/r/KM0wFr/1) uses a technique similar to the [unrolled star alternation (rexegg)](https://www.rexegg.com/regex-quantifiers.html#unrolled_staralt) to improve performance. – bobble bubble Dec 17 '22 at 15:05
  • Looking at all the answers what's hilarious is how much free labor you are getting trying to meet difficult requirements for a measly 150 reputation . If the answer is used for work in a high paying job then the joke's on us – Inigo Dec 18 '22 at 14:51
  • @Radek Oh, a single regex... I throw in [this variant](https://regex101.com/r/KM0wFr/4) as comment. – bobble bubble Dec 18 '22 at 17:20

5 Answers

3

As already commented, regex isn't a general purpose tool -- in fact it's a specific tool that matches patterns in a string. Having said that here's a regex solution that will match everything after the first <div> up to </w-block-content>. From there find the last index of </div> and .slice() it.

RegExp

/(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)
[\s\S]*?
(?=<\/w-block-content>)/g

regex101

Explanation

A look behind: (?<=...) must precede the match, but will not be included in the match itself.

A look ahead: (?=...) must follow the match, but will not be included in the match itself.

Segment | Description
--- | ---
(?<=<w-block-content[\s\S]*?<div[\s\S]*?>) | Find if literal "<w-block-content", then anything, then literal "<div", then anything, then literal ">" is before whatever is matched. Do not include it in the match.
[\s\S]*? | Match anything
(?=<\/w-block-content>) | Find if literal "</w-block-content>" is after whatever is matched. Do not include it in the match.

Example

const rgx = /(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)[\s\S]*?(?=<\/w-block-content>)/g;

const str = document.querySelector("main").innerHTML;

const A = str.match(rgx)[0];

const idx = A.lastIndexOf("</div>");

const X = A.slice(0, idx);

console.log(X);
<main>
  <w-block-content id="A">
    CONTENT OF #A
    <span id="B">
      CONTENT OF #B
      <div id="C">
        <div>CONTENT OF #C</div>
        <div>CONTENT OF #C</div>
      </div>
      CONTENT OF #B
    </span>
    CONTENT OF #A
  </w-block-content>
</main>
zer00ne
  • the expected result is full content of
    . Try to insert div inside the
    – Radek Dec 14 '22 at 14:22
  • I am not able to use javascript language but javascript regex. But to use your regex and then find the last index of </div> and slice it is a great idea. – Radek Dec 14 '22 at 15:00
  • That's a harsh limitation, you'll probably be better off asking a new question with a Java tag and then emphasize the JavaScript "flavor" regex requirement. Good luck. – zer00ne Dec 14 '22 at 15:06
3

In your pattern you use [\s\S]*?, which matches any character, as few times as possible. But because you also use it between the closing tags, the capture group can stop at the first </div>, and the [\s\S]*? that follows it then expands across any later </div> tags until it reaches </span>.

If you want to extract the parts that match, and as you already have a pattern that uses a capture group "as long as there are NO internal </div> tags", you don't need any lookarounds.

You can make your pattern more specific and match the opening and closing tags with only optional whitespace chars in between.

<w-block-content[^<>]*>\s*<span[^<>]*>\s*<div[^<>]*>([^]*?)<\/div>\s*<\/span>\s*<\/w-block-content>

Explanation

  • <w-block-content[^<>]*>\s* Match the w-block-content element, where [^<>]* is a negated character class that matches optional chars other than < and >, and the \s* matches optional whitespace chars (including newlines)
  • <span[^<>]*>\s* The same for the span
  • <div[^<>]*> The same for the div
  • ([^]*?) Capture group 1, match any character including newlines, as few as possible
  • <\/div>\s*<\/span>\s*<\/w-block-content> Match the ending part, where there can be optional whitespace chars in between the closing tags.

See a regex demo.
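
As a rough illustration of how this pattern behaves, the sketch below applies it with the g flag to a small made-up input. The `strictSample` string and the variable names are hypothetical. Because only whitespace is allowed between the closing tags, the lazy group has to keep growing until it reaches the </div> that sits directly before </span>.

```javascript
// The pattern from this answer, with the g flag added so that every block is found.
const strictPattern = /<w-block-content[^<>]*>\s*<span[^<>]*>\s*<div[^<>]*>([^]*?)<\/div>\s*<\/span>\s*<\/w-block-content>/g;

// Hypothetical input: the first block contains an inner <div>, the second is on one line.
const strictSample = [
  '<w-block-content id="a"><span class="tip">',
  '  <div>',
  'one<div><b>inner</b></div><br>',
  '  </div>',
  '</span></w-block-content>',
  '<w-block-content id="b"><span><div>two</div></span></w-block-content>',
].join('\n');

// Collect capture group 1 of every match.
const strictContents = Array.from(strictSample.matchAll(strictPattern), (m) => m[1]);

console.log(strictContents.length);
console.log(JSON.stringify(strictContents));
```

The first extracted chunk keeps the inner </div>, since that tag is followed by <br> rather than by whitespace and </span>.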

See why parsing HTML with a regex is not advisable

The fourth bird
  • great, nice. Even with your explanation you think that for this particular case is regex not good to be used? – Radek Dec 17 '22 at 10:22
  • @Radek If you have the availability to use a tool/library that can parse html, I would consider using that instead of a regex, as those are specifically made for it. A regex matches patterns in text, and using it for HTML is very brittle. If the possible text to match is predictable then this may suit your needs. – The fourth bird Dec 17 '22 at 10:45
  • I am not able to use any library. That is the reason why I posted this question. Either I can use regex or I have to change architecture of the "project". From the questions received I can see that there is a solution. Now I have another question..... is the solution going to work 100% ? – Radek Dec 18 '22 at 13:37
  • In my answer I use `<\/div\s*>(?:(?!<\/div\s*>)[\s\S])*?` to force matching the last `</div>`. I noticed you use `[^<>]*>` whereas I use `[^>]*>`. Both are flawed and will "fail" under different circumstances (mine will erroneously match `` and yours will erroneously reject ``). Thinking about it the former (malformed input) is more likely than the latter (a span with such an attribute value), so I think yours is better. But what this really illustrates is why using Regex instead of an HTML parser is an imperfect hack. – Inigo Dec 18 '22 at 14:47
2

Here's the regex that worked for me, when applied to the example you provided; I've broken it out to three separate lines for visual clarity, and presumably you'd combine them back into one line or something:

(?<=<w-block-content[^>]*>\s*<span[^>]*>\s*<div[^>]*>)
[\s\S]*?
(?=<\/div>\s*<\/span>\s*<\/w-block-content>)

I don't think you need to use capture groups () in this case. If you're using a look-behind (?<=) and a look-ahead (?=) for your boundaries-finding (both of which are non-capturing), then you can just let the entire match be the content that you want to find.

I added this answer because I didn't see the other answers using [^>] (= negated character class) to allow the tag strings to be open-ended in accepting additional attributes without entirely skipping any enforcement of tag closure, which I think is a cleaner and safer approach.

I'm admittedly not a JavaScript guy here, so: today I learned that JavaScript regex-matching only gained single-line mode (the s flag, also called dotAll) in ES2018, so on older engines you have to do those [\s\S] things as a work-around, instead of just using a dot. What a pain that must be for you JavaScript folks... sorry.
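
For what it's worth, here is a small sketch of the lookaround pattern in use. It assumes an engine that supports ES2018 features (lookbehind and the s flag); the `lookaroundPage` string and the variable names are made up for illustration. It also shows that, on such an engine, a dot with the s flag gives the same result as [\s\S].

```javascript
// The three lines above joined into one pattern, using [\s\S] for "any character".
const withCharClass = /(?<=<w-block-content[^>]*>\s*<span[^>]*>\s*<div[^>]*>)[\s\S]*?(?=<\/div>\s*<\/span>\s*<\/w-block-content>)/g;

// Same pattern, but with a dot plus the s (dotAll) flag. Assumes ES2018 or later.
const withDotAll = /(?<=<w-block-content[^>]*>\s*<span[^>]*>\s*<div[^>]*>).*?(?=<\/div>\s*<\/span>\s*<\/w-block-content>)/gs;

// Hypothetical input with an inner <div> inside the wrapper <div>.
const lookaroundPage = '<w-block-content id="a"><span>\n<div>x<div>y</div>z\n</div>\n</span></w-block-content>';

// With lookarounds, the whole match (not a capture group) is the content.
const viaCharClass = lookaroundPage.match(withCharClass);
const viaDotAll = lookaroundPage.match(withDotAll);

console.log(JSON.stringify(viaCharClass));
console.log(JSON.stringify(viaDotAll));
```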

mlibby
  • looks good. It seems that it does what I need. Could you please explain why it uses the last </div>, as I want, and not the first one? – Radek Dec 16 '22 at 18:33
2

The following solution assumes that there can only be whitespace and/or newlines between the target </div> and the </span>, which follows from the OP's statement that the <span> only has one direct child and this is the wrapper <div> whose contents we are seeking:

/(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>((?!<\/div>))*[\s]+<\/span>[\s]*?<\/w-block-content>)/gm

https://regex101.com/r/sn0frx/1

EDIT: explanation. This is essentially the OP's regex with the following changes:

  1. A negative lookahead ((?!<\/div>))* is inserted after the pattern's <\/div> to ignore any earlier </div>s.
  2. The OP's character class that now follows this insertion has had the \S removed, so it is now [\s]+, based on the assumption stated above.
  3. Similarly, the same edit has been made to the character class following the <\/span>, based on the assumption that the </span> we are seeking is the one immediately preceding the </w-block-content>, whitespace and newlines notwithstanding, as indicated in the question.
Nikkorian
  • Here's the problem with using lookaheads: That part of the input will have to be scanned all over again for successive matches. – Inigo Dec 17 '22 at 14:49
2

Pure regex solution that accepts trickier input than the sample data provided in the question.

The code and data snippet at the bottom includes such tricky input. For example, it includes additional (unexpected) non-whitespace within the matching elements that are not part of the extracted data, HTML comments in this case.

I inferred this as a requirement from the original regex provided in the question.

None of the other answers as of this writing can handle this input.

⚠️ It also accepts some illegal input, but that's what you get by requiring the use of regular expressions and disallowing a true HTML parser.

On the other hand, an HTML parser will make it difficult to handle the malformed HTML in the sample input given in the question. A conforming parser will handle such "tag soup" by forcibly matching the stray </div> tag to an open div further up the tree, prematurely closing any intervening parent elements along the way. So not only will it use the first rather than the last </div> within the data record, it may also close container elements higher up and wreak havoc on how the rest of the file is parsed.

The regex

/<w-block-content[^>]*>[\s\S]*?<span[^>]*>[\s\S]*?<div[^>]*>([\s\S]*?)<\/div\s*>(?:(?!<\/div\s*>)[\s\S])*?<\/span\s*>[\s\S]*?<\/w-block-content\s*>/g

The regex meets all the requirements stated in the question:

  • It is pure RegExp. It requires no JavaScript other than the standard code needed to invoke it.
    • It can be invoked in one call via String.matchAll() (returns an iterator over all matches)
    • Or you can invoke it repeatedly to parse records one at a time via RegExp.exec(), which returns successive matches on each call, keeping track of where it left off automatically. See test code below.
    • Regex grouping is used so that the entire outer "record" is parsed and consumed but the "data" within is still available separately. Otherwise parsing successive records would require additional JavaScript code to set the pointer to the end of the record before the next parse. That would not only go against the requirements but would also result in redundant and inefficient parsing.
      • The full record is available as group 0 of each match
      • The data within is available as group 1 of each match
  • It handles all legal extra whitespace within tags
  • It handles both whitespace and legal non-whitespace between elements (explained above).
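
The points above about String.matchAll() and groups 0 and 1 can be sketched as follows. This is a minimal illustration, not the full test further down: the `twoRecords` input and the variable names are hypothetical, and matchAll() is assumed to be available (ES2020 or later).

```javascript
// The regex from this answer (g flag is required by matchAll).
const recordPattern = /<w-block-content[^>]*>[\s\S]*?<span[^>]*>[\s\S]*?<div[^>]*>([\s\S]*?)<\/div\s*>(?:(?!<\/div\s*>)[\s\S])*?<\/span\s*>[\s\S]*?<\/w-block-content\s*>/g;

// Hypothetical input: a simple record, then one with a comment, an inner div,
// and a newline inside the closing span tag.
const twoRecords =
  '<w-block-content id="a"><span><div>first</div></span></w-block-content>\n' +
  '<w-block-content id="b"><!-- note --><span>\n<div>second <div>inner</div> tail</div>\n</span\n></w-block-content>';

// matchAll() returns an iterator; Array.from turns it into an array.
// m[0] is the whole record (group 0), m[1] is the data inside it (group 1).
const records = Array.from(twoRecords.matchAll(recordPattern), (m) => ({
  record: m[0],
  data: m[1],
}));

for (const r of records) {
  console.log(JSON.stringify(r.data));
}
```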

In addition:

The regex explained

/
<w-block-content[^>]*> opening w-block-content "record" tag with arbitrary attributes and whitespace
[\s\S]*? arbitrary whitespace and non-whitespace within w-block-content before span
<span[^>]*> expected nested span with arbitrary attributes and whitespace
[\s\S]*? arbitrary whitespace and non-whitespace within span before div
<div[^>]*> expected nested div with arbitrary attributes and whitespace. This div wraps the data.
([\s\S]*?) the data
<\/div\s*> the closing div tag with arbitrary legal whitespace.
(?:(?!<\/div\s*>)[\s\S])*? arbitrary whitespace and non-whitespace within span after div
except that it guarantees that </div> matched by the preceding pattern is the last one within the span element.
<\/span\s*> the closing span tag with arbitrary legal whitespace.
[\s\S]*? arbitrary whitespace and non-whitespace within w-block-content after span
<\/w-block-content\s*> the closing w-block-content tag with arbitrary legal whitespace.
/g global flag that enables extracting multiple matches from the input. Affects how String.matchAll and RegExp.exec work.
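
To isolate what the negative lookahead contributes, here is a reduced sketch that compares a plain lazy gap with the lookahead-guarded gap used above. Both patterns are cut down to the div/span part only, and the `fragment` string and variable names are assumptions for illustration.

```javascript
// Plain lazy gap between </div> and </span>: it may run across later </div> tags.
const looseGap = /<div>([\s\S]*?)<\/div>[\s\S]*?<\/span>/;

// Guarded gap: every character consumed after </div> must NOT start another </div>,
// so the </div> that ends group 1 has to be the last one before </span>.
const guardedGap = /<div>([\s\S]*?)<\/div\s*>(?:(?!<\/div\s*>)[\s\S])*?<\/span\s*>/;

// Hypothetical fragment: wrapper div with an inner div, then a comment before </span>.
const fragment = '<span><div>a<div>b</div>c</div> <!-- x --></span>';

const looseData = looseGap.exec(fragment)[1];
const guardedData = guardedGap.exec(fragment)[1];

console.log(JSON.stringify(looseData));   // stops at the first </div>
console.log(JSON.stringify(guardedData)); // extends to the last </div>
```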

Tricky Test Data and Example Usage/Test Code

'use strict'
const input = `<tr>
          <td>
            <div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
              <w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112">
                <span class="source-block-tooltip">
                  <div>SIMPLE CASE DATA STARTS HERE

Další master<br><b>Master č. 2</b><br>

                  SIMPLE CASE DATA ENDS HERE</div>
                </span>
              </w-block-content>
            </div>
          </td>
</tr><tr>
          <td>
            <div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
              <w-block-content class="tricky" 
                   data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"  >
                       <!-- TRICKY: whitespace within expected tags above and below,
                        and also this comment inserted between the tags -->
                <span class="source-block-tooltip"
                      color="burgandy"
                      > <!-- TRICKY: some more non-whitespace
                       between expected tags --> 
                  <div
                     >TRICKY CASE DATA STARTS HERE
                     <div> TRICKY inner div

Další master<br><b>Master č. 2</b><br>
                     </div>
                     TRICKY unmatched closing div tags
                     </div> Per the requirements, THIS closing div tag should be ignored and
                     the one below (the last one before the closing outer tags) should be 
                     treated as the closing tag.
                  TRICKY CASE DATA ENDS HERE</div> TRICKY closing tags can have whitespace including newlines
                  <!-- TRICKY more stuff between closing tags -->
                </span
                   >
                <!-- TRICKY more stuff between closing tags -->
              </w-block-content
                 >
            </div>
          </td>
</tr>
`

const regex = /<w-block-content[^>]*>[\s\S]*?<span[^>]*>[\s\S]*?<div[^>]*>([\s\S]*?)<\/div\s*>((?:(?!<\/div\s*>)[\s\S])*?)<\/span\s*>[\s\S]*?<\/w-block-content\s*>/g

function extractNextRecord() {
    const match = regex.exec(input)
    if (match) {
        return {record: match[0], data: match[1]}
    } else {
        return null
    }
}

let output = '', result, count = 0
while (result = extractNextRecord()) {
    count++
    console.log(`-------------------- RECORD ${count} -----------------------\n${result.record}\n---------------------------------------------------\n\n`)    
    output += `<hr><pre>${result.data.replaceAll('<', '&lt;')}</pre>`
}
output += '<hr>'
output = `<p>Extracted ${count} records:</p>` + output
document.documentElement.innerHTML = output
Inigo
  • Such a great answer. Answered and explained in such a depth. Thank you so much for that. You are right about illegal or bad formatted input. It could happen as I do not have any control of the HTML code I want to extract. I do have around the other structure but not the text I need to extract. – Radek Dec 18 '22 at 13:41
  • ok, I'm relieved to know that my edits to your question were accurate to your intent/requirements. I recommend that you update the sample input in the question to one that has more "stuff". You are welcome to copy the one from my answer if looks like the right test to you. Though let me know as I'd have to change the title of my answer – Inigo Dec 18 '22 at 14:10