
Given the following example code:

bla bla 
<div class="a">
    <div class="b">beta</div> 
    bla bla bla 
    <div class="c">charlie</div> 
    <b>bold</b> 
    etc ... 
</div>

How do I extract the content of the tag <div class="a">? Please note that an unknown number of similar tags may be nested inside the parent tag. A simple regex like:

<div class="a">(.*?)</div> 

does not work because it will return:

<div class="b">beta

instead of the actual contents of the tag.

The regex should somehow count the number of opening and closing div tags to determine where to stop. I am not sure this is even possible in regex, hence my question.

Update: My question is not about how to extract a tag's data by regex in general. My question is how to make sure all of the tag's content is extracted (like an HTML parser).

Miguel-F
Nebu
  • Maybe you want to use an HTML parser instead. See also this [answer](http://stackoverflow.com/a/590789/3895469). – oddRaven May 17 '17 at 09:27
  • Possible duplicate of [RegEx match content inside div with specific class](http://stackoverflow.com/questions/22743495/regex-match-content-inside-div-with-specific-class) – Mistalis May 17 '17 at 09:27
  • @oddRaven This would probably be the best option. Unfortunately, ColdFusion 9 does not include an HTML parser. – Nebu May 17 '17 at 09:29
  • @Mistalis My question has little similarity with this other question. – Nebu May 17 '17 at 09:31
  • I think you need a recursive regex for what you want. Does ColdFusion support it? – Gawil May 17 '17 at 09:31
  • @Gawil ColdFusion is based on Java. As far as I know, Java does not support this, but please correct me if I am wrong. – Nebu May 17 '17 at 09:36
  • Wait, the `div`s are nested here. You cannot use a regex then. – Wiktor Stribiżew May 17 '17 at 09:36
  • You're right, Java does not support recursion... So ColdFusion shouldn't either, I guess. I can't see any other way to do what you want with regex, sorry... – Gawil May 17 '17 at 09:37
  • I hope [this article](https://www.bennadel.com/blog/779-parsing-html-tag-data-into-a-coldfusion-structure.htm) will be of help. – Wiktor Stribiżew May 17 '17 at 09:45
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Jan May 17 '17 at 09:54
  • @WiktorStribiżew Thanks for pointing that article out. I actually read this article before posting my question here. The regex used does not account for nested similar tags. – Nebu May 17 '17 at 10:00
  • _(like an HTML parser)_ Well, just use an HTML parser. That's what they are designed for. A good open-source one is [jSoup](https://jsoup.org/). And here is an article from Ben Nadel on using it with ColdFusion - [Parsing, Traversing, And Mutating HTML With ColdFusion And jSoup](https://www.bennadel.com/blog/2358-parsing-traversing-and-mutating-html-with-coldfusion-and-jsoup.htm). – Miguel-F May 17 '17 at 15:26
  • I also highly recommend jsoup over regex when it comes to dealing with HTML... (I'm referenced in Ben Nadel's article.) It will auto-correct/normalize incorrectly nested HTML and you have a lot more control over it. You can remove sub-elements, identify images/URLs/headers/anything, remove styles, add classes, sanitize, inject HTML blocks, etc. Finding "div.a" in an HTML document is similar to jQuery: "fragment = myHTML.select('div.a').first().toString();" – James Moberg May 17 '17 at 20:17

1 Answer


It is not possible to fully parse HTML with a standard regex; correctly matching nested tags requires extensions such as recursion.

[Using regular expressions to parse HTML: why not?](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not)

That said, you could parse the HTML yourself or use something like jSoup.

https://www.bennadel.com/blog/2358-parsing-traversing-and-mutating-html-with-coldfusion-and-jsoup.htm
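If you do want to parse the HTML yourself, the tag-counting idea from the question can be sketched in plain Java, which ColdFusion can call. This is only a rough illustration and not a real parser: the class name `DivExtractor` and the method `extractInner` are made up for this example, and it assumes the class attribute is double-quoted and holds a single class name, and that there are no self-closing divs or div-like text inside comments, scripts, or attribute values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of ColdFusion or jsoup): finds the matching
// </div> by counting nesting depth instead of relying on a single regex.
final class DivExtractor {

    // Matches any opening <div ...> or closing </div> tag.
    // Group 1 is "/" for a closing tag and empty for an opening tag.
    private static final Pattern DIV_TAG =
            Pattern.compile("<(/?)div\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the inner HTML of the first <div class="cssClass">,
     * or null if no such div exists or it is never closed.
     */
    static String extractInner(String html, String cssClass) {
        // Assumption: the attribute is written exactly as class="name".
        Pattern open = Pattern.compile(
                "<div\\b[^>]*\\bclass\\s*=\\s*\"" + Pattern.quote(cssClass) + "\"[^>]*>",
                Pattern.CASE_INSENSITIVE);

        Matcher start = open.matcher(html);
        if (!start.find()) {
            return null; // no matching opening tag
        }

        int contentStart = start.end();
        int depth = 1; // we are inside the outer div

        // Scan only the text after the outer opening tag.
        Matcher tags = DIV_TAG.matcher(html);
        tags.region(contentStart, html.length());

        while (tags.find()) {
            if (tags.group(1).isEmpty()) {
                depth++; // nested opening <div>
            } else {
                depth--; // closing </div>
            }
            if (depth == 0) {
                // This </div> closes the outer div.
                return html.substring(contentStart, tags.start());
            }
        }
        return null; // unbalanced: the outer div was never closed
    }
}
```

Calling `DivExtractor.extractInner(html, "a")` on the example from the question returns everything between the outer tags, including both nested divs. Anything messier than that (malformed nesting, unquoted attributes, multiple classes) is where a real parser such as jSoup earns its keep.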

Community
Dan Roberts
  • http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not is more or less the same question I have. I installed jSoup as you, Miguel-F and James Moberg suggested, and it works pretty well. – Nebu May 19 '17 at 12:30