Extracting content of HTML tag with specific attribute

Question

Using regular expressions, I need to extract the multiline content of a tag that has a specific id value. How can I do this?

This is what I currently have:

<div(.|\n)*?id="${value}"(.|\n)*?>(.|\n)*?<\/div>

The problem with this is this sample:

<div id="1">test</div><div id="2">test</div>

If I try to match the div with id="2" using this regexp (with ${value} = 2), the whole string gets matched. This is because the (.|\n)*? between the tag opening and the id can run past the end of the first tag, so the match starts at the first <div and keeps going until id="2" is found, which is wrong.
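To make the over-match concrete, here is a minimal sketch that reproduces it. It assumes a JavaScript environment (the question does not name a language; the `${value}` placeholder only suggests one), and the variable names are illustrative.

```javascript
// Assumption: JavaScript; `value` stands in for the ${value} placeholder.
const value = 2;

// The pattern from the question, built as a string so the id can be inserted.
const pattern = new RegExp(
  '<div(.|\\n)*?id="' + value + '"(.|\\n)*?>(.|\\n)*?<\\/div>'
);

const sample = '<div id="1">test</div><div id="2">test</div>';
const match = sample.match(pattern);

// The match starts at the first <div and runs to the end of the second one.
console.log(match[0] === sample);
```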

How can I do this?

See [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). And `(.|\n)*?` is something that is most likely to cause a huge slowdown. — Wiktor Stribiżew, Jun 09 '17 at 20:11
Why are you using a regular expression for this, instead of using functions like `document.getElementById()`? — Barmar, Jun 09 '17 at 20:13
Because it's not exactly correct HTML, but instead some internal templating engine, which I can't parse with an HTML parser. — khernik, Jun 09 '17 at 20:14
`${value}` is any numeric value that I want to find as the id attribute — khernik, Jun 09 '17 at 20:15
Use a DOM parser... don't use regex for this task. A DOM parser *can* handle your invalid HTML in some cases. — Brad, Jun 09 '17 at 20:28

score 1 · Answer 1 · 2017-06-09T20:48:27.213

A fairly simple way is to use

Raw: <div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)</div>

Delimited: /<div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)<\/div>/

Use the variable in place of 2.

The content will be in group 1.
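As a rough sketch of how the raw pattern might be used with a variable id, assuming JavaScript (the thread does not state the language): the function name `extractDivContent` and the sample string are hypothetical, and the numeric check relies on the asker's comment that the id is always a number.

```javascript
// Hypothetical helper: returns the content of the div with the given id, or null.
function extractDivContent(source, id) {
  // Assumption: ids are numeric, so nothing else is allowed into the pattern.
  if (!/^\d+$/.test(String(id))) {
    throw new Error("id must be numeric");
  }
  // Same pattern as above, with the id inserted in place of 2.
  const pattern = new RegExp(
    '<div(?=\\s)[^>]*?\\sid="' + id + '"[^>]*?>([\\S\\s]*?)</div>'
  );
  const match = pattern.exec(source);
  return match ? match[1] : null; // group 1 holds the content
}

const sample = '<div id="1">test one</div><div id="2">line a\nline b</div>';
console.log(extractDivContent(sample, 2));
```

Because `[^>]` cannot cross the `>` that closes the opening tag, the id has to be found inside the same tag where the match started. The lazy `[\S\s]*?` stops at the first `</div>`, so nested divs would be cut short.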

edited Jun 09 '17 at 20:48

answered Jun 09 '17 at 20:24

Added a delimited version – Jun 09 '17 at 20:48

score 0 · Answer 2 · answered Jun 09 '17 at 20:17

Change (.|\n) to [^>] so it won't match the > that ends the tag. Then it can't match across different divs.

<div\b[^>]*\bid="${value}"[^>]*>.*?<\/div>

Also, instead of using (.|\n)* to match across multiple lines, use the s modifier on the regexp, if your regex engine supports it. This makes . match any character, including newlines.
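A minimal sketch of this pattern with the s modifier, assuming a JavaScript engine that supports the `s` (dotAll) flag. The function name is hypothetical, and a capture group has been added around `.*?` (not in the pattern above) so the content can be read back directly.

```javascript
// Hypothetical helper; `value` takes the place of ${value} in the pattern.
function divContentWithDotAll(source, value) {
  // Assumption: the id is numeric, as the asker states in the comments.
  if (!/^\d+$/.test(String(value))) {
    throw new Error("value must be numeric");
  }
  // The "s" flag lets "." match newlines, replacing (.|\n).
  const pattern = new RegExp(
    '<div\\b[^>]*\\bid="' + value + '"[^>]*>(.*?)</div>',
    "s"
  );
  const match = source.match(pattern);
  return match ? match[1] : null;
}

const sample = '<div id="1">a</div><div id="2">x\ny</div>';
console.log(divContentWithDotAll(sample, 2));
```

Note that `\bid` also matches inside an attribute such as `data-id`, because `-` followed by `i` is a word boundary. If the templates use such attributes, requiring whitespace before `id` is the safer choice.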

However, using regular expressions to parse HTML is not very robust. You should use a DOM parser.

Barmar


I can't, "Because it's not exactly correct HTML, but instead some internal templating engine, which I can't parse with an HTML parser." ;) – khernik Jun 09 '17 at 20:19
