Match or find an HTML/XML element using RegExp

I want to find an HTML or XML element, whether or not it has an id attribute.

Sample HTML file:

<p class="txt-ni">Radiation absorbed dose to the red bone marrow, a critical organ in the therapy of differentiated thyroid carcinoma with I-131 (radioiodine), cannot be measured directly. As radioiodine concentration is comparable in blood and most organs (<a href="#bib5" id="bib_5">Kolbert <em>et al</em>. 2007</a>), and is believed to be similar in red marrow (<a href="#bib9" id="bib_9">Sgouros 2005</a>), the absorbed dose to the blood seems to be a good first-order approximation of the radiation absorbed dose to the hematopoietic system and a better means to quantify exposure from therapy than the total amount of activity administered.</p>

The sample above is a single line (no line breaks) and contains two <a> tags. I want to match each <a> to </a> element separately.

Here is the RegExp I used:

<a href="#([^"]*)" id="([^"]*)">(.*)</a>

The above RegExp matches across all the <a> tags in the line. That is, it returns the following single match:

<a href="#bib5" id="bib_5">Kolbert <em>et al</em>. 2007</a>), and is believed to be similar in red marrow (<a href="#bib9" id="bib_9">Sgouros 2005</a>

But I want to match each one separately, as below:

1. <a href="#bib5" id="bib_5">Kolbert <em>et al</em>. 2007</a>

2. <a href="#bib9" id="bib_9">Sgouros 2005</a>

I hope my request is clear.

Note:

The element may also contain child elements such as <i>, <em>, and <b>.

1 Answer

Try replacing your regular expression with:

<a href="#([^"]*?)" id="([^"]*?)">(.*?)</a>

The question mark after each * makes the quantifier lazy (non-greedy), so it matches as few characters as possible. With (.*?), the match ends at the first </a> rather than at the last one on the line.
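To see the difference concretely, here is a small sketch that runs both patterns over the two links from the question's sample. It assumes Python's re module, since the question does not say which regex engine is in use; the variable names are made up for this illustration.

```python
import re

# The two links from the sample paragraph, kept on a single line.
html = (
    '(<a href="#bib5" id="bib_5">Kolbert <em>et al</em>. 2007</a>), and is '
    'believed to be similar in red marrow '
    '(<a href="#bib9" id="bib_9">Sgouros 2005</a>)'
)

# Greedy: (.*) runs to the end of the line, then backtracks to the LAST </a>.
greedy = re.compile(r'<a href="#([^"]*)" id="([^"]*)">(.*)</a>')

# Lazy: (.*?) grows one character at a time and stops at the FIRST </a>.
lazy = re.compile(r'<a href="#([^"]*)" id="([^"]*)">(.*?)</a>')

greedy_matches = [m.group(0) for m in greedy.finditer(html)]
lazy_matches = [m.group(0) for m in lazy.finditer(html)]

print(len(greedy_matches))  # one match spanning both links
for match in lazy_matches:  # one match per link
    print(match)
```

Child elements such as <em> inside the link are still captured, because the lazy group only stops at </a>. Note that [^"]* cannot run past the closing quote anyway, so the extra ? on those two groups is harmless but not required.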

You may find this page informative on the subject: http://www.regular-expressions.info/repeat.html

ps.pf
  • Like all attempts to parse XML (or HTML) using regular expressions, this is WRONG. I can see at least three bugs in it without really trying: it requires the attributes to be in a particular order, it requires whitespace between the attributes in exactly the right place, and it requires the attribute values to be enclosed in double quotes rather than single quotes – Michael Kay Aug 04 '15 at 07:42
  • Sorry about that. My answer was a simple modification to the OP's regex pattern to make it work for his use case. Of course, to make it generic would require more work :) – ps.pf Sep 02 '15 at 10:20
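
For the more general case raised in the first comment (attributes in any order, single-quoted values), one option is to use a real parser instead of a regex. Below is a minimal sketch using html.parser from Python's standard library. The choice of Python is an assumption, since the question does not name a language, and AnchorCollector and find_anchors are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical helper: collects (attributes, text) for every <a> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # finished (attrs dict, text) pairs
        self._attrs = None  # attributes of the <a> currently open, if any
        self._text = []     # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._attrs = dict(attrs)
            self._text = []

    def handle_data(self, data):
        # Text inside child elements such as <em> also arrives here.
        if self._attrs is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._attrs is not None:
            self.anchors.append((self._attrs, "".join(self._text)))
            self._attrs = None


def find_anchors(html):
    """Return a list of (attrs, text) tuples, one per <a> element."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.anchors


# Attribute order and quote style differ between the two links on purpose.
sample = (
    "<a id='bib_5' href='#bib5'>Kolbert <em>et al</em>. 2007</a> and "
    '<a href="#bib9" id="bib_9">Sgouros 2005</a>'
)
for attrs, text in find_anchors(sample):
    print(attrs.get("id"), attrs.get("href"), text)
```

This sketch returns the link text with child tags stripped rather than the raw markup, and it does not handle nested <a> elements. For strict XML input, an XML parser would be the closer fit.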