I have to parse an HTML text and strip out all iframe, script and input elements, plus the width attributes of table, tr and td elements. Finally, I have to look for tr elements without a td nested inside.
My regular expression is something like this:
<tr>[^<td>]*<\/tr>|<script[^<]*>.*[\s\S]*<\/script>|
<iframe[^<]*>.*[\s\S]*<\/iframe>|
<(?:table|td|tr)[^<>]+style\s*=\s*(?:"|').*width(?:=|\:)\w*\W?(?:"|')|<(?:table|td|tr) [^<>]+width\s*(?:=|:)\s*(?:"|')?\w*(?:"|')?
The first alternative looks for a TR without a nested TD, the next two look for script and iframe elements, and the last two look for TABLE|TD|TR with a style attribute containing a width declaration, or TABLE|TD|TR with the width attribute set directly.
My problem:
I am using the following JavaScript code, generated by regex101.com:
while ((m = re.exec(st)) != null) {
    if (m.index === re.lastIndex) {
        re.lastIndex++;
    }
    if (m.search(...)) {} else if (m.search(...)) {} else ...
}
The problem is that inside the if statements I have to know which pattern was found: was it the TR without a nested TD, the iframe, or the width attribute? How can I optimize the code so that I don't have to use this kind of logic? With capturing groups?
The text being parsed comes from a textarea field on a web page, so sometimes it is just normal text; the problem arises when users unknowingly copy and paste HTML code into the textarea.
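To illustrate what I mean by capturing groups, here is a minimal sketch of the idea. The sub-patterns are simplified stand-ins for my real expressions, and the names parts, classify, emptyRow, script, iframe and width are made up for this example. It also assumes an engine that supports named capture groups (ES2018), which may not hold for every browser I need to support.

```javascript
// Simplified stand-ins for the real alternatives (assumption: not the
// exact expressions from the question, just enough to show the idea).
const parts = {
  emptyRow: /<tr>\s*<\/tr>/,                        // TR with nothing nested inside
  script: /<script\b[^>]*>[\s\S]*?<\/script>/,      // whole script element
  iframe: /<iframe\b[^>]*>[\s\S]*?<\/iframe>/,      // whole iframe element
  width: /<(?:table|td|tr)\b[^<>]*\bwidth\s*[=:][^<>]*/, // width attribute or style
};

// Wrap each alternative in a named group and join them with "|".
const re = new RegExp(
  Object.entries(parts)
    .map(([name, p]) => `(?<${name}>${p.source})`)
    .join("|"),
  "gi"
);

// Hypothetical helper: returns one record per match, tagged with the
// name of the alternative that produced it.
function classify(st) {
  const found = [];
  let m;
  re.lastIndex = 0; // the regex is global, so reset its state on each call
  while ((m = re.exec(st)) !== null) {
    if (m.index === re.lastIndex) {
      re.lastIndex++; // same zero-length-match guard as the regex101 code
    }
    // Exactly one named group is defined per match: the alternative that hit.
    const kind = Object.keys(m.groups).find((k) => m.groups[k] !== undefined);
    found.push({ kind, text: m[0], index: m.index });
  }
  return found;
}
```

With something like this, each result carries a kind such as width or emptyRow, so a switch on kind could replace the chain of search calls. I am not sure whether this is the best approach.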