Best Regular Expressions Approach

Question

I have to parse HTML text and remove all iframe, script and input elements, as well as the width attributes from table, tr and td elements. Finally, I have to look for tr elements without a td nested inside.

My regular expression is something like this:

<tr>[^<td>]*<\/tr>|<script[^<]*>.*[\s\S]*<\/script>|
<iframe[^<]*>.*[\s\S]*<\/iframe>|
 <(?:table|td|tr)[^<>]+style\s*=\s*(?:"|').*width(?:=|\:)\w*\W?(?:"|')|<(?:table|td|tr)  [^<>]+width\s*(?:=|:)\s*(?:"|')?\w*(?:"|')?

The first alternative looks for a TR without a nested TD, the second and third look for script and iframe elements, and the last one looks for TABLE|TD|TR with a style attribute containing a width, or TABLE|TD|TR directly with the width attribute.

My problem:

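I am using the following JavaScript code generated by regex101.com:

    while ((m = re.exec(st)) != null) {
        // Avoid an infinite loop on zero-length matches
        if (m.index === re.lastIndex) {
            re.lastIndex++;
        }

        // m is the match array; m[0] holds the matched text
        if (m[0].search(...)) {} else if (m[0].search(...)) {} else ...
    }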

The problem is that inside the if statements I have to know which pattern was found: was it the TR without a nested TD, the iframe, or the width attribute? How can I optimize the code so that I don't have to use this kind of logic? Capturing groups?

This text is entered in a textarea field of a web page, so sometimes it is just normal text. The problem arises when users unknowingly copy and paste HTML code into the textarea.
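
On the capturing-groups idea: one possible approach is to give each alternative its own named group, so the match object itself reports which rule fired and the chain of extra searches is unnecessary. The sketch below is only an illustration. The sub-patterns are simplified stand-ins for the expressions above, and the names `classifyMatches`, `emptyRow`, `script`, `iframe` and `widthAttr` are hypothetical. It assumes an engine with named groups (ES2018) and `String.prototype.matchAll` (ES2020).

```javascript
// Each alternative sits in its own named group, so exactly one group is
// defined per match. The sub-patterns are simplified placeholders, not the
// full expressions from the question.
const re = new RegExp(
  [
    "(?<emptyRow><tr>\\s*<\\/tr>)",                      // tr with only whitespace inside
    "(?<script><script\\b[^>]*>[\\s\\S]*?<\\/script>)",  // script element (lazy body)
    "(?<iframe><iframe\\b[^>]*>[\\s\\S]*?<\\/iframe>)",  // iframe element (lazy body)
    "(?<widthAttr><(?:table|tr|td)\\b[^<>]*\\bwidth\\s*[=:][^<>]*>)", // width attribute or style
  ].join("|"),
  "gi"
);

function classifyMatches(text) {
  const found = [];
  // matchAll works on a copy of the regex, so no manual lastIndex handling.
  for (const m of text.matchAll(re)) {
    // Pick the one named group that actually participated in this match.
    const [kind] = Object.entries(m.groups).find(([, value]) => value !== undefined);
    found.push({ kind, index: m.index, text: m[0] });
  }
  return found;
}
```

Calling `classifyMatches(st)` returns one entry per match with a `kind` field, which can drive a `switch` instead of re-searching the matched text. This only identifies what the pattern found; it does not make the pattern any more reliable on real-world HTML.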

parse the DOM as is; esp. since it is javascript. regex would overcomplicate things — hjpotter92, Nov 21 '14 at 11:43
The reason you're having trouble is because you're trying to mow the lawn (parse HTML) with a screwdriver (regexps). Parse HTML with an HTML parser. Navigate and manipulate HTML with the HTML DOM. Don't think of the DOM as a string. — , Nov 21 '14 at 11:46
The problem is that I have to analyse the text copied into a textarea; sometimes this text is not HTML, and sometimes it is, because users copy and paste the whole page without knowing — tt0686, Nov 21 '14 at 11:54
I fail to see why this is such a special case where you have to analyze the text with regex. Just properly escape the text when inserted into the database, and when displayed on the page. — Uyghur Lives Matter, Nov 21 '14 at 13:13

score 1 · Answer 1 · edited May 23 '17 at 11:57

You can't parse HTML with regex. If you are using JavaScript you might consider using a documentFragment to manipulate DOM elements.
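
A minimal sketch of the DocumentFragment approach, assuming the code runs in a browser. The function names `parseToFragment` and `cleanMarkup` are made up for illustration, and using a `<template>` element to obtain the fragment is one option among several, not the only way to do it.

```javascript
// Parse pasted text into a DocumentFragment (browser only).
// Content parsed into a <template> is inert: scripts in it are not executed.
function parseToFragment(text) {
  const template = document.createElement("template");
  template.innerHTML = text;
  return template.content;
}

// Clean any root that supports querySelectorAll (a fragment or an element).
// Returns the tr elements that have no td anywhere inside them.
function cleanMarkup(root) {
  // 1. Drop iframe, script and input elements.
  root.querySelectorAll("iframe, script, input").forEach((el) => el.remove());

  // 2. Drop width from table/tr/td: both the attribute and the inline style.
  root.querySelectorAll("table, tr, td").forEach((el) => {
    el.removeAttribute("width");
    el.style.removeProperty("width");
  });

  // 3. Collect rows without a td descendant (rows with only th also qualify).
  return Array.from(root.querySelectorAll("tr")).filter(
    (tr) => tr.querySelector("td") === null
  );
}
```

In a page this might be used as `const rows = cleanMarkup(parseToFragment(textarea.value));`, where `textarea` is whatever element holds the pasted text. Plain text without markup should simply end up as a text node in the fragment, so the same code path can handle both kinds of input.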

RegEx match open tags except XHTML self-contained tags

answered Nov 21 '14 at 11:49

Sam Greenhalgh

The problem is that I have to analyse the text copied into a textarea; sometimes this text is not HTML, and sometimes it is, because users copy and paste the whole page without knowing – tt0686 Nov 21 '14 at 11:55
