I need to use JavaScript to replace the src attribute of an iframe by analyzing raw (text-based) markup before it is injected into the DOM. A simple version could use a regular expression like this:
/<iframe\s*[^>]*src\s*=\s*(\"[^\"]*\"|\'[^\']*\'|[^\s>]*)/
Note: I'm only interested in iframes that have the src attribute set.
This will probably work in most cases, but the > character can occur within a quoted attribute value, for example <iframe id="pointforward\>" src="http://www.test.se">
which will cause the simple regular expression to fail to match (admittedly not something you would see much, but it is allowed, so such variants can occur).
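To make the failure concrete, here is a minimal sketch of how the simple pattern might be used for the replacement. The helper name replaceIframeSrcSimple and the about:blank replacement are made up for illustration, and the sketch assumes the new value contains no double quote.

```javascript
// The simple pattern from above, unchanged.
const SIMPLE_IFRAME_SRC = /<iframe\s*[^>]*src\s*=\s*(\"[^\"]*\"|\'[^\']*\'|[^\s>]*)/;

// Hypothetical helper: swaps the first matched src value for newSrc.
// Assumption: newSrc contains no double quote (no escaping is done here).
function replaceIframeSrcSimple(html, newSrc) {
  return html.replace(SIMPLE_IFRAME_SRC, (match, value) => {
    // The captured value sits at the very end of the match, so cut it off
    // and append the replacement wrapped in double quotes.
    const prefix = match.slice(0, match.length - value.length);
    return prefix + '"' + newSrc + '"';
  });
}

const usual = '<iframe width="100" src="http://www.test.se"></iframe>';
// The doubled backslash yields one literal backslash, as in the example tag.
const tricky = '<iframe id="pointforward\\>" src="http://www.test.se">';

console.log(replaceIframeSrcSimple(usual, 'about:blank'));
// <iframe width="100" src="about:blank"></iframe>
console.log(replaceIframeSrcSimple(tricky, 'about:blank') === tricky);
// true: the pattern does not match, so the input comes back unchanged
```

Using a replacer function rather than a replacement string avoids any special handling of $ sequences in the new value.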
I came up with the following more complex version:
/<iframe\s+(?:\b(?:(src)|\w+)\b\s*=\s*((?:\"(?:\\[\s\S]|[^\"\\])*\")|(?:\'(?:\\[\s\S]|[^\'\\])*\')|[^\s>]+)\s*|\b(\w+)\b\s*)*>/
Breakdown:
<iframe\s+
(?:
\b(?:(src)|\w+)\b             "src" or any attribute name (capture if src)
\s*=\s*                       equals sign, possibly surrounded by whitespace
((?:                          capture group for value
\"(?:\\[\s\S]|[^\"\\])*\"     value enclosed in double quotes
)|(?:                         OR
\'(?:\\[\s\S]|[^\'\\])*\'     value enclosed in single quotes
)|                            OR
[^\s>]+                       value with no quotes
)\s*|                         OR
\b(\w+)\b\s*                  standalone attribute
)*>                           0..N attributes and closing >
How the regular expression works: it is intended to match iframe tags with a variable number of attributes. Attributes can be name=value pairs or standalone (such as seamless). The name=value pairs can have the value enclosed in double quotes, single quotes or no quotes, and the text between the quotes can contain escaped characters (including escaped double and single quotes). The first capturing group captures the src attribute name and the second captures the value. I will use the capturing groups to extract the src attribute values.
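A quick way to check what the capturing groups actually hold is to run the pattern on a couple of tags. The sketch below uses a made-up helper, srcCaptures. One thing to keep in mind while reading the output: in JavaScript, capturing groups inside a repeated group are reset on every repetition, so groups 1 and 2 only reflect the last attribute that was matched.

```javascript
// The complex pattern from above, unchanged.
const COMPLEX_IFRAME = /<iframe\s+(?:\b(?:(src)|\w+)\b\s*=\s*((?:\"(?:\\[\s\S]|[^\"\\])*\")|(?:\'(?:\\[\s\S]|[^\'\\])*\')|[^\s>]+)\s*|\b(\w+)\b\s*)*>/;

// Hypothetical helper: returns groups 1 and 2, mapping undefined to null.
function srcCaptures(html) {
  const m = COMPLEX_IFRAME.exec(html);
  if (m === null) return null; // the tag did not match at all
  return { name: m[1] ?? null, value: m[2] ?? null };
}

// src is the last attribute: both groups are filled (quotes included).
console.log(srcCaptures('<iframe id="pointforward\\>" src="http://www.test.se">'));
// name: 'src', value: '"http://www.test.se"'

// src is followed by another attribute: group 1 is reset and
// group 2 holds the value of the last attribute instead.
console.log(srcCaptures('<iframe src="http://www.test.se" width="100">'));
// name: null, value: '"100"'
```

So the groups can only be relied on when src happens to be the last attribute in the tag. The captured value also includes its surrounding quotes.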
I am looking for advice on the quality of the regular expressions and on which one to choose. The simple one risks missing some unusually formatted iframe tags (and maybe others too - please let me know if there are other problems I should consider). I don't really feel confident about the complex one: although I wrote it myself, I don't feel on top of it and wonder whether I've really covered all the alternative ways an iframe tag can be formatted. Also, being more complex, it might run sluggishly. I've tested the two versions only with simple code variations so far, and will obviously have to do more testing, but I wanted to get some feedback on which route to choose: the simpler version, accepting that it will sometimes miss tags, or the more complex version (which might miss tags because it's too clever). I am aware of the general problems of using regular expressions with HTML (you can never catch everything), and that the srcdoc attribute, if used, would mess things up for me (but it would not be used together with the src attribute, so I should be safe there).
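For the further testing mentioned above, a small harness along these lines could run both patterns over a list of tag variants. The variants are only illustrative picks, not an exhaustive list of legal iframe markup, and the names SIMPLE, COMPLEX, variants and results are made up for this sketch.

```javascript
// Both patterns exactly as given in the question.
const SIMPLE = /<iframe\s*[^>]*src\s*=\s*(\"[^\"]*\"|\'[^\']*\'|[^\s>]*)/;
const COMPLEX = /<iframe\s+(?:\b(?:(src)|\w+)\b\s*=\s*((?:\"(?:\\[\s\S]|[^\"\\])*\")|(?:\'(?:\\[\s\S]|[^\'\\])*\')|[^\s>]+)\s*|\b(\w+)\b\s*)*>/;

// Illustrative variants (assumptions, chosen to probe different syntax forms).
const variants = [
  "<iframe src='a.html'>",            // single-quoted value
  '<iframe src=a.html>',              // unquoted value
  '<iframe seamless src="a.html">',   // standalone attribute before src
  '<IFRAME SRC="a.html">',            // uppercase tag and attribute name
  '<iframe data-x="1" src="a.html">', // hyphenated attribute name
  '<iframe data-src="lazy.html">',    // no real src attribute at all
];

// Neither pattern has the g flag, so test() carries no state between calls.
const results = variants.map((html) => ({
  html,
  simple: SIMPLE.test(html),
  complex: COMPLEX.test(html),
}));

for (const r of results) {
  console.log(`simple=${r.simple} complex=${r.complex}  ${r.html}`);
}
```

On these inputs neither pattern matches the uppercase tag (there is no i flag), the complex pattern rejects tags with hyphenated attribute names because \w does not match "-", and the simple pattern treats data-src as if it were src.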
Question recap: I want advice on the quality of the regular expressions and on which strategy to pursue - simple/complex.