
I need to use JavaScript to replace the src attribute of an iframe by analyzing raw (text-based) code, before it is injected into the DOM. A simple version could use a regular expression like this:

/<iframe\s*[^>]*src\s*=\s*(\"[^\"]*\"|\'[^\']*\'|[^\s>]*)/

Note: I'm only interested in iframes that have the src attribute set.

This will probably work in most cases, but the > character can occur within a quoted attribute value, for example <iframe id="pointforward\>" src="http://www.test.se">, which causes the simple regular expression to fail to match. (Granted, this is not something you would see often, but it is allowed, so such variants can occur.)
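To make the failure case concrete, here is a minimal sketch that runs the simple pattern against a plain tag and against the tag with a > inside the id value. The helper name findSrcSimple and the two sample strings are only for illustration.

```javascript
// The simple pattern from above, unchanged.
const simpleIframeSrc = /<iframe\s*[^>]*src\s*=\s*(\"[^\"]*\"|\'[^\']*\'|[^\s>]*)/;

// Hypothetical helper: returns the raw src value (quotes included), or null.
function findSrcSimple(html) {
  const match = simpleIframeSrc.exec(html);
  return match ? match[1] : null;
}

const plain = '<iframe id="a" src="http://www.test.se">';
// "\\>" in the JavaScript literal produces \> in the actual string.
const tricky = '<iframe id="pointforward\\>" src="http://www.test.se">';

console.log(findSrcSimple(plain));  // "http://www.test.se" (with the quotes)
console.log(findSrcSimple(tricky)); // null: [^>]* stops at the > inside id
```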

I came up with the following more complex version:

/<iframe\s+(?:\b(?:(src)|\w+)\b\s*=\s*((?:\"(?:\\[\s\S]|[^\"\\])*\")|(?:\'(?:\\[\s\S]|[^\'\\])*\')|[^\s>]+)\s*|\b(\w+)\b\s*)*>/

Breakdown:
<iframe\s+
(?:
    \b(?:(src)|\w+)\b               "src" or any attribute name (capture if src)
    \s*=\s*                         equals sign, possibly surrounded by whitespace
    ((?:                            Capture group for value
        \"(?:\\[\s\S]|[^\"\\])*\"   value enclosed in double quotes
    )|(?:                           OR
        \'(?:\\[\s\S]|[^\'\\])*\'   value enclosed in single quotes
    )|                              OR
        [^\s>]+                     value with no quotes
)\s*|                               OR
    \b(\w+)\b\s*                    standalone attribute
)*>                                 0..N attributes and closing > 

Description of how the regular expression works: it is intended to match iframe tags with a variable number of attributes. Attributes can be name=value pairs or standalone (such as seamless). In a name=value pair the value can be enclosed in double quotes, single quotes or no quotes, and the text between the quotes can contain escaped characters (including escaped double and single quotes). The first capturing group captures the src attribute name and the second captures the value. I will use the capturing groups to extract the src attribute values.
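A small sketch of reading the two capturing groups, using a hypothetical helper readGroups and made-up sample tags. One thing to watch for: as far as I understand JavaScript's regex semantics, capturing groups inside a repeated group are reset on every repetition, so after a match they only reflect the last attribute in the tag. If that holds, group 1 is only set when src happens to be the last attribute.

```javascript
// The complex pattern from above, unchanged.
const complexIframe = /<iframe\s+(?:\b(?:(src)|\w+)\b\s*=\s*((?:\"(?:\\[\s\S]|[^\"\\])*\")|(?:\'(?:\\[\s\S]|[^\'\\])*\')|[^\s>]+)\s*|\b(\w+)\b\s*)*>/;

// Hypothetical helper: exposes capture groups 1 and 2 after a match.
function readGroups(html) {
  const m = complexIframe.exec(html);
  return m ? { name: m[1], value: m[2] } : null;
}

// src is the last attribute: the groups hold what was intended.
const srcLast = readGroups('<iframe id="pointforward\\>" src="http://www.test.se">');
console.log(srcLast);  // { name: 'src', value: '"http://www.test.se"' }

// src is followed by another attribute: the groups hold the id pair instead.
const srcFirst = readGroups('<iframe src="http://www.test.se" id="x">');
console.log(srcFirst); // { name: undefined, value: '"x"' }
```

So the pattern does cope with the > inside a quoted value, but extracting src through groups 1 and 2 seems to depend on the attribute order.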

I am looking for advice on the quality of these regular expressions and on which one to choose. The simple one risks missing some unusually formatted iframe tags (and maybe others as well - please let me know if there are other problems I should consider). I don't really feel confident with the complex one: although I wrote it myself, I don't feel on top of it, and I wonder whether I've really covered all the ways an iframe tag can be formatted. Being more complex, it might also be slow to run. So far I've tested the two versions only with simple code variations and will obviously have to do more testing, but I wanted some feedback on which route to choose: the simpler version, accepting that it will sometimes miss a tag, or the more complex version (which might miss tags because it's too clever).

I am aware of the general problems of using regular expressions with HTML (you can never catch everything), and that the srcdoc attribute, if used, would mess things up for me (but it would not be used together with the src attribute, so I should be safe there).

Question recap: I want advice on the quality of the regular expressions and on which strategy to pursue - simple/complex.

instantMartin
  • Can't you just query the iframe, then change its `src` property? Like `iframe.src = 'newsrc'`. Regex seems like the wrong tool for the job here. – elclanrs Mar 12 '15 at 10:18
  • Sorry - missed an important detail there in the description. I need to analyze the code as text. I've edited the description and added this information. Thanks @elclanrs for the pointer. – instantMartin Mar 12 '15 at 10:21
  • Why as text? Could you make a DOM element out of the string, then modify its src attribute? In any case, regex still looks like [the wrong tool](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – elclanrs Mar 12 '15 at 10:22
  • Like elclanrs said. You can safely parse HTML with a DOMParser, which won't run any scripts or load any resources, but will allow you to use all the usual DOM methods, CSS selectors etc to analyze the code. – Touffy Mar 12 '15 at 10:25
  • Using a DOMParser certainly sounds like a good idea. Originally I had a server-side version using PHP to modify the code, so I suppose I got a bit fixated on using regular expressions for that reason. I haven't used DOMParsers much, but it seems to be like working with the browser DOM except there is no rendering - right? You, @touffy, mentioned that scripts will not be run and resources will not be loaded. Is this the default behavior of the DOMParser or is that something you must/can configure? – instantMartin Mar 12 '15 at 11:05
  • A DOMParser will simply parse an HTML or XML string and return a Document. It is the only possible behavior (other than throwing an error). When you want the (possibly modified) Document to be displayed, you'll have to put its root (or other) element in the DOM of an existing window. – Touffy Mar 12 '15 at 11:10
  • From http://www.w3.org/TR/DOM-Parsing/#the-domparser-interface : "script elements get marked unexecutable and the contents of noscript get parsed as markup." – Touffy Mar 12 '15 at 11:15
  • What about overhead? I would have to create a temporary (non-rendered) DOM object with DOMParser and make my changes, and since the DOMParser does not load external resources and doesn't run JavaScript, I would have to inject the code into the DOM proper when it is to be rendered by the browser. Otherwise the resources won't be loaded, right? Or is there a way to get around this and turn the DOMParser object into a "normal" DOM object when it is moved to the DOM tree? Otherwise I will have the overhead of creating a temporary DOM object - maybe smaller than with regexp, but still.. – instantMartin Mar 13 '15 at 07:13
