javascript DOMParser parsing document not the string

Question

I've searched around the web and StackOverflow but didn't find anything quite like the problem I have.

I have the HTML string below:

var txtBoxForm = '<script src="http://ADDRESS"></script><noscript><a href="http://ADDRESS" target="_blank"><img src="http://ADDRESS" border=0 width=728 height=90></a></noscript>';

I am trying to parse it with:

var parser = new DOMParser();
var xmlDoc = parser.parseFromString(txtBoxForm, "text/xml");
alert(xmlDoc);
alert(xmlDoc.firstChild.nodeName);
alert(xmlDoc.firstChild.firstChild.nodeName);
alert(xmlDoc.firstChild.firstChild.firstChild.nodeName);
alert(xmlDoc.firstChild.firstChild.firstChild.firstChild.nodeName);

The problem is that even though the string begins with a <script> tag and there are no child nodes, I get the results below from the alerts:

alert(xmlDoc);   ->   [Object document]
alert(xmlDoc.firstChild.nodeName);    ->    html
alert(xmlDoc.firstChild.firstChild.nodeName);    ->    body
alert(xmlDoc.firstChild.firstChild.firstChild.nodeName);    ->    parseerror
alert(xmlDoc.firstChild.firstChild.firstChild.firstChild.nodeName);   ->    h3

So my questions are:

How come the parsed code does not begin with <script>, since the string does?
Am I doing something wrong?
How could I correctly parse that string code? My intention is to capture the src from the script and img tag.

Please help. Thanks.

Ruan Mendes · Accepted Answer · 2012-03-30T19:15:19.730

3

It seems you cannot pass a script tag to DOMParser, and there were a few other problems:

An XML document must have a single root element (I wrapped your code in <doc></doc>).
Scripts did not seem to be accepted (I changed the tag to <scripto>).
You must quote your attributes.
Empty elements such as <img> must be closed (I wrote it as <img ... />).

http://jsfiddle.net/mendesjuan/aVQaP/4/

var txtBoxForm =
  '<doc>'+
    '<scripto src="http://ADDRESS"></scripto>'+
    '<noscript>' + 
      '<a href="http://ADDRESS" target="_blank">'+
        '<img src="http://ADDRESS" border="0" width="728" height="90" />'+
      '</a></noscript></doc>';

var parser = new DOMParser();
var xmlDoc = parser.parseFromString(txtBoxForm, "text/xml");

// outputs http://ADDRESS
console.log( xmlDoc.getElementsByTagName("scripto")[0].getAttribute("src") );
// outputs http://ADDRESS
console.log( xmlDoc.getElementsByTagName("img")[0].getAttribute("src") );
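
If rewriting the user's markup into well-formed XML is not practical, another option to try is the "text/html" mode of DOMParser, which current browsers support and which uses the forgiving HTML parser. The following is only a minimal sketch under that assumption: collectSrcs is a hypothetical helper name, and the parser is passed in as an argument so the function does not depend on a global DOMParser.

```javascript
// Sketch: parse the fragment as HTML instead of XML.
// collectSrcs is a hypothetical helper, not part of any library.
function collectSrcs(markup, parser) {
  // "text/html" does not require a single root element,
  // quoted attributes, or a self-closed <img />.
  var doc = parser.parseFromString(markup, "text/html");
  var nodes = doc.querySelectorAll("script[src], img[src]");
  var result = [];
  for (var i = 0; i < nodes.length; i++) {
    result.push({
      tag: nodes[i].nodeName.toLowerCase(),
      src: nodes[i].getAttribute("src")
    });
  }
  return result;
}

// Browser usage. DOMParser is a browser API and is not defined in plain Node.js.
if (typeof DOMParser !== "undefined") {
  var userMarkup =
    '<script src="http://ADDRESS"><\/script>' +
    '<noscript><a href="http://ADDRESS" target="_blank">' +
    '<img src="http://ADDRESS" border=0 width=728 height=90></a></noscript>';
  console.log(collectSrcs(userMarkup, new DOMParser()));
}
```

Because the HTML parser always builds html, head and body elements around the input, it is safer to query for the tags you need than to walk firstChild chains.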

edited Mar 30 '12 at 19:15

answered Mar 30 '12 at 18:54

Ruan Mendes


Nope. Just tried that and got the same result as above on those alerts :( – decio Mar 30 '12 at 18:59
@decio Modified the example a little bit, no more `` – Ruan Mendes Mar 30 '12 at 19:16
I am trying that right now and will post the results. As I answered @Daxcode, I have another problem with reformatting the html code but will get to that after solving the first issue. thx – decio Mar 30 '12 at 19:32
Thanks, your solution works. As I told @Daxcode, I am just trying to find a way to cope with the user inserting unformatted JavaScript code in the text box field. The code I've posted is taken from a user's input, so I have to make sure the code is correctly formatted to parse it. – decio Apr 03 '12 at 18:53

Daxcode · Answer 2 · 2012-03-30T19:01:54.643

0

The string you would like to parse is malformed. If you try your script with a simple string, e.g. '<div><p>test</p></div>', it parses the elements as expected.

I assume the security policies don't allow grabbing script tags like that, in order to prevent script-loading manipulation and the like.

As for using regular expressions instead, the snippet below gives you the src values of both tags in your string.

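<script type="text/javascript">
// "<\/script>" is escaped so the string literal does not end this inline script block early.
var txtBoxForm = '<div><script src="http://ADDRESS"><\/script><noscript><a href="http://ADDRESS" target="_blank"><img src="http://ADDRESS" border=0 width=728 height=90></a></noscript></div>';
// The g flag lets exec() step through every src attribute instead of stopping at the first.
var exp = /src="([^"]*)"/gi;
var match;
while ((match = exp.exec(txtBoxForm)) !== null) {
  console.log(match[1]); // the captured src value
}
</script>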
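
If more than the src is needed, for example every attribute of the img tag, the same regular-expression idea can be extended. This is only a rough sketch: readAttributes is a hypothetical helper name, and it assumes tagName is a plain tag name and that no attribute value contains a ">" character. Regular expressions stay fragile on arbitrary HTML.

```javascript
// Sketch: collect all attributes of the first matching tag.
// readAttributes is a hypothetical helper, not a library function.
function readAttributes(markup, tagName) {
  // Capture everything between "<tagName" and the next ">".
  var tagMatch = new RegExp("<" + tagName + "\\b([^>]*)>", "i").exec(markup);
  if (!tagMatch) {
    return null;
  }
  var attrs = {};
  // Matches name="value", name='value' and unquoted name=value.
  var attrExp = /([^\s=\/]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/g;
  var m;
  while ((m = attrExp.exec(tagMatch[1])) !== null) {
    var value = m[2] !== undefined ? m[2] : (m[3] !== undefined ? m[3] : m[4]);
    attrs[m[1].toLowerCase()] = value;
  }
  return attrs;
}

var imgTag = '<img src="http://ADDRESS" border=0 width=728 height=90>';
console.log(readAttributes(imgTag, "img"));
```

The helper returns a plain object keyed by lower-cased attribute name, or null when the tag is not found, so the caller can decide how to handle markup that does not contain the expected tag.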

edited Mar 30 '12 at 19:01

answered Mar 30 '12 at 18:48

Daxcode


Yes, I know it's not well formatted, but that is the normal code inserted into a form text field by a user who is posting a JavaScript tag for an ad. I have to treat the content to make sure there is no unintended code so I can show it later on. From what you said, Daxcode, how should I preformat the code to be able to parse it? I've tried only adding a `` to the beginning and a `` to the end to no avail. – decio Mar 30 '12 at 18:55
have you tried to use regular expressions instead, in order to grab the src attrib values? – Daxcode Mar 30 '12 at 18:56
I've extended my answer with an example using regular expressions instead – Daxcode Mar 30 '12 at 19:02
Well, no, I was trying the DOM methods since I can do it with PHP, but I need to do it in JavaScript, and because of the many recommendations I've found. [http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Is there any other way? – decio Mar 30 '12 at 19:03
So you primarily need the DOM structure that reflects the string, rather than the src attribute values in particular? – Daxcode Mar 30 '12 at 19:14
Yeah, it worked! Thanks! Now that I see it working that way, I am puzzled, though. I was trying to go with the DOM because I am not sure that is the only data I will have to capture from the posted tag. For example, in the code I posted I would have to capture all the img attributes as well. Isn't there really another way to use the DOM with that kind of code? – decio Mar 30 '12 at 19:15
I think my comment above explains a bit further. It's an ad-submitting application. Since the user can submit external tags, and those come in different formats depending on the ad server used, I need to capture some essential data from the tag, but I won't always know how it will be formatted. At first I tried something like getElementsByTagName, which would be much easier, but that didn't work. thx – decio Mar 30 '12 at 19:19
Therefore you should refer to Juan Mendes' answer. But since script tags are not allowed to be parsed on the client side, I doubt a solution that gets the corresponding DOM nodes will work for you. Don't you have the chance to do that on the server side instead? – Daxcode Mar 30 '12 at 19:23
That's the problem. I end up with the same problem I posted in [this question](http://stackoverflow.com/questions/9911887/securelly-posting-and-then-printing-javascript-tags). Chrome and Safari seem to filter that kind of code when I print it (so the trafficker submitting the ad can check and see if it's working OK) and throw an error: Resource interpreted as Script but transferred with MIME type text/html: "about:blank". I've tried to break down the code with PHP and reconstruct it to print it, but I get the same error. So I thought of doing the same with JS and sending only an array of values. – decio Mar 30 '12 at 19:31
:( Wasn't me. I can't vote yet, it seems. :( Actually I found both of your answers really helpful. I am just trying to find a way to cope with the user inserting unformatted JavaScript code in the text box field. The code I've posted is taken from a user's input, so I have to make sure the code is correctly formatted to parse it. – decio Apr 03 '12 at 18:53
