
I need to load several websites in iframes whilst also injecting the Google Translate widget into each page so they can be translated. Here's my code for the insertion part:

<iframe onload="googleJS1(); googleJS2(); googleJS3();" class="iframe2" src="http://localhost:8888/mysitep"></iframe>

<script>
    // Insert the Google Translate container div at the top of the iframe's body.
    function googleJS1() {
        var iframe = document.getElementsByTagName('iframe')[0];
        var doc = iframe.contentWindow.document;
        var container = doc.createElement('div');
        container.setAttribute("id", "google_translate_element");
        var body = doc.getElementsByTagName('body')[0];
        body.insertBefore(container, body.childNodes[0]);
    }

    // Load the Google Translate element script into the iframe's head.
    function googleJS2() {
        var iframe = document.getElementsByTagName('iframe')[0];
        var doc = iframe.contentWindow.document;
        var newScript = doc.createElement('script');
        newScript.setAttribute("src", "http://translate.google.com/translate_a/element.js?cb=googleTranslateElementInit");
        var head = doc.getElementsByTagName('head')[0];
        head.insertBefore(newScript, head.childNodes[1]);
    }

    // Load my local google.js helper into the iframe's head.
    function googleJS3() {
        var iframe = document.getElementsByTagName('iframe')[0];
        var doc = iframe.contentWindow.document;
        var newScript = doc.createElement('script');
        newScript.setAttribute("src", "http://localhost:8888/mysite/google.js");
        var head = doc.getElementsByTagName('head')[0];
        head.insertBefore(newScript, head.childNodes[2]);
    }
</script>

This works as long as the iframe target URL is on the same server. I read that to bypass the same-origin constraint I should set up a proxy server and pass all URL requests through it. So I read up on cURL and tried this as a test:

<?php

// Fetch a remote URL with cURL and return the response body as a string.
function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    $userAgent = 'Mozilla/5.0 (compatible; MyProxy/1.0)'; // user agent string to send
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);        // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // follow redirects
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

$test = get_data("http://www.selfridges.com");
echo $test;

?>

The basic HTML elements load, but no CSS or images do, and the links still point to the original URL. I need suggestions on how I can also pull the CSS, images and JS off the target URL through the proxy and serve the pages from there, so that everything appears to come from the same domain and port, bypassing the same-origin policy. I also need the links to work in this fashion.

e.g.:

main page - http://localhost:8888/proxy.php 

links     - http://localhost:8888/proxy.php/products/2012/shoes
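
To show the direction I'm imagining, here's a rough, untested sketch of how proxy.php might map those paths onto the target site. It reuses the get_data() function from my test above; the hard-coded target host and the reliance on PATH_INFO are just assumptions for the example:

<?php
// proxy.php (rough sketch): map http://localhost:8888/proxy.php/products/2012/shoes
// to the same path on the target site and return the fetched page.
// get_data() is the cURL helper shown above.

$target = "http://www.selfridges.com";                                // hard-coded target host
$path   = isset($_SERVER['PATH_INFO']) ? $_SERVER['PATH_INFO'] : "/"; // e.g. "/products/2012/shoes"

echo get_data($target . $path);
?>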

Any other methods or alternatives are also welcome.

Thanks

jc.yin

1 Answer


Assuming all the links & images in your target documents are relative, you could inject a base tag into the head. This would effectively make the links absolute, so the links & images would still refer to the target domain (not yours).

http://reference.sitepoint.com/html/base
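
For example, something like this in the proxy script (just a sketch: $html is the markup your cURL fetch returned and $targetUrl is the address it came from, both placeholders):

<?php
// Sketch: insert a <base> element right after the opening <head> tag so that
// relative URLs in the fetched markup resolve against the target domain.
$baseTag = '<base href="' . htmlspecialchars($targetUrl, ENT_QUOTES) . '">';
$html    = preg_replace('/<head([^>]*)>/i', '<head$1>' . $baseTag, $html, 1);
?>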

Not sure how this would work with CSS background images, though.

A solution that will work consistently for any target site is going to be tough - you'll need to parse out links not only in the HTML, but in any CSS references. Some sites might use AJAX to populate the pages, which will cause same-origin policy issues on the target site too.
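
To give a feel for how much rewriting that involves, here's a very rough sketch. The $html and $css variables and the proxy.php?url= scheme are placeholders, and these regexes will miss plenty of real-world cases (inline styles, srcset, protocol-relative URLs, and so on):

<?php
// Sketch: point href/src attributes in the HTML, and url(...) references in CSS,
// back at the proxy. Relative URLs would still need resolving against the target
// site before the proxy can fetch them.
$proxy = 'http://localhost:8888/proxy.php?url=';

$html = preg_replace_callback(
    '/(href|src)=["\']([^"\']+)["\']/i',
    function ($m) use ($proxy) {
        return $m[1] . '="' . $proxy . urlencode($m[2]) . '"';
    },
    $html
);

$css = preg_replace_callback(
    '/url\(\s*["\']?([^"\')]+)["\']?\s*\)/i',
    function ($m) use ($proxy) {
        return 'url(' . $proxy . urlencode($m[1]) . ')';
    },
    $css
);
?>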

Adam Hopkinson
  • I was assuming that it would be something similar to the solution in this question http://stackoverflow.com/questions/6326297/load-external-sites-content where all of the links simply have to be parsed so that they have my "localhost:8888/" at the beginning. Would you say using wget to download the whole site and then running that off my server would be a solution? Would wget be able to download dynamic PHP sites? – jc.yin Jan 19 '13 at 22:04
  • Yes, but the sites would not be dynamic anymore – Adam Hopkinson Jan 19 '13 at 22:06
  • Let's say a site uses WordPress or some other CMS - after downloading with wget, would the contents still be accessible in static form, or would they not work at all? – jc.yin Jan 19 '13 at 22:20
  • Of course it will still be accessible - wget gets a page in the same way a browser does, it just doesn't display the page. – Adam Hopkinson Jan 19 '13 at 22:23
  • That's no good if the page doesn't display, so it's not really useful for my case. Referring to your answer, "This would effectively make the links absolute, so the links & images would still refer to the target domain (not yours)." Does this mean clicking on those links will take the user to the target domain? That would mean after following those links I would still run into the same-origin issue? – jc.yin Jan 19 '13 at 22:31
  • No - wget saves the page to a local folder. Your browser would display it, i was just explaining that the way wget fetches a page is the same way your browser does it. – Adam Hopkinson Jan 19 '13 at 22:33
  • Okay, I'll give it a try with wget. Once it's saved to a local folder I'm assuming I can refer to it like any other static site, right? e.g. localhost:8888/downloadedsite/index. I'm still slightly confused, though - what functions _will_ be missing from a wget-downloaded dynamic site versus a non-downloaded dynamic site? – jc.yin Jan 19 '13 at 22:41