I am trying to create a client-side scraper. I would like to use only JavaScript or jQuery, running on the client side, to fetch a page's HTML output (wrapped in JSON) and display it on my webpage.
Here is what I tried:

<html>
<head>
<script src="http://code.jquery.com/jquery-1.9.1.min.js"></script>
<script type="text/javascript">
   var ExternalURL = "www.example.com"; // This address must not contain any leading "http://"
   var ContentLocationInDOM = "#someNode > .childNode"; // To get sub-content from the page, specify a CSS-style jQuery selector here; otherwise set this to "null"

   $(document).ready(loadContent);
   function loadContent()
   {
      var QueryURL = "http://anyorigin.com/get?url=" + ExternalURL + "&callback=?";
      $.getJSON(QueryURL, function(data){
         if (data && typeof data == "object" && data.contents && typeof data.contents == "string")
         {
            // Strip <script> blocks from the fetched markup before inserting it
            data = data.contents.replace(/<script[^>]*>[\s\S]*?<\/script>/gi, "");
            if (data.length > 0)
            {
               if (ContentLocationInDOM && ContentLocationInDOM != "null")
               {
                  // Insert only the nodes matching the selector
                  $('#queryResultContainer').html($(ContentLocationInDOM, data));
               }
               else
               {
                  // Insert the whole fetched page
                  $('#queryResultContainer').html(data);
               }
            }
         }
      });
   }
</script>
</head>
<body>
<div id="queryResultContainer"></div>
</body>
</html>

However, I do not want to rely on another website's API to do this. As you can see, the code above uses such an API to fetch the HTML of the other website.

What I am looking for is a simple way to extract the HTML body content from a website and display it on my web page, with the request and response handled entirely on the client side and no server-side script involved. Please help me with your suggestions.

Jaffer Wilson
  • `fetches the html output in JSON format` - now that's the craziest idea I've seen - beware [ZA̡͊͠͝LGΌ](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Jaromanda X Apr 05 '17 at 05:50
  • Please fix your quotes. – Bergi Apr 05 '17 at 05:50
  • @JaromandaX beware of what my friend? – Jaffer Wilson Apr 05 '17 at 05:51
  • @Bergi What fixes? Please suggest some. – Jaffer Wilson Apr 05 '17 at 05:52
  • No, you will need to use some API, be it theirs or your own. [You are not allowed to directly access](https://en.wikipedia.org/wiki/Same-origin_policy) the HTML content of arbitrary pages from the client browser. – Bergi Apr 05 '17 at 05:52
  • @Bergi I have seen that if there is a request, then there is a response. What I am thinking of is getting the response and displaying the HTML. There is no hacking or anything like that involved. Obviously, I will obey the robots.txt – Jaffer Wilson Apr 05 '17 at 05:53
  • @JafferWilson Use `"` instead of `”“` and `'` instead of `‘’`. You probably need a proper editor. – Bergi Apr 05 '17 at 05:53
  • ZA̡͊͠͝LGΌ - he is Tony the Pony - see the link - you look like you're attempting to use regex on HTML – Jaromanda X Apr 05 '17 at 05:54
  • `you will need to use some API` - that's what `http://anyorigin.com/get?url=` (supposedly) does for you (though, it's `http://anyorigin.com/go?url=` if you look at the site) – Jaromanda X Apr 05 '17 at 05:54
  • @JafferWilson I know what you were thinking to do. It's not possible. You need a proxy – Bergi Apr 05 '17 at 05:58
  • Is there no mechanism to achieve this without the use of an API? – Jaffer Wilson Apr 05 '17 at 05:58
  • That's not possible for an arbitrary website, due to the [same origin policy](https://en.wikipedia.org/wiki/Same-origin_policy) restriction that's built into browsers. The remote website needs to explicitly allow CORS for your domain for this to work. Other than that, you will need a server-side proxy and then make the AJAX requests to this proxy. If what you are trying to achieve were possible, it would make the same origin policy useless (which is not the case). – Darin Dimitrov Apr 05 '17 at 06:03
  • @JafferWilson After fetching the URL, you should be able to do something like this: `var parser = new DOMParser(); var html = parser.parseFromString(data, "text/html");` to convert the returned data to HTML. – Coder828 May 04 '17 at 04:48
  • @Coder828 Thank you, but I didn't understand how to do that with my code. Could you please help me change my code as you describe? I would like to know more about it. – Jaffer Wilson May 04 '17 at 06:08
  • @JafferWilson You'll be taking your "data" parameter, converting it to HTML, and then you can parse the HTML. [This link should explain my comment (and previous one) better](https://developer.mozilla.org/en-US/docs/Web/API/DOMParser). – Coder828 May 04 '17 at 16:19
  • @Coder828 Thank you. Let me check.. – Jaffer Wilson May 05 '17 at 05:23
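
For reference, the `DOMParser` suggestion from the comments could be applied to the question's callback roughly as follows. This is a minimal sketch, not a tested drop-in: `extractHtml` and its optional `parser` argument are hypothetical names introduced here, and it assumes the HTML string has already been obtained (for example from `data.contents`). It removes `<script>` elements through the parsed DOM instead of a regular expression. It does not get around the same-origin policy; fetching the page still needs CORS or a proxy, as the comments point out.

```javascript
// Hypothetical helper: parse an HTML string and return the inner HTML of
// either the first node matching `selector` or the whole <body>.
// `parser` is optional; in a browser it defaults to a new DOMParser.
function extractHtml(htmlText, selector, parser) {
  var domParser = parser || new DOMParser();

  // Parse the string into a detached document.
  var doc = domParser.parseFromString(htmlText, "text/html");

  // Drop every <script> element from the parsed document.
  doc.querySelectorAll("script").forEach(function (node) {
    node.remove();
  });

  // Treat a missing selector (or the string "null", as in the question)
  // as "return the whole body".
  var useSelector = selector && selector !== "null";
  var target = useSelector ? doc.querySelector(selector) : doc.body;

  // Return an empty string when nothing matches.
  return target ? target.innerHTML : "";
}
```

Inside the `$.getJSON` callback, the result could then be inserted with something like `$("#queryResultContainer").html(extractHtml(data.contents, ContentLocationInDOM));`.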

0 Answers