I would like to scrape a website by just running code in a browser. In this case, the scraper has to run on a specific machine, and I cannot install any software on that machine. However, there is already a browser installed (recent version of Firefox), and I can configure the browser however I want.

What I would like is a JavaScript solution for scraping, contained in a webpage on site A, that can scrape site B. This seems likely to run into CORS-type problems; I assume that part of the solution is to disable cross-origin checks in the browser.

What I have tried so far: I searched for "web scraping in javascript". This brings up a lot of material intended to run in Node.js with cheerio (for example, this tutorial), as well as tools like pjscrape, which requires PhantomJS. However, I couldn't find anything equivalent that is intended to run in a browser.

P.S. This is interesting: "Firefox setting to enable cross domain ajax request". Apparently Chrome's --disable-web-security flag takes care of the cross-origin/cross-domain issues. Is there a Firefox equivalent?

P.S. It looks like the ForceCORS extension for Firefox is also useful: http://www-jo.se/f.pfleger/forcecors. I'm not sure I'll be able to install it, though.

P.S. This is a nice collection of ways to allow cross-domain in different browsers: http://romkey.com/2011/04/23/getting-around-same-origin-policy-in-web-browsers/ Sadly, the suggested Firefox solution doesn't work in versions >=5.

sideshowbarker
Alex I
  • Looks like you have some useful links to read over. What is your **specific** question? – Ray Nicholus Apr 10 '14 at 14:16
  • @RayNicholus: These are links to people trying to solve the same problem, but none of them describe a solution that works in recent Firefox versions, let's say newer than 2011. ForceCORS apparently fails, and the enablePrivilege() API is no longer available. – Alex I Apr 10 '14 at 17:05
  • What specific browsers are you targeting? Unless you have complete control of the browser, the only way you can reliably pull this off is by proxying the site you wish to scrape via a server you control. – Ray Nicholus Apr 10 '14 at 17:07
  • @RayNicholus: I'm targeting Firefox 11 or later. I do have complete control of the browser, but no ability to run a proxy. – Alex I Apr 11 '14 at 08:58
  • Your best bet would be to install your app as an extension, where the same origin policy enforcement is more under your control. – Ray Nicholus Apr 11 '14 at 14:46

1 Answer

Edit: It looks like the import.io service has shut down and the URL now points to something completely different. Consider this answer obsolete.

Try doing it with import.io (basically a scraping service with a REST API).

As soon as I have an example JavaScript call to the API, I can provide it. Otherwise, you can check the docs yourself.

Import.io allows you to structure the data you find on webpages into rows and columns, using simple point and click technology.

First you locate your data: navigate to a website using our browser (download it from us here: http://import.io).

Then, enter our dedicated data extraction workflow by clicking the pink IO button in the top right of the Browser.

We will guide you through structuring the data on the page. You teach import.io how to extract the data by showing us examples of where the data is. We create learning algorithms that generalize from these examples to work out how to get all the data on the website. The data you collect is stored on our cloud servers to be downloaded and shared. And every time you publish to our platform we create an API to get the data programmatically so you can easily integrate live web data into your applications or third party analytics and visualization software.

EDIT:

If the data recognition works in the browser, you can access the data by going to "simple API integration" and copying the URL.

export data in import.io

You can paste the URL here:

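// Log the parsed JSON once the response has loaded.
function reqListener() {
    try {
        console.log(JSON.parse(this.responseText));
    } catch (e) {
        console.error("Response was not valid JSON:", e);
    }
}

var oReq = new XMLHttpRequest();
oReq.addEventListener("load", reqListener);
oReq.addEventListener("error", function () {
    console.error("Request failed (network or cross-origin error)");
});
// Replace the placeholder with the URL copied from import.io.
oReq.open("GET", "yourUrlFromClipboardComesHere", true);
oReq.send();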

xhr request source
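
For browsers that support the fetch API, the same GET request could probably be written with promises instead of event listeners. This is only a rough sketch: fetchJson and its fetchImpl parameter are hypothetical names made up for illustration, not part of import.io or any library, and the URL is the same placeholder as in the XMLHttpRequest version. The browser's cross-origin rules still apply to the request.

```javascript
// Sketch only: fetchJson and fetchImpl are hypothetical names,
// not part of import.io or any library.
function fetchJson(url, fetchImpl) {
    // Use the browser's global fetch unless a replacement is passed in
    // (passing one in makes the function easy to try without a network).
    var doFetch = fetchImpl || fetch;
    return doFetch(url).then(function (response) {
        // fetch resolves even for HTTP error statuses, so check explicitly.
        if (!response.ok) {
            throw new Error("Request failed with status " + response.status);
        }
        // Parse the response body as JSON; rejects if it is not valid JSON.
        return response.json();
    });
}
```

It would be called as fetchJson("yourUrlFromClipboardComesHere").then(function (data) { console.log(data); }), with a .catch() added to handle network, cross-origin, or parse errors.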

philx_x