28

I wonder if it's possible to scrape an external (cross-domain) page through the user's IP?

For a shopping comparison site, I need to scrape pages of an e-commerce site, but many requests from my server would get me banned, so I'm looking for a way to do client-side scraping: request the pages from the user's IP and send them to my server for processing.

eozzy
  • Sending a random `user-agent` string always works for me; I don't get banned (see the sketch after these comments). Even if I do, I'll change my IP. Or you can use Selenium to generate a full-browser request – Umair Ayub Jul 23 '15 at 07:49
  • Are you making hundreds of requests per minute? I'm talking about that much volume. I know about the user-agent trick, it's easy, but the IP? – eozzy Jul 23 '15 at 07:50
  • Yes, I make hundreds of requests per minute. Why not schedule a VPN to change your IP at a regular interval if you are getting blocked? http://www.adeepbite.com/hidemyass-vpn-review/#Schedule_IP_Address_Change – Umair Ayub Jul 23 '15 at 07:56
  • A VPN is more reliable than proxies, but I'll be scraping programmatically from the server using PHP or Node (not sure yet). Does HMA have an API? – eozzy Jul 23 '15 at 08:08
  • I guess HMA doesn't have an API. As you said, you'll be scraping programmatically, so you can run your PHP scripts as you want and run HMA separately as an application, scheduling the IP change – Umair Ayub Jul 23 '15 at 08:16
  • I'm still asking myself what the [jquery], [php] and [phantomjs] tags are doing here. – Artjom B. Jul 26 '15 at 12:55
  • @FuzzyTree I'm not certain I follow you. All the websites are accessible in a browser, and the ones I'm after don't require authentication cookies or anything. – eozzy Jul 27 '15 at 16:58
  • @FuzzyTree Hmm... sorry, but I still don't follow you. Can you please put up an example somewhere? – eozzy Jul 27 '15 at 17:27
  • Just use Greasemonkey or Tampermonkey to monitor the URLs in question, exporting the data via postMessage() when the page loads. Then, on another tab, use something like PubNub/Pusher/etc. to change the src of an iframe on that page to one of the ones the userscript watches. All in all, ten lines of code max (see the userscript sketch after these comments). Note that iframes can be blocked, so you might need to use window.open, but you can reuse that popup on many shopping sites. – dandavis Jul 28 '15 at 06:29
  • @dandavis But Greasemonkey is an add-on that needs to be installed on the user's machine, right? – eozzy Jul 28 '15 at 06:32
  • Consider implementing a browser extension to do the scraping. It can bypass the same-origin policy. – NeoWang Jul 31 '15 at 08:10
  • @NeoWang Hmm... is there a tutorial, getting-started guide, or open-source demo available that I can use as a base? – eozzy Jul 31 '15 at 09:22
  • @3zzy http://developer.chrome.com/extensions/getstarted.html It's just JavaScript, with fewer restrictions and more privileges. – NeoWang Jul 31 '15 at 10:04
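For illustration, a minimal Node.js sketch of the user-agent rotation suggested in the comments (core modules only; the host, path, and agent strings are placeholders):

```js
var https = require('https');

// A small pool of plausible desktop user-agent strings (placeholders).
var userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.12',
  'Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0'
];

function fetchPage(host, path, cb) {
  // Pick a different user-agent for every request.
  var ua = userAgents[Math.floor(Math.random() * userAgents.length)];
  https.get({ host: host, path: path, headers: { 'User-Agent': ua } },
    function (res) {
      var body = '';
      res.setEncoding('utf8');
      res.on('data', function (chunk) { body += chunk; });
      res.on('end', function () { cb(null, body); });
    }).on('error', cb);
}

fetchPage('shop.example', '/product/123', function (err, html) {
  if (err) throw err;
  console.log(html.length + ' characters received');
});
```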
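And for dandavis's Greasemonkey/Tampermonkey suggestion, a minimal sketch of the userscript half (the @match pattern, placeholder domain, and message shape are all assumptions):

```js
// ==UserScript==
// @name     export-shop-data
// @match    http://shop.example/product/*
// @grant    none
// ==/UserScript==

// When a watched page loads, hand its contents back to whichever window
// opened it (a controlling tab via window.open, or the parent of an iframe).
(function () {
  var payload = {
    url: location.href,
    title: document.title,
    html: document.documentElement.outerHTML
  };
  var target = window.opener || window.parent;
  if (target && target !== window) {
    target.postMessage(JSON.stringify(payload), '*');
  }
})();
```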

4 Answers

43

No, you won't be able to use your clients' browsers to scrape content from other websites using JavaScript, because of a security measure called the same-origin policy.

There is no way to circumvent this policy, and for good reason: imagine you could instruct your visitors' browsers to do anything on any website. That's not something you want to happen automatically.
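For illustration, here is roughly what happens if ordinary page script tries anyway (shop.example stands in for the target site):

```js
// Running on your own comparison site: the browser may send the request,
// but without CORS headers from shop.example it refuses to let your
// script read the response.
var xhr = new XMLHttpRequest();
xhr.open('GET', 'http://shop.example/product/123', true);
xhr.onload = function () { console.log(xhr.responseText); }; // never readable
xhr.onerror = function () { console.error('blocked by the same-origin policy'); };
xhr.send();
```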

However, you could create a browser extension to do that. JavaScript browser extensions can be granted more privileges than regular page JavaScript.
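For example, a minimal sketch of such an extension (Manifest V2, which was current at the time; shop.example and the collection endpoint are placeholders):

```js
// background.js -- runs in the extension's background page, which is
// exempt from the same-origin policy for any origin listed under
// "permissions" in manifest.json, e.g.:
//
//   {
//     "manifest_version": 2,
//     "name": "Price watcher",
//     "version": "0.1",
//     "permissions": ["http://shop.example/*", "https://your-server.example/*"],
//     "background": { "scripts": ["background.js"] }
//   }

function scrape(url) {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', url, true);
  xhr.onload = function () {
    // Forward the raw HTML to your own server for processing.
    var post = new XMLHttpRequest();
    post.open('POST', 'https://your-server.example/collect', true);
    post.setRequestHeader('Content-Type', 'text/plain');
    post.send(xhr.responseText);
  };
  xhr.send();
}

scrape('http://shop.example/product/123');
```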

Adobe Flash has similar security features, but I guess you could use Java (not JavaScript) to create a web scraper that uses your users' IP addresses. Then again, you probably don't want to do that, as Java plugins are considered insecure (and slow to load!) and not all users will even have one installed.

So now back to your problem:

I need to scrape pages of an e-commerce site, but many requests from my server would get me banned.

If the owner of that website doesn't want you to use their service in that way, you probably shouldn't do it. Otherwise, you would risk legal consequences (look here for details).

If you are on the "dark side of the law" and don't care whether it's illegal, you could use something like http://luminati.io/ to route your requests through the IP addresses of real people.

Johann Bauer
  • JavaScript browser plugins? Do you mean the browser extensions found in the Chrome and Mozilla stores? I'm OK with Java because all I need is the page HTML; processing is done server-side. And no, it's not illegal. The providers just don't have an API yet, and those that do don't provide the content I need. – eozzy Jul 26 '15 at 05:06
  • Luminati is nice, but $1,000? Heck! BTW, Java is an option; any other alternatives? Can Flash scrape content? – eozzy Jul 26 '15 at 05:08
  • @3zzy So the website owner keeps banning you by accident? ;) – Johann Bauer Jul 26 '15 at 09:55
  • @3zzy Yes, that's really expensive. But as you are buying traffic from the (probably already slow) home internet of Hola users, I guess they can sell it at that price. Maybe there are cheaper alternatives but I don't know. – Johann Bauer Jul 26 '15 at 09:59
  • They haven't banned me, and they probably won't. My question is for when I get loads of traffic. – eozzy Jul 26 '15 at 10:00
  • And no, as I said, Flash has similar security measures. – Johann Bauer Jul 26 '15 at 10:00
  • So indeed there is no hack or workaround for my problem; therefore I've decided to go with browser plugins/extensions and hybrid apps for mobile platforms. – eozzy Jul 31 '15 at 14:23
  • I know this is an old question, but I wanted to mention there is one way around this: you can use a reverse proxy (e.g. `nginx`) to dynamically `proxy_pass` a domain to another one (i.e. `google.com` to `google.example.com`) and add or strip headers from it (e.g. `Access-Control-Allow-Origin: *`). This tricks the user's browser into thinking that the responding server does not care whether it is scraped, but it does require overhead on your part in running the `nginx` service (see the sketch after these comments). Due to the change of domains, thank goodness, sessions are not transferable (no access to logged-in accounts). – felipe Sep 26 '20 at 15:01
  • The only way you could do this is by using Electron (packaged with your compiled app) or by just opening pages directly from disk: since local pages don't have an origin, they ignore CORS policies, and you can access any remote resource from JavaScript on a page loaded that way. – Igor Gunin Jun 22 '23 at 00:53
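A hypothetical sketch of the reverse-proxy trick from the comments above (all domains are placeholders; `proxy_pass`, `proxy_hide_header`, and `add_header` are standard nginx directives):

```nginx
server {
    listen 80;
    server_name shop.example.com;            # a domain you control

    location / {
        proxy_pass https://shop.example;     # the site being proxied
        proxy_set_header Host shop.example;
        # Strip the upstream's CORS policy and substitute a permissive one,
        # so the visitor's browser lets your JavaScript read the response.
        proxy_hide_header Access-Control-Allow-Origin;
        add_header Access-Control-Allow-Origin * always;
    }
}
```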
6

Basically, browsers are designed to prevent you from doing this…

The solution everyone thinks about first:

jQuery/JavaScript: accessing contents of an iframe

But it will not work in most cases with "recent" browsers (less than 10 years old).

Alternatives are:

  • Using the official APIs of the server (if any)
  • Try finding out whether the server provides a JSONP service (good luck; see the sketch after this list)
  • Being on the same domain, try cross-site scripting (if possible; not very ethical)
  • Using a trusted relay or proxy (but this will still use your own IP)
  • Pretend you are the Google web crawler (why not, but not very reliable, and no guarantees about it)
  • Use a hack to set up the relay/proxy on the client itself; I can think of Java or possibly Flash (will not work on most mobile devices, is slow, and Flash has its own cross-site limitations too)
  • Ask Google or another search engine for the content (you might then have a problem with the search engine if you abuse it…)
  • Just do the job yourself and cache the answers, in order to offload their server and decrease the risk of being banned
  • Index the site yourself (with your own web crawler), then use your own index (depends on how frequently the source changes): http://www.quora.com/How-can-I-build-a-web-crawler-from-scratch
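Here is a sketch of the JSONP option from the list above, assuming the target server honors a `callback` query parameter (most don't; the endpoint is a placeholder):

```js
// Inject a <script> tag: script loads are exempt from the same-origin
// policy, so the server can reply with executable JS that calls us back.
function jsonp(url, cb) {
  var name = 'jsonp_cb_' + Date.now();
  var script = document.createElement('script');
  window[name] = function (data) {
    delete window[name];
    script.parentNode.removeChild(script);
    cb(data);
  };
  script.src = url + (url.indexOf('?') >= 0 ? '&' : '?') + 'callback=' + name;
  document.head.appendChild(script);
}

// Only works if the server actually supports JSONP.
jsonp('http://shop.example/api/products', function (data) {
  console.log(data);
});
```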

[EDIT]

One more solution I can think of is going through a YQL service; in this manner it is a bit like using a search engine / public proxy as a bridge to retrieve the information for you. In short, you get cross-domain GET requests. Here is a simple example of how to do so.
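(A rough sketch only: the endpoint and response shape are YQL's as they existed at the time; the target URL and XPath are placeholders.)

```js
// Ask Yahoo's servers to fetch the page and return the matching nodes;
// the public YQL endpoint answered with permissive CORS headers, so this
// could run from any origin.
var query = 'select * from html where url="http://shop.example/product/123"' +
            ' and xpath="//h1"';
var url = 'https://query.yahooapis.com/v1/public/yql' +
          '?q=' + encodeURIComponent(query) + '&format=json';

var xhr = new XMLHttpRequest();
xhr.open('GET', url, true);
xhr.onload = function () {
  console.log(JSON.parse(xhr.responseText).query.results);
};
xhr.send();
```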

Flavien Volken
  • What's this about? http://www.slideshare.net/SlexAxton/breaking-the-cross-domain-barrier – eozzy Jul 29 '15 at 14:38
  • @3zzy I did not know about the Document.domain hacks (page 30), sounds interesting; the other alternatives are mostly for accessing services, not the whole page. BTW I added one more solution: going through YQL might be the answer. You do not have the flexibility of an iframe, and don't expect to render the page with all the CSS on the client either, but at least you have something runnable in the client's browser without having to set up your own relay. – Flavien Volken Jul 29 '15 at 22:02
  • YQL is still a proxy, though; it would use Yahoo's IP instead, and that can still potentially be blocked if I make too many requests. I'm looking for a robust solution. – eozzy Jul 30 '15 at 03:45
  • @3zzy Sure, but they are probably not using only one IP. No other ideas for now. – Flavien Volken Jul 30 '15 at 10:07
  • @3zzy There are official instructions on how to block YQL: https://developer.yahoo.com/yql/guide/limit_access_content_providers.html – thdoan Jul 26 '16 at 09:53
3

Have a look at http://import.io; they provide a couple of crawlers, connectors and extractors. I'm not quite sure how they get around bans, but they do somehow (we have been using their system for over a year now with no problems).

Jan
  • Do you use them for occasional scraping, or some hundreds of requests per minute? – eozzy Jul 29 '15 at 14:38
  • To be honest, we only use them for occasional scraping (mostly financial data, that is). This works flawlessly. Their website says they collect 10 million records/day, which comes down to 115 records/second. Not for one website, of course, but still, they are very reliable and free of cost (despite my promotion, I don't have any stocks ;-) – Jan Jul 29 '15 at 15:16
  • I really don't see how that answers the question, considering this runs on a server and would probably still be subject to IP bans and such. – Artjom B. Jul 30 '15 at 11:49
  • In my opinion, this answers the question insofar as the original purpose was to scrape information from another website. The service provides exactly this feature. – Jan Jul 30 '15 at 16:06
  • May I request you to [vote to undelete this question please](https://stackoverflow.com/a/58310335/548225) – anubhava Feb 26 '21 at 18:29
  • @anubhava: Glad to help. Any particular reason you're interested in reopening the question? Besides your rep? – Jan Feb 26 '21 at 21:30
  • You are probably aware of this [ongoing mega debate](https://meta.stackoverflow.com/questions/405460/what-to-do-when-one-person-tries-to-delete-every-duplicate) on dupes, closing and deletions. I am of the opinion that it is fine to mark a dupe, but removal should be reserved for those rare cases when questions are of total junk value. – anubhava Feb 26 '21 at 21:35
1

You could build a browser extension with artoo.

http://medialab.github.io/artoo/chrome/

That would allow you to get around the same-origin policy restrictions. It is all JavaScript, running on the client side.
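For instance, once artoo is injected into the target page, scraping could look roughly like this (a sketch; the selectors are placeholders):

```js
// Scrape one object per matched element, then download the result as JSON.
var items = artoo.scrape('.product-item', {
  title: { sel: '.title', method: 'text' },
  price: { sel: '.price', method: 'text' },
  url:   { sel: 'a', attr: 'href' }
});

artoo.savePrettyJson(items);
```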

  • Artoo is a nice tool for getting data when you're already on the website, but it won't allow you to scrape systematically, which I believe is what the OP was after. – thdoan Jul 26 '16 at 10:38