7

I've been asked to scrape a site which receives data via websockets and then renders that to the page via javascript/jquery. Is it possible to bypass the middleman (the DOM) and consume/scrape the data coming over the socket? Might this be possible with a headless webkit like phantomJS? The target site is using socket.io.

I need to consume the data and trigger alerts based on keywords in the data. I'm considering the Goutte library and will be building the scraper in PHP.

codecowboy
  • What kind of technology do you have available? Do you have root access to the machine or are you using shared hosting? Goutte won't help you since it only scrapes 'non-socket' content. – Herman Nov 12 '13 at 08:53
  • I'll be picking the hosting. Was thinking of using openshift as a dev server but could also use an Ubuntu image on EC2. I was going to use goutte to log in and then scrape details of the socket connection before actually using it. My fallback is just to watch for Dom events but it will obviously be much less efficient. – codecowboy Nov 12 '13 at 22:09
  • can you post the site URL and the description of data you want to scrape? – Tomas Jan 06 '14 at 21:18
  • I can't I'm afraid, no. – codecowboy Jan 06 '14 at 21:40
  • @Tomas this is the actual problem I am having, but I was not able to start a bounty on this question: http://stackoverflow.com/questions/20949884/why-dont-i-see-a-response-from-socket-io-client-with-node-js – codecowboy Jan 06 '14 at 21:56

2 Answers

6

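Socket.io is not exactly the same as WebSockets: it runs its own protocol on top of transports such as WebSocket or long-polling, so a plain WebSocket client will not understand its messages without extra work. Since you know the site uses socket.io, I'm focusing on that. The easiest way to scrape this socket is to use the socket.io client.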
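To illustrate why reading the raw socket is not enough, here is a minimal sketch of decoding a single frame, assuming the target runs socket.io 0.9, whose wire format is `type:id:endpoint:data`. `parseFrame` and `FRAME_TYPES` are hypothetical names made up for this example, not part of socket.io-client, and newer socket.io versions frame their messages differently.

```javascript
// Sketch only: decodes one Socket.IO 0.9 frame of the form "type:id:endpoint:data".
// parseFrame and FRAME_TYPES are hypothetical names, not part of any library.
var FRAME_TYPES = ['disconnect', 'connect', 'heartbeat', 'message',
                   'json', 'event', 'ack', 'error', 'noop'];

function parseFrame(raw) {
  // The data part is optional and may itself contain colons, so it is matched last.
  var m = /^(\d):(\d*\+?):([^:]*)(?::([\s\S]*))?$/.exec(raw);
  if (!m || !FRAME_TYPES[Number(m[1])]) return null;

  var frame = {
    type: FRAME_TYPES[Number(m[1])],
    id: m[2],
    endpoint: m[3],
    data: m[4] // undefined when the frame carries no data
  };

  // Event frames carry JSON shaped like {"name": "...", "args": [...]}.
  if (frame.type === 'event' && frame.data) {
    var payload = JSON.parse(frame.data);
    frame.name = payload.name;
    frame.args = payload.args || [];
  }
  return frame;
}
```

In practice the socket.io client does this decoding (plus the handshake and heartbeats) for you, which is why it is the easier route.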

Put this on your page:

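<!-- Save a local copy of dist/socket.io.js from https://github.com/LearnBoost/socket.io-client/blob/0.9/dist/socket.io.js; the GitHub "blob" URL itself returns an HTML page, not the script -->
<script src="socket.io.js"></script>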
<script src="scraper.js"></script>

Create file scraper.js:

// No 'g' flag: a global regex keeps lastIndex between test() calls and would skip matches
var keywords = /foo|bar/i;
var socket = io.connect('http://host-to-scrape:portnumber/path');
socket.on('<socket.io-eventname>', function (data) {
  // The scraped data is in 'data'; do whatever you want with it
  console.log(data);

  // Assuming data.body is a string that may contain keywords;
  // callOtherFunction is a placeholder for your own alert handler
  if (keywords.test(data.body)) callOtherFunction(data.body);

  // Talk back:
  // socket.emit('eventname', { my: 'data' });
});
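For the alerting part, something along these lines might work. It is only a sketch: `findKeywords` and `makeAlertHandler` are hypothetical helpers, and it assumes, as above, that each message has a string `body` field. It uses plain substring checks, which are stateless, and it reports which keywords matched so the alert can say why it fired.

```javascript
// Hypothetical helper: returns the keywords (case-insensitive) found in a message body.
function findKeywords(body, keywords) {
  var text = String(body == null ? '' : body).toLowerCase();
  return keywords.filter(function (keyword) {
    return text.indexOf(keyword.toLowerCase()) !== -1;
  });
}

// Hypothetical factory: builds an event handler that calls onAlert(hits, data)
// whenever at least one keyword appears in data.body.
function makeAlertHandler(keywords, onAlert) {
  return function (data) {
    var hits = findKeywords(data && data.body, keywords);
    if (hits.length > 0) onAlert(hits, data);
    return hits;
  };
}

// Usage (needs a connected socket, so it is left commented out):
// socket.on('<socket.io-eventname>', makeAlertHandler(['foo', 'bar'], function (hits, data) {
//   console.log('ALERT, matched:', hits.join(', '), data);
// }));
```

Keeping the matching separate from the socket code also means you can test it without connecting to the target site.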

UPDATE 6-1-2014

Looking at the StackOverflow question you referenced in the comments, it seems you're trying to run this in a browser window rather than on the server, so I removed everything about NodeJS, as it is not needed.

Herman
  • Have you tried this on your scrape-source? Did it work? There can be some handshaking on this socket, if so, it might not work without you emulating that. – Herman Nov 16 '13 at 10:30
  • no I have not tried this yet. I have never really used node so need to think about whether I am willing to invest time in learning that for a small project. – codecowboy Nov 16 '13 at 11:15
  • Don't be afraid of Node. It is just JavaScript with a little bit more power. Node is ideal for realtime stuff, much more than PHP is. You'll learn Node within a day. – Herman Nov 16 '13 at 13:40
  • Not sure I agree the whole of Node can be learned in a day but your answer has given me a useful starting point. Thanks! – codecowboy Nov 18 '13 at 10:47
  • would be grateful if you could take a look at http://stackoverflow.com/questions/20937627/how-can-i-wait-for-a-socket-io-connection-to-return-data-from-casperjs if you have any insight. – codecowboy Jan 05 '14 at 23:02
  • I ended up doing something similar but with casperJS. I would advise anyone to avoid using socket.io-client (the node module) as it didn't work for me and nobody involved in the project replied on the mailing list or via github. – codecowboy Jan 12 '14 at 19:03
  • I wish I could give you more than 150 points for this answer :D Thanks – Deval Khandelwal Feb 27 '18 at 11:24
-2

In my opinion, this would be the best way for you:

Scrape the data directly from a client page of your app using JavaScript, without PHP as a middle layer. That way your server carries no load at all, which is why I recommend it. Since your target site uses socket.io, use a socket.io client to scrape the data. From the official socket.io site:

    <script src="/socket.io/socket.io.js"></script>
    <script>
      var socket = io.connect('http://target_website.com');
      // look closely at the next line: 'event_name' must match an event the target site emits
      socket.on('event_name', function (data) {
        console.log(data);
        //do something with data here
      });
    </script>

The question then arises: how will you know *event_name*? You have to find it by studying the target site's JavaScript. I do not know of any workaround.
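One way to speed up that research is to scan the site's JavaScript source for `.on('...')` calls. The snippet below is a rough sketch, and `findEventNames` is a hypothetical helper. It will also pick up unrelated handlers such as jQuery's `.on('click', ...)`, and it will miss names built dynamically or mangled by minification, so treat the result as a list of candidates to check by hand.

```javascript
// Hypothetical helper: collects the distinct string literals passed as the
// first argument to .on(...) calls found in a piece of JavaScript source text.
function findEventNames(source) {
  var pattern = /\.on\(\s*(['"])([^'"]+)\1/g;
  var names = [];
  var match;
  while ((match = pattern.exec(source)) !== null) {
    if (names.indexOf(match[2]) === -1) names.push(match[2]);
  }
  return names;
}

// Usage: paste the text of the target site's script into a string and inspect the result.
var sample = "socket.on('news', render); socket.on(\"price\", update); $('a').on('click', go);";
console.log(findEventNames(sample));
```

Names that look like application events rather than DOM events (here `news` and `price`) are the ones worth trying with `socket.on`.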

MD. Sahib Bin Mahboob