I am trying to scrape some content with the Node.js x-ray scraping framework. I can get the content from a single page, but I can't work out how to follow links and get content from a subpage in one go.

There is a sample in the x-ray GitHub repository, but it returns empty data if I change the code to target some other site.

I have simplified my code and made it crawl Stack Overflow questions for this sample.

The following works fine:

var Xray = require('x-ray');
var x = Xray();

x('http://stackoverflow.com/questions/9202531/minimizing-nexpectation-for-a-custom-distribution-in-mathematica', '#content', [{

  title: '#question-header h1',
  question: '.question .post-text'

}])
(function(err, obj) {

  console.log(err);
  console.log(obj);

})

This also works:

var Xray = require('x-ray');
var x = Xray();

x('http://stackoverflow.com/questions', '#questions .question-summary .summary', [{

  title: 'h3',
  question: x('h3 a@href', '#content .question .post-text'),

}])
(function(err, obj) {

  console.log(err);
  console.log(obj);

})

But this gives me an empty `details` result, and I can't figure out what is wrong:

var Xray = require('x-ray');
var x = Xray();

x('http://stackoverflow.com/questions', '#questions .question-summary .summary', [{

  title: 'h3',
  link: 'h3 a@href',
  details: x('h3 a@href', '#content', [{
    title: 'h1',
    question: '.question .post-text',
  }])

}])
(function(err, obj) {

  console.log(err);
  console.log(obj);

})

I would like my spider to crawl the page with listed questions and then follow the link to each question and retrieve additional information.

Ales Maticic
  • Some answers related to this issue can be found here: [x-ray scraping secondary URLs related question](https://stackoverflow.com/questions/39609440/node-x-ray-crawling-data-from-collection-of-url/39632464) – sylvery Oct 20 '17 at 02:54

2 Answers

With some help I figured out what the problem was. I am posting this answer in case somebody else has the same problem.

Working example:

var Xray = require('x-ray');
var x = Xray();

x('http://stackoverflow.com/questions', '#questions .question-summary .summary', [{

  title: 'h3',
  link: 'h3 a@href',
  details: x('h3 a@href', {
    title: 'h1',
    question: '.question .post-text',
  })

}])
(function(err, obj) {

  console.log(err);
  console.log(obj);

})
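
To check whether the nested crawl actually returned data for every item, a small helper along these lines can flag the affected entries. This is only a sketch: `findMissingDetails` is a hypothetical name, not part of x-ray, and it assumes the result is an array of objects shaped like the output of the example above. The sample data is made up for illustration.

```javascript
// Hypothetical helper, not part of x-ray: returns the items whose nested
// `details` object is missing or has no non-empty fields.
function findMissingDetails(results) {
  if (!Array.isArray(results)) {
    return [];
  }
  return results.filter(function (item) {
    var details = item && item.details;
    if (!details || typeof details !== 'object') {
      return true;
    }
    // Treat details as empty when every field is undefined, null or ''.
    return Object.keys(details).every(function (key) {
      var value = details[key];
      return value === undefined || value === null || value === '';
    });
  });
}

// Made-up data in the shape the scraper above is assumed to return.
var sample = [
  { title: 'A', link: 'http://example.com/a', details: { title: 'A', question: 'text' } },
  { title: 'B', link: 'http://example.com/b' },
  { title: 'C', link: 'http://example.com/c', details: {} }
];

// Log the titles of the items that came back without usable details.
console.log(findMissingDetails(sample).map(function (item) {
  return item.title;
}));
```

Calling `findMissingDetails(obj)` inside the x-ray callback would show which questions came back without their subpage content.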
Ales Maticic
  • Hmm, are you sure you pasted the correct code? I still can't get mine to work. I'm doing exactly the same thing and get exactly the 3 symptoms you've described in the question. – krivar Oct 18 '15 at 11:46
  • can you provide the kind of result you get from this query? – krivar Oct 18 '15 at 11:49
  • looks like this is an issue for other people. So I'm wondering how you got this thing working? https://github.com/lapwinglabs/x-ray/issues/65 – krivar Oct 18 '15 at 12:01
  • does it work if you just copy paste the code above? I will try it again as soon as I get home and will get to you with the results. – Ales Maticic Oct 19 '15 at 09:33
  • I tried running your code but the `details` property is not being returned. Is it still working for you? Do you know if this nested crawling behaviour has been deprecated? – Laggel Jan 09 '16 at 15:28
  • Did you run the code I provided above or something else? I just tried it and it works fine for me. Did you get any error? – Ales Maticic Jan 10 '16 at 21:22
  • this code still works for me, are you using the same code and trying to scrape SO? – Ales Maticic Mar 17 '16 at 16:06
  • Not working for me either, I copy pasted your answer but I'm getting `undefined` as the output. – Prashanth Chandra May 07 '16 at 14:44
  • Not working here for whatever reason. (x-ray 2.3.0) – Vaughan Hilts Jun 04 '16 at 16:31

Version 2.0.2 does work. There is an open issue on GitHub to follow here: https://github.com/lapwinglabs/x-ray/issues/189
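
If the nested form keeps returning empty `details` on a given x-ray version, one possible workaround is to split the crawl into two explicit steps: fetch the list first, then request each linked page separately. The following is a rough sketch, not a confirmed fix. `crawlTwoStep` is a hypothetical helper name, the x-ray instance is passed in as `x`, and the only x-ray calls used are the call shapes already shown in the examples above. It has not been verified against any particular x-ray release.

```javascript
// Sketch: crawl in two explicit steps instead of nesting x() calls.
// `x` is an x-ray instance supplied by the caller; `crawlTwoStep` is a
// hypothetical helper, not part of x-ray.
function crawlTwoStep(x, listUrl, done) {
  // Step 1: collect title and link for every question on the listing page.
  x(listUrl, '#questions .question-summary .summary', [{
    title: 'h3',
    link: 'h3 a@href'
  }])(function (err, items) {
    if (err) {
      return done(err);
    }
    items = items || [];
    var pending = items.length;
    var failed = false;
    if (pending === 0) {
      return done(null, items);
    }

    // Step 2: request each question page and attach the extracted fields.
    items.forEach(function (item) {
      x(item.link, {
        title: 'h1',
        question: '.question .post-text'
      })(function (err, details) {
        if (failed) {
          return;
        }
        if (err) {
          failed = true;
          return done(err);
        }
        item.details = details;
        pending -= 1;
        if (pending === 0) {
          done(null, items);
        }
      });
    });
  });
}

// Assumed usage with the real library (needs x-ray installed and network access):
// var Xray = require('x-ray');
// crawlTwoStep(Xray(), 'http://stackoverflow.com/questions', function (err, items) {
//   console.log(err);
//   console.log(items);
// });
```

This fires all subpage requests at once, so a longer list may need some form of throttling.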