
I am trying to do web scraping and I would like to display the data in JSON format.

My task is to extract each post from the website and display its relevant data in JSON format. My issue is that I cannot seem to target the row (the tr element) and then target each id. I can hard-code the id in my code, but I would like the program to search for the ids and console.log the data for each id in the row. Example: I want to get the title for the first post by id.

I hope I am making sense. The website I am trying to extract data from is Hacker News (the URL is in the code below).

My code:

 var express = require('express');
 var path = require('path');
 var request = require('request');
 var cheerio = require('cheerio');
 var fs = require('fs');
 var app = express();
 var port = 8080;

 var url= "https://news.ycombinator.com/";

 request(url, function(err,resp,body){
 var $ = cheerio.load(body);

   var title = $('tr');

   var uri
   var author
   var points
   var comments
   var rank

   var posts = {
       postTitle : title,
       postUri : uri,
       postAuthor : author,
       postPoints : points,
       postComments : comments,
       postRank : rank
   }

   console.log(posts)

   })

   app.listen(port);
   console.log('server is listening on ' + port);
eyedfox
  • I didn't even know it was possible to use jQuery syntax with Node.js... but I think that to achieve what you want you need to install `jquery` using npm. [See this thread](http://stackoverflow.com/questions/1801160/can-i-use-jquery-with-node-js). If the `body` you mention is the same as the one in the image, then I think you can get what you want using the jQuery Node.js plugin – Elmer Dantas Feb 16 '17 at 12:41

1 Answer


The trick with Hacker News is that three tr elements together display one row. That's why each element of rows holds three consecutive tr elements. Inside rows.map, each item is one row, so you can access its attributes row by row.

const cheerio = require('cheerio');
const request = require('request');

const url = "https://news.ycombinator.com/";
request(url, function(err, resp, body) {
  if (err) return console.error(err); // stop early if the request failed
  const $ = cheerio.load(body);

  const tr = $('.itemlist > tr');
  // Each post spans three tr elements; the last two tr elements are the More button
  const rowCount = Math.floor((tr.length - 2) / 3);
  const rows = [];

  // Group the tr elements three at a time so that each entry is one post
  for (let i = 0; i < rowCount; ++i) {
    rows.push(tr.slice(3 * i, 3 * (i + 1)));
  }

  const res = rows.map(function(item) {
    return {
      postTitle: $(item).find('.storylink').text(),
      postUri: $(item).find('.storylink').attr('href'),
      postComments: $(item).find('a+ a').text(),
    };
  });

  console.log(res);
});

Which gives you:

[ { postTitle: 'CockroachDB beta-20161013',
    postUri: 'https://jepsen.io/analyses/cockroachdb-beta-20161013',
    postComments: '10 comments' },
  { postTitle: 'Attacking the Windows Nvidia  Driver',
    postUri: 'https://googleprojectzero.blogspot.com/2017/02/attacking-windows-nvidia-driver.html',
    postComments: '7 comments' },
  { postTitle: 'DuckDuckGo Donates $300K to Raise the Standard of Trust Online',
    postUri: 'https://spreadprivacy.com/2017-donations-d6e4e4230b88#.kazx95v27',
    postComments: '25 comments' },
... ]
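
Note that the listing above is how console.log prints JavaScript objects, which is not strict JSON (the keys are unquoted and the strings use single quotes). If actual JSON text is what you need, one option is JSON.stringify. This is only a minimal sketch: the `posts` array is a single sample entry copied from the output above and stands in for `res`.

```javascript
// Sample entry copied from the output above; it stands in for `res`.
const posts = [
  { postTitle: 'CockroachDB beta-20161013',
    postUri: 'https://jepsen.io/analyses/cockroachdb-beta-20161013',
    postComments: '10 comments' },
];

// JSON.stringify(value, replacer, space): a null replacer keeps every
// property, and 2 indents the result by two spaces for readability.
const json = JSON.stringify(posts, null, 2);
console.log(json);
```

The resulting string can be written to a file or sent as an HTTP response body as-is.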
Rentrop
  • How do I return the number of comments as an integer? – eyedfox Feb 20 '17 at 11:25
  • JSON format does not support types, so I assume you want `postComments: '25'` instead of `postComments: '25 comments'`. To do this, have a look at regex. You could do something like `const comment_pattern = new RegExp('^[0-9]+')` and then wrap the postComments line as `comment_pattern.exec('25 comments')[0]` – Rentrop Feb 20 '17 at 11:29
  • That would still return a string, right? So I won't be able to parse it to an int because it's not supported? – eyedfox Feb 20 '17 at 22:58
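
On the last question: the regex match is indeed a string, but it can be converted with parseInt, and JSON does have a number type, so the value is serialized without quotes. A minimal sketch follows. `parseCommentCount` is a hypothetical helper that is not part of the answer above, and falling back to 0 when no number is found is an assumption.

```javascript
// Hypothetical helper (not part of the answer): take the leading digits of a
// string such as '25 comments' and return them as a number.
function parseCommentCount(text) {
  const match = /^\d+/.exec(text.trim());
  // Text without a leading number falls back to 0 (an assumption).
  return match ? parseInt(match[0], 10) : 0;
}

const post = { postComments: parseCommentCount('25 comments') };
console.log(typeof post.postComments); // 'number'
console.log(JSON.stringify(post));     // {"postComments":25}
```

In the answer's code, this would mean wrapping the postComments line in a call to the helper.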