I am trying to write a web scraper for an NYC database of buildings, and I need to get the HTML of the actual website. For whatever reason, when I use the URL of the site I am trying to scrape, my program does nothing. When I use the URL of almost any other website, I get the HTML I requested. Is this because I am trying to scrape a government site?

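var request = require("request");

request(
    { uri: "http://a810-bisweb.nyc.gov/bisweb/JobsQueryByNumberServlet?requestid=3&passjobnumber=123768556&passdocnumber=01" },
    function(error, response, body) {
        // Report connection-level failures instead of silently logging undefined.
        if (error) {
            console.error(error);
            return;
        }
        // The status code shows whether the server refused the request (e.g. 403).
        console.log(response.statusCode);
        console.log(body);
        console.log("hello");
    }
);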

I expected to receive the HTML as a string printed in my console. Instead, I get nothing; even the "hello" is not printed. However, when I try any other site, I get the actual HTML string.

Mike Poole

2 Answers

The URL you are trying to fetch returns an access denied response (HTTP 403).

I prefer request's event-based (streaming) API, so the following code

var request = require("request");
request
  .get("http://a810-bisweb.nyc.gov/bisweb/JobsQueryByNumberServlet?requestid=3&passjobnumber=123768556&passdocnumber=01")
  .on('response', function(response) {
    console.log('Hello');
    console.log(response.statusCode);
    console.log(response.headers['content-type']);
  })
  .on('error', function(error){
    console.log(error);
  });

will print out

Hello
403
text/html

My guess is that you get the 403 because the site sets cookies or keeps some session state, and you are going directly to the URL you want instead of hitting the front page first. I also get the 403 in the browser if I go directly to the URL, but if I go to the front page first and then to the URL, I get the page.
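
The "front page first, then the target URL" idea can be sketched roughly as follows. This is only a minimal illustration using Node's built-in http module rather than request, and it assumes the site's session state lives in cookies set by the front page, which is a guess and not something confirmed above. The helper names (get, cookieHeaderFrom, fetchWithSession) are made up for this example. It does not follow redirects and only handles plain http URLs.

```javascript
const http = require("http");

// Perform one GET and resolve with the status code, headers, and body text.
function get(url, headers = {}) {
  return new Promise((resolve, reject) => {
    const req = http.get(url, { headers }, (res) => {
      let body = "";
      res.setEncoding("utf8");
      res.on("data", (chunk) => { body += chunk; });
      res.on("end", () => {
        resolve({ statusCode: res.statusCode, headers: res.headers, body });
      });
    });
    req.on("error", reject);
  });
}

// Turn Set-Cookie response headers into a Cookie request header value.
// Only the name=value pair is kept; attributes such as Path are dropped.
function cookieHeaderFrom(setCookie = []) {
  return setCookie.map((c) => c.split(";")[0]).join("; ");
}

// Visit the front page first, then request the target with the cookies it set.
async function fetchWithSession(frontPageUrl, targetUrl) {
  const front = await get(frontPageUrl);
  const cookie = cookieHeaderFrom(front.headers["set-cookie"]);
  const headers = cookie ? { Cookie: cookie } : {};
  return get(targetUrl, headers);
}
```

Calling fetchWithSession("http://a810-bisweb.nyc.gov/", targetUrl) and logging the resolved statusCode would show whether cookies alone are enough for this site. With request itself, passing the jar: true option should give similar cookie handling.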

user254694
  • Thanks so much for the help. Is there any way to bypass this in Node? Can I simulate this session state? – Omar Elhosseni Jul 17 '19 at 07:38
  • You can see some examples here: https://stackoverflow.com/questions/19936705/how-to-maintain-a-request-session-in-nodejs (that question is about POST, but it still applies). Also play around with it: see if you can do it just by setting the request headers to say you came from the front page of the site. This shows how to set headers with request: https://github.com/request/request#custom-http-headers so to set the referer, use headers['Referer'] = "http://a810-bisweb.nyc.gov/" – user254694 Jul 17 '19 at 07:51
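
The Referer suggestion in the comment above could look something like this with request's options object. It is a sketch only: buildRequestOptions is a hypothetical helper name, and whether a Referer header (with or without cookies) is enough to get past the 403 on this site is untested.

```javascript
// Build an options object for request(options, callback).
// buildRequestOptions is a hypothetical helper, not part of the request library.
function buildRequestOptions(targetUrl, frontPageUrl) {
  return {
    uri: targetUrl,
    // Claim that we arrived from the front page of the site.
    headers: { Referer: frontPageUrl },
    // Ask request to remember cookies between calls.
    jar: true,
  };
}

// Usage (requires the request package to be installed):
// var request = require("request");
// var options = buildRequestOptions(
//   "http://a810-bisweb.nyc.gov/bisweb/JobsQueryByNumberServlet?requestid=3&passjobnumber=123768556&passdocnumber=01",
//   "http://a810-bisweb.nyc.gov/"
// );
// request(options, function (error, response, body) {
//   if (error) { return console.error(error); }
//   console.log(response.statusCode);
// });
```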
For anyone wondering, I was able to work around the restrictions the site set up by using Tampermonkey. I only needed to access the DOM anyway, so Tampermonkey let me run a script as soon as I entered the site.
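
For illustration, a userscript along these lines might look like the sketch below. The script name, the @match pattern, and the choice to read every table cell are all assumptions made for this example; the actual script was not posted.

```javascript
// ==UserScript==
// @name         BIS job page reader (hypothetical example)
// @match        http://a810-bisweb.nyc.gov/bisweb/*
// @run-at       document-end
// @grant        none
// ==/UserScript==

// Collect the trimmed, non-empty text of every table cell in the given document.
function collectCellText(doc) {
  return Array.from(doc.querySelectorAll("td"))
    .map((cell) => cell.textContent.trim())
    .filter((text) => text.length > 0);
}

// Only run automatically in a browser, where `document` exists.
if (typeof document !== "undefined") {
  console.log(collectCellText(document));
}
```

Because the script runs inside a page the browser has already loaded, it reuses the browser's own session and never hits the 403.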