
I have to obtain a JSON object that is embedded inside a script tag on a certain page, so I can't use regular scraping techniques like cheerio. The easy way out: write the file (download the page) to the server, then read it and use string manipulation to extract the JSON (there are several), work on them, and save them to my db happily.

The thing is that I'm too new to Node.js and can't get the code to work. I think I'm trying to read the file before it is fully written, and if I read it too early I get [object Object]...

Here's what I have so far...

var http = require('http');

var fs = require('fs');
var request = require('request');

var localFile = 'tmp/scraped_site_.html';
var url = "siteToBeScraped.com/?searchTerm=foobar";

// writing
var file = fs.createWriteStream(localFile);

var request = http.get(url, function(response) {
    response.pipe(file);
});

//reading
var readedInfo = fs.readFileSync(localFile, function (err, content) {
    callback(url, localFile);
    console.log("READING: " + localFile);
    console.log(err);
});
Marcos
  • Not enough time to provide a formal and tested answer, but you're right. Listen for file.on('finish', function() {}); see http://nodejs.org/api/stream.html#stream_event_finish – Jay Feb 10 '14 at 22:50

3 Answers


So first of all I think you should understand what went wrong.

The http request operation is asynchronous. This means the callback code you pass to http.get() will run at some point in the future. fs.readFileSync, on the other hand, is synchronous: it executes and completes before the http request is even sent, since both calls are made in what is commonly known as the same tick of the event loop. Also, fs.readFileSync returns a value and does not take a callback.
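
For instance, here is a minimal sketch of that ordering (example.com is just a stand-in URL):

var http = require('http');

http.get('http://example.com/', function (response) {
    // this callback runs on a later tick, once the response arrives
    console.log('second');
});

// this line runs immediately, on the same tick as the http.get() call
console.log('first');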

Even if you replace fs.readFileSync with fs.readFile, the code still might not work properly, since the readFile operation might execute before the http response has been fully read from the socket and written to disk.

I strongly suggest reading this Stack Overflow question and/or Understanding the node.js event loop.

The correct place to invoke the file read is when the response stream has finished writing to the file, which would look something like this:

var request = http.get(url, function(response) {
    response.pipe(file);
    file.once('finish', function () {
        // the file has been fully flushed to disk at this point
        fs.readFile(localFile, 'utf8', function (err, data) {
            // assuming the page is utf8-encoded;
            // do something with the data if there is no error
        });
    });
});

Of course this is a very raw and not recommended way to write asynchronous code but that is another discussion altogether.

Having said that, if you download a file, write it to disk and then read it all back into memory for manipulation, you may as well forgo the file part and just read the response into a string right away. Your code would then look something like this (this can be implemented in several ways):

var request = http.get(url, function(response) {
    var data = '';

    // without this, chunks arrive as Buffers; setting the encoding
    // gives us strings and avoids splitting multi-byte characters
    response.setEncoding('utf8');

    function read() {
        var chunk;
        while ( chunk = response.read() ) {
            data += chunk;
        }
    }

    response.on('readable', read);

    response.on('end', function () {
        console.log('[%s]', data);
    });
});

What you really should do IMO is to create a transform stream that will extract just the data you need from the response, while not consuming too much memory and yielding this more elegant-looking code:

var request = http.get(url, function(response) {
    response.pipe(yourTransformStream).pipe(file)
});

Implementing this transform stream, however, might prove slightly more complex. So if you're a node beginner and you don't plan on downloading big files or lots of small files, then maybe loading the whole thing into memory and doing string manipulation on it is simpler.
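
For illustration, here is a rough, untested sketch of what such a transform stream could look like. The regex and the idea of buffering the whole body until flush are assumptions made for simplicity; a truly streaming extractor that keeps memory usage low would need to scan across chunk boundaries instead:

var stream = require('stream');
var util = require('util');

function ScriptJsonExtractor() {
    stream.Transform.call(this);
    this.body = '';
}
util.inherits(ScriptJsonExtractor, stream.Transform);

// collect the incoming chunks; nothing is pushed downstream yet
ScriptJsonExtractor.prototype._transform = function (chunk, encoding, done) {
    this.body += chunk.toString();
    done();
};

// once the response has ended, pull the script tag contents out
ScriptJsonExtractor.prototype._flush = function (done) {
    // hypothetical pattern; adjust it to the actual markup of your page
    var match = this.body.match(/<script[^>]*>\s*([\s\S]*?)\s*<\/script>/);
    if (match) {
        this.push(match[1]);
    }
    done();
};

var yourTransformStream = new ScriptJsonExtractor();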

For further information about transform streams, see the Node.js stream documentation: http://nodejs.org/api/stream.html

Lastly, see if you can use any of the million node.js crawlers already out there :-) Take a look at the relevant search results on npm.

Yaniv Kessler
  • Thanks, I've spent the last 2 days reading what you pointed me to, and I think I'm going the wrong way here... – Marcos Feb 13 '14 at 14:17

According to the http module documentation, 'get' does not return the response body.

This is modified from the request example on the same page.

What you need to do is process the response within the callback (function) passed into http.request, so it runs when the response is ready (async).

var http = require('http')
var fs = require('fs')

var localFile = 'tmp/scraped_site_.html'
var file = fs.createWriteStream(localFile)

var req = http.request('http://www.google.com.au', function(res) {
  res.pipe(file)

  // pipe() ends the write stream for us; wait for its 'finish'
  // event so the data is flushed to disk before we read it back
  file.on('finish', function(){
    fs.readFile(localFile, function(err, buf){
      if (err) return console.log(err)
      console.log(buf.toString())
    })
  })
})

req.on('error', function(e) {
  console.log('problem with request: ' + e.message)
})

req.end();

EDIT I updated the example to read the file after it is created. This works by waiting for the write stream's 'finish' event, which fires once everything piped from the response has been flushed to disk, and then reopening the file for reading. Alternatively you can use

 res.on('data', function(chunk){...})

to process the data as it arrives without putting it into a temporary file
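
For example, a minimal sketch of that approach (again untested, using the same placeholder URL):

var http = require('http')

var req = http.request('http://www.google.com.au', function(res) {
  res.setEncoding('utf8')
  res.on('data', function(chunk){
    // each chunk is handled as soon as it is read from the socket
    console.log('got %d characters', chunk.length)
  })
  res.on('end', function(){
    console.log('response complete')
  })
})

req.on('error', function(e) {
  console.log('problem with request: ' + e.message)
})

req.end()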

KeepCalmAndCarryOn

My impression is that you're trying to extract a JSON-serialized object from a stream that's downloading an HTML file. This is doable, yet hard. It's difficult to know when your search expression has been found, because if you parse the chunks as they come in you may only ever see part of the surrounding context, and what you're looking for could be split across two or more chunks that are never analyzed as a whole.

You could try something like this:

var req = http.request('u/r/l', function(res){
   res.on('data', function(data){
      // parse data as it comes in
   });
});
req.end();

This allows you to read the data as it comes in. You can save it to disk or a db as it arrives, or even parse it, if you accumulate the contents of the script tags into a single string and then parse the objects out of that.
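
A minimal sketch of that accumulate-then-parse idea (the URL, the regex, and the assumption that the script tag holds bare JSON are all illustrative; adapt them to the real page):

var http = require('http');

var req = http.request('http://example.com/?searchTerm=foobar', function(res){
   var body = '';
   res.setEncoding('utf8');
   res.on('data', function(chunk){
      body += chunk; // accumulate until the whole document is in memory
   });
   res.on('end', function(){
      // searching the complete body means the match can never be
      // split across chunk boundaries
      var match = body.match(/<script[^>]*>\s*([\s\S]*?)\s*<\/script>/);
      if (match) {
         console.log(JSON.parse(match[1]));
      }
   });
});
req.end();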

tsturzl