Any help will be appreciated.

I need to extract data from websites and found that node-unfluff does the job (see https://github.com/ageitgey/node-unfluff). There are two ways to call this module.

The first, from the command line, works. The second, from Node.js, doesn't.

extractor = require('unfluff');
data = extractor('test.html');
console.log(data);

Output : {"title":"","lang":null,"tags":[],"image":null,"videos":[],"text":""}

The returned object has only empty fields. It seems the module cannot read or doesn't recognise test.html.

The example in the docs passes "my html data". Is there a way to get the HTML data? Thanks.

loganfsmyth
Zizi

1 Answer

From the docs of unfluff:

extractor(html, language)

html: The html you want to parse

language (optional): The document's two-letter language code. This will be auto-detected as best as possible, but there might be cases where you want to override it.

You are passing a filename, and it expects the actual HTML of the file to be passed in.

If you are doing this in a scripting context, I'd recommend doing

var fs = require('fs');
var data = extractor(fs.readFileSync('test.html', 'utf8'));

However, if you are doing this in a server, or anywhere else that blocking would be an issue, you should do:

fs.readFile('test.html', 'utf8', function(err, html){
    if (err) throw err;
    var data = extractor(html);
    console.log(data);
});
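To make the filename-versus-content distinction concrete, here is a minimal sketch of a small wrapper. The name `extractFromFile` is hypothetical and not part of unfluff; it only assumes that the extractor is a function taking an HTML string, as the docs quoted above describe.

```javascript
const fs = require('fs');

// Hypothetical helper (not part of unfluff): reads the file at `path` as a
// UTF-8 string and passes that string, not the filename, to `extract`.
function extractFromFile(path, extract) {
  const html = fs.readFileSync(path, 'utf8'); // throws if the file cannot be read
  return extract(html);
}

// Usage, assuming unfluff is installed and test.html exists:
// const extractor = require('unfluff');
// console.log(extractFromFile('test.html', extractor));
```

Passing the extractor in as an argument keeps the file handling separate from the parsing, so the helper can be tried with any function that accepts an HTML string.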
loganfsmyth
  • Thanks so much for the reply. How do you pass in the actual HTML? When I read the docs and saw "html", I thought it meant the name of the file. Any help to get me started will be very much appreciated. – Zizi Mar 06 '15 at 22:46
  • @Jessi You can use the [`readFile`](https://nodejs.org/api/fs.html#fs_fs_readfile_filename_options_callback) method of the `fs` module for reading the file. http://stackoverflow.com/questions/10058814/get-data-from-fs-readfile – Ram Mar 06 '15 at 22:51
  • Thanks so much. I have been working on it since yesterday night. – Zizi Mar 06 '15 at 23:33