Web scraping/parsing in Node.js to detect the language of an HTML page?

Question

I am using the Readability Parser API and the node-readability module to do web scraping/parsing for a server built on Node.js. I can get a lot of information (title, links, date, content, length...) about articles published on publisher sites and blogs (my target), but not the language they are written in. Any idea how I could do this?

There is the Google Translate API, but it is not free, and I don't need any translation. There are also the Alchemy Language Detection API and the node-language-detect module, but they seem to detect the language from a given text, whereas in my case some information about the language may already be available in the HTML code of the page (see http://www.w3.org/TR/i18n-html-tech-lang/).

score 2 · Answer 1 · answered Apr 29 '14 at 21:39

You could make a request for the linked content and then get language from the HTTP response headers.

Some servers will respond with a Content-Language header.
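As a rough sketch of how that header could be read once you have it: a Content-Language value may list several comma-separated language tags, so it helps to split it. The helper name `parseContentLanguage` is made up for illustration, and lowercasing the tags is just one way to normalize them.

```javascript
// Hypothetical helper (not part of any library): turn a Content-Language
// header value such as "en-US, fr" into an array of language tags.
function parseContentLanguage(headerValue) {
  // A missing header shows up as undefined, so return an empty list.
  if (typeof headerValue !== "string") {
    return [];
  }
  return headerValue
    .split(",")
    .map(function (tag) {
      // Language tags are case-insensitive; lowercase them for comparison.
      return tag.trim().toLowerCase();
    })
    .filter(function (tag) {
      return tag.length > 0;
    });
}
```

In Node.js, response header names are lowercased, so the value would typically be read as something like `resp.headers["content-language"]` before being passed to this helper.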

Daniel

Thanks. I will make a request to a given URL. Should I then look for HTML lang attribute or for HTTP response headers? – GBC Apr 29 '14 at 22:02

score 2 · Accepted Answer · edited May 23 '17 at 12:11

While inferring the language of a web page can be difficult (Bonjour!), HTML is there to help. Look for the lang attribute:

<html lang="en-us">

It should be noted that any element can have said attribute. In the case of my opening sentence:

<p lang="en-us">While inferring the language of a web page can be difficult <span lang="fr">(Bonjour!)</span></p>

More info here: https://stackoverflow.com/a/7076990/1216976

Alternatively, you could check the Content-Language response header, but that's less specific: it describes the entire page rather than individual elements.
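A minimal sketch of combining the two sources, assuming you already have the page's HTML string and its response headers: try the lang attribute on the html element first, and fall back to Content-Language only if it is missing. The function name `detectLanguage` and the shape of its return value are made up for illustration. To stay dependency-free it uses a simple regular expression instead of a real HTML parser, so it may miss unusual markup.

```javascript
// Hypothetical helper: prefer <html lang="...">, fall back to the
// Content-Language response header. Assumes header names are lowercased.
function detectLanguage(html, headers) {
  // Look for a lang attribute inside the opening <html ...> tag only.
  // Handles double-quoted, single-quoted and unquoted attribute values.
  var pattern = /<html\b[^>]*?\slang\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/i;
  var match = pattern.exec(html || "");
  var fromAttr = match ? (match[1] || match[2] || match[3] || "").trim() : "";
  if (fromAttr) {
    return { language: fromAttr, source: "html-lang" };
  }

  // Fallback: first tag listed in the Content-Language header, if any.
  var header = headers && headers["content-language"];
  if (typeof header === "string") {
    var first = header.split(",")[0].trim();
    if (first) {
      return { language: first, source: "content-language" };
    }
  }

  // Neither source declared a language.
  return { language: null, source: null };
}
```

Preferring the attribute over the header is a judgment call here. Swap the order if the servers you scrape are more reliable than their markup.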

answered Apr 29 '14 at 21:39

SomeKittens

Thanks. I will make a request to a given URL. Should I then look for HTML lang attribute or for HTTP response headers? – GBC Apr 29 '14 at 22:00
Thanks. This code works: `var request = require("request"); var cheerio = require("cheerio"); var url = 'http://www.vox.com/cards/israel-palestine/peace-process-failure'; request(url, function(err, resp, html){ if (err) { return console.error(err); } var $ = cheerio.load(html); var langue = $("html").attr("lang"); console.log(langue); });` – GBC Apr 30 '14 at 14:55
