10

I am building a Chrome extension which at some point should determine the current page's language. To do that, my plan is to extract the text content of the page (or at least part of it) and pass it to a translation API. However, I couldn't find any straightforward way to get all the text nodes of the document.

There is a backup plan which is to recursively analyze $('body').contents() until there is enough text content, but it feels a bit flaky. Perhaps there is a better way?


Note: the Chrome extensions API allows your script to access the user page's DOM as if it were part of the page.

artemave
  • Is there a way you could use Python executables in Chrome extension development? If so, you can use `SGMLParser` from the `sgmllib` module to achieve that. Not very sure how to do this using JS. – Shiv Deepak Nov 20 '10 at 15:35
  • What do you do with the complete HTML of the page? – kobe Nov 20 '10 at 16:24

6 Answers

31

JavaScript:

document.body.textContent
mortalis
  • For me, in 2021 on Chrome, this gets a lot more than just the text. A quick test on Wikipedia, for example, extracts a lot of CSS and code in addition to the text on the page. `document.body.innerText`, however, works cleanly. – Josh Desmond Mar 15 '21 at 21:30
  • Here is some info about `innerText`, `textContent` and the differences: [HTMLElement.innerText](https://developer.mozilla.org/en-US/docs/Web/API/HTMLElement/innerText), [textContent and innerText differences](https://developer.mozilla.org/en-US/docs/Web/API/Node/textContent#differences_from_innertext) – mortalis Mar 16 '21 at 19:24
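
To illustrate the difference discussed in these comments, here is a minimal sketch of a text-node walk that skips `<script>` and `<style>` content and stops once enough text has been gathered for language detection. The function name `collectVisibleText`, the `minChars` parameter, and the list of skipped tags are assumptions made for this example, not part of any browser API.

```javascript
// Node type constants from the DOM (Node.ELEMENT_NODE, Node.TEXT_NODE).
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Elements whose text children are code or styling rather than readable content.
// This list is an assumption for the sketch; adjust it as needed.
const SKIPPED_TAGS = new Set(["SCRIPT", "STYLE", "NOSCRIPT", "TEMPLATE"]);

// Hypothetical helper: walk the tree under `root` depth-first, in document
// order, and collect trimmed text-node content until at least `minChars`
// characters have been gathered.
function collectVisibleText(root, minChars = 500) {
  const parts = [];
  let length = 0;
  const stack = [root];

  while (stack.length > 0 && length < minChars) {
    const node = stack.pop();
    if (node.nodeType === TEXT_NODE) {
      const text = node.nodeValue.trim();
      if (text) {
        parts.push(text);
        length += text.length;
      }
    } else if (
      node.nodeType === ELEMENT_NODE &&
      !SKIPPED_TAGS.has(node.tagName.toUpperCase())
    ) {
      // Push children in reverse so they are popped in document order.
      const children = node.childNodes;
      for (let i = children.length - 1; i >= 0; i--) {
        stack.push(children[i]);
      }
    }
  }
  return parts.join(" ");
}
```

In a content script this would be called as `collectVisibleText(document.body)`. Unlike `innerText`, skipping by tag name does not account for elements hidden with CSS, so `innerText` is still the simpler choice when it is available.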
18

Without jQuery, just as easy: document.body.innerText;

pawel
  • innerText for IE only, document.body.textContent otherwise – kennebec Nov 20 '10 at 16:58
  • According to PPK, both are more or less cross-browser (innerText being absent in Firefox, textContent in IE) http://www.quirksmode.org/dom/w3c_html.html – pawel Nov 20 '10 at 17:42
  • They're different though: http://stackoverflow.com/questions/1359469/innertext-works-in-ie-but-not-in-firefox/1359822#1359822 – Tim Down Nov 21 '10 at 16:36
  • innerText is now implemented in all browsers. It works great and I'd recommend it. See https://caniuse.com/innertext. Many tutorials are still outdated and mention its lack of compatibility, but, no longer! – Josh Desmond Mar 21 '21 at 02:20
7

Using the jQuery text() method

$('body').text()
John Hartsock
1

VanillaJS:

document.body.outerHTML
guerrerocarlos
0

You can use chrome.tabs.detectLanguage(integer tabId, function callback).
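
As a rough sketch of how this call might be used: the wrapper name `detectTabLanguage` and its `tabsApi` parameter are hypothetical, introduced only so the function can be run outside the browser with a stub. In an extension, `tabsApi` would be `chrome.tabs`.

```javascript
// Hypothetical wrapper around chrome.tabs.detectLanguage(tabId, callback).
// `tabsApi` is expected to be `chrome.tabs`; it is passed in as a parameter
// only so the function can be exercised with a stub.
function detectTabLanguage(tabsApi, tabId, onResult) {
  tabsApi.detectLanguage(tabId, (language) => {
    // Chrome reports "und" when the language could not be determined;
    // this sketch normalizes that (and an empty result) to null.
    onResult(language && language !== "und" ? language : null);
  });
}
```

A call would look like `detectTabLanguage(chrome.tabs, tabId, (lang) => console.log(lang))`. Note that `chrome.tabs` is not exposed to content scripts, so this would run in the extension's background page or popup rather than in the page itself.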

Vitaly Zdanevich
0

All of these methods return undefined when attempted in the Chrome console.

var text = document.body.textContent;
var text = document.body.outerHTML;
var text = document.body.innerText;

etc ...

DennisWPaulsenJR
  • The statement itself, `var text = document.body.innerText;`, will return undefined, just as the statement `var i = 5;` will return undefined. Simply type `document.body.innerText` in the console and you will see the output. – Josh Desmond Mar 15 '21 at 21:33