How to get text content of the entire document?

Question

I am building a Chrome extension that at some point needs to determine the language of the current page. My plan is to extract the text content of the page (or at least part of it) and pass it to a translation API. However, I couldn't find any straightforward way to get all the text nodes of the document.

There is a backup plan, which is to recursively walk $('body').contents() until there is enough text content, but it feels a bit flaky. Perhaps there is a better way?

Note: the Chrome extensions API lets your script access the page's DOM as if the script were part of the page.

Is there a way to use Python executables in Chrome extension development? If so, you can use `SGMLParser` from the `sgmllib` module to achieve that. I'm not sure how to do this in JS. – Shiv Deepak, Nov 20 '10 at 15:35

score 31 · Answer 1 · answered Nov 03 '13 at 09:12

JavaScript:

document.body.textContent
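
Note that `textContent` also returns the contents of `<script>` and `<style>` elements. If that is a problem, a rough sketch of a walker that skips them and stops once enough text has been collected could look like this. The names `collectText` and `SKIPPED_TAGS` are made up for illustration, and the function only relies on `nodeType`, `nodeName`, `nodeValue` and `childNodes`:

```javascript
// Hypothetical helper, not a built-in: elements whose text is code, not prose.
const SKIPPED_TAGS = new Set(['SCRIPT', 'STYLE', 'NOSCRIPT']);

// Walks the tree under `root` in document order and returns up to
// `maxChars` characters of text, ignoring the skipped elements.
function collectText(root, maxChars = Infinity) {
  const parts = [];
  let length = 0;
  const stack = [root];
  while (stack.length > 0 && length < maxChars) {
    const node = stack.pop();
    if (node.nodeType === 3) { // 3 === Node.TEXT_NODE
      const text = node.nodeValue.trim();
      if (text) {
        parts.push(text);
        length += text.length;
      }
    } else if (node.nodeType === 1 && !SKIPPED_TAGS.has(node.nodeName.toUpperCase())) {
      // 1 === Node.ELEMENT_NODE; push children in reverse so that the
      // first child is popped (visited) first.
      for (let i = node.childNodes.length - 1; i >= 0; i--) {
        stack.push(node.childNodes[i]);
      }
    }
  }
  return parts.join(' ').slice(0, maxChars);
}
```

In a page this would be called as `collectText(document.body, 2000)`, where 2000 is an arbitrary sample size for language detection.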

mortalis

For me, in 2021 on Chrome, this gets a lot more than just the text. A quick test of this on wikipedia, for example, manages to extract a lot of CSS & code in addition to the text on the page. `document.body.innerText`, however, works cleanly. – Josh Desmond Mar 15 '21 at 21:30
Here is some info about `innerText`, `textContent` and the differences: [HTMLElement.innerText](https://developer.mozilla.org/en-US/docs/Web/API/HTMLElement/innerText), [textContent and innerText differences](https://developer.mozilla.org/en-US/docs/Web/API/Node/textContent#differences_from_innertext) – mortalis Mar 16 '21 at 19:24

score 18 · Answer 2 · answered Nov 20 '10 at 16:18

Without jQuery, just as easy: document.body.innerText;
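
If older browsers that lack one of the two properties still matter, a small fallback is one option. This is only a sketch; `getPageText` is a made-up name, and the argument is assumed to be `document.body` or any other element:

```javascript
// Hypothetical helper: prefer innerText, fall back to textContent.
// Keep in mind the two are not equivalent: innerText reflects rendered
// text, while textContent also includes script/style contents.
function getPageText(element) {
  return typeof element.innerText === 'string'
    ? element.innerText
    : (element.textContent || '');
}
```

Usage in a page: `getPageText(document.body)`.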

pawel

innerText for IE only, document.body.textContent otherwise – kennebec Nov 20 '10 at 16:58
According to PPK, both are more or less cross-browser (innerText being absent in Firefox, textContent in IE) http://www.quirksmode.org/dom/w3c_html.html – pawel Nov 20 '10 at 17:42
They're different though: http://stackoverflow.com/questions/1359469/innertext-works-in-ie-but-not-in-firefox/1359822#1359822 – Tim Down Nov 21 '10 at 16:36
innerText is now implemented in all browsers. It works great and I'd recommend it. See https://caniuse.com/innertext. Many tutorials are still outdated and mention its lack of compatibility, but, no longer! – Josh Desmond Mar 21 '21 at 02:20

John Hartsock · Accepted Answer · 2013-02-08T19:48:54.733

7

Using the jQuery text() method

$('body').text()

edited Feb 08 '13 at 19:48

answered Nov 20 '10 at 15:38

John Hartsock

Sorry to nitpick, but you want: `$('body').text()` – szeryf Feb 08 '13 at 19:20

score 1 · Answer 4 · answered Nov 21 '19 at 17:40

VanillaJS:

document.body.outerHTML

guerrerocarlos

score 0 · Answer 5 · answered Mar 14 '18 at 07:29

You can use chrome.tabs.detectLanguage(integer tabId, function callback).
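
A minimal sketch of how this might be wrapped in a promise. `detectTabLanguage` is a made-up name, and `tabsApi` is assumed to be `chrome.tabs`; it is passed in as a parameter only so the sketch can be tried outside the browser. Error handling via `chrome.runtime.lastError` is left out:

```javascript
// Hypothetical wrapper around chrome.tabs.detectLanguage(tabId, callback).
// The callback receives a language code such as "en" or "fr", or "und"
// when the language could not be determined.
function detectTabLanguage(tabsApi, tabId) {
  return new Promise((resolve) => {
    tabsApi.detectLanguage(tabId, (language) => resolve(language));
  });
}
```

From an extension's background script this could be used as `detectTabLanguage(chrome.tabs, tabId).then((lang) => console.log(lang))`. As far as I know, `chrome.tabs` is not available to content scripts, so the call would have to be made from the background side.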

Vitaly Zdanevich

score 0 · Answer 6 · answered Jan 04 '21 at 07:11

All of these return undefined when tried in the Chrome console:

var text = document.body.textContent;
var text = document.body.outerHTML;
var text = document.body.innerText;

etc ...

DennisWPaulsenJR

The statement itself, `var text = document.body.innerText;` will return undefined, just as the statement `var i = 5;` will return undefined. Simply type `document.body.innerText` in the console and you will see the output. – Josh Desmond Mar 15 '21 at 21:33
