Convert HTML to plain text in JS without browser environment

Question

I have a CouchDB view map function that generates an abstract of a stored HTML document (first x characters of text). Unfortunately I have no browser environment to convert HTML to plain text.

Currently I use this multi-stage regexp:

html.replace(/<style([\s\S]*?)<\/style>/gi, ' ')
    .replace(/<script([\s\S]*?)<\/script>/gi, ' ')
    .replace(/(<(?:.|\n)*?>)/gm, ' ')
    .replace(/\s+/gm, ' ');

While it's a very good filter, it's obviously not a perfect one, and some leftovers occasionally slip through. Is there a better way to convert to plain text without a browser environment?

it may come down to using regex as you have listed for the bulk of replaces and then using a specified list replaces, such as :active; to complete the cleanse. — Valamas, Mar 03 '13 at 04:28
http://stackoverflow.com/a/29706729/3338098 preserves new-lines and strips html tags — user3338098, Apr 17 '15 at 18:38

Gaël Barbin · Answer 1 · 2021-08-17T00:21:52.820

37

This simple regular expression works:

text.replace(/<[^>]*>/g, '');

It removes all tags.

Entities such as `&lt;` do not contain a literal `<`, so they cause no problem for this regex.
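
Since the regex leaves entities in place, they still have to be decoded to get readable text. A minimal sketch of one way to do that, assuming only a handful of named entities matter; the names `NAMED_ENTITIES`, `decodeEntities` and `stripTags` are hypothetical and not part of the answer above:

```javascript
// Hypothetical helpers, not from the original answer.
// The named-entity table is deliberately tiny; extend it as needed.
const NAMED_ENTITIES = new Map([
  ['lt', '<'], ['gt', '>'], ['amp', '&'], ['quot', '"'], ['apos', "'"],
  ['nbsp', ' '], // assumption: a plain space is good enough for an abstract
]);

function decodeEntities(text) {
  // One pass over the string, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range numeric references untouched.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown names are left as they are.
    return NAMED_ENTITIES.has(body) ? NAMED_ENTITIES.get(body) : match;
  });
}

function stripTags(html) {
  // Strip tags first, then decode, so a decoded "<" is never mistaken for a tag.
  return decodeEntities(html.replace(/<[^>]*>/g, ''));
}
```

Decoding after stripping is the important part of the ordering: if entities were decoded first, text such as `1 &lt; 2 &gt; 0` would turn into something the tag regex would partly delete.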

edited Aug 17 '21 at 00:21

answered Mar 02 '13 at 22:31

Gaël Barbin



there are also entities to take care of – Meisner Nov 30 '16 at 16:12
Worked! But it's a challenge to parse HTML text in which the user has placed a word inside '<>'. – Gaurav Gupta Jan 20 '20 at 05:40
Works for me for formatted HTML error messages from PHP running in Ajax. – David Spector Oct 31 '20 at 15:51

EpokK · Accepted Answer · 2013-11-19T13:31:09.533

18

Convert HTML to plain text like Gmail:

html = html.replace(/<style([\s\S]*?)<\/style>/gi, '');
html = html.replace(/<script([\s\S]*?)<\/script>/gi, '');
html = html.replace(/<\/div>/ig, '\n');
html = html.replace(/<\/li>/ig, '\n');
html = html.replace(/<li>/ig, '  *  ');
html = html.replace(/<\/ul>/ig, '\n');
html = html.replace(/<\/p>/ig, '\n');
html = html.replace(/<br\s*[\/]?>/gi, "\n");
html = html.replace(/<[^>]+>/ig, '');

If you can use jQuery:

var html = jQuery('<div>').html(html).text();
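
For the no-browser case, the replace chain above can be wrapped in a function. The sketch below is one possible variant, not the answer's own code: it additionally drops HTML comments and collapses runs of newlines, two points raised in the comments on this answer. The function name `htmlToPlainText` and the extra steps are assumptions made for illustration.

```javascript
// Hypothetical wrapper around the replace chain; the name and the extra steps are illustrative.
function htmlToPlainText(html) {
  return html
    .replace(/<!--[\s\S]*?-->/g, '')               // drop HTML comments
    .replace(/<style\b[\s\S]*?<\/style>/gi, '')    // drop style blocks with their content
    .replace(/<script\b[\s\S]*?<\/script>/gi, '')  // drop script blocks with their content
    .replace(/<li\b[^>]*>/gi, '  *  ')             // list items become bullets; \b keeps <link> out
    .replace(/<\/(div|li|ul|p)>/gi, '\n')          // closing block-level tags end a line
    .replace(/<br\s*\/?>/gi, '\n')                 // explicit line breaks
    .replace(/<[^>]+>/g, '')                       // remove whatever tags remain
    .replace(/[ \t]+\n/g, '\n')                    // trim trailing spaces on each line
    .replace(/\n{2,}/g, '\n')                      // collapse repeated newlines into one
    .trim();
}
```

This is still regex-based, so it shares the known limits of the approach, for example a literal `>` inside an attribute value will confuse the tag patterns.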

edited Nov 19 '13 at 13:31

answered Nov 19 '13 at 12:36

EpokK


The DOM conversion is problematic the way you do it. This will load all links in the HTML snippet, if the html is not sanitized. This should be done via a document fragment that's not attached to the DOM. – Nov 19 '13 at 13:41
doesn't add a `\n` in `TEXT1
TEXT2
`, i.e. it returns `TEXT1TEXT2\n` – user3338098 Apr 17 '15 at 18:29
+1 for good answer. but i want to also replace more than one new line character to one in above code. please help – Satish Sharma Aug 07 '15 at 12:03
`var html = jQuery(html).text();` is more simple. – est Jan 12 '16 at 05:09
The replace method works for me. The versions using jQuery html(...) or document.createElement(...) all seem to load images and scripts that may be included in the content, which is a waste of time and potential security risk (I use this function to display sample content from user input) – Etherman Feb 28 '20 at 11:18
I think this answer misses quoting with `>` character – Yuri Tinyukov Jun 08 '20 at 12:11
it should also handle links (href from tags) – Bruno Lemos Sep 30 '20 at 06:02
Doesn't this miss comment tags? – actinidia Feb 04 '21 at 09:42

gyula.nemeth · Answer 3 · 2016-07-27T16:24:04.550

11

With TextVersionJS (http://textversionjs.com) you can convert your HTML to plain text. It's pure JavaScript (with tons of RegExps), so you can use it in the browser and in Node.js as well.

In Node.js it looks like this:

var createTextVersion = require("textversionjs");
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";

var textVersion = createTextVersion(yourHtml);

(I copied the example from the page; you will have to npm install the module first.)

edited Jul 27 '16 at 16:24

answered Jul 27 '16 at 12:14

gyula.nemeth


Note: it converts links to markup, so it's not quite "plain" text. Still helpful. – dfrankow Nov 30 '17 at 01:27
It also passes HTML entities right through: `&lt;` should be translated to `<` but instead is left as `&lt;`. – Greg Price Apr 20 '20 at 19:41

score 6 · Answer 4 · edited Apr 13 '18 at 02:31


You can try it this way. Neither textContent nor innerText is supported in every browser, so the code falls back from one to the other:

function htmlToText(html) {
    var temp = document.createElement("div");
    temp.innerHTML = html;
    return temp.textContent || temp.innerText || "";
}

edited Apr 13 '18 at 02:31

Stephen Rauch


answered Apr 13 '18 at 02:11

Dostonbek Oripjonov



This doesn't address the question "without browser environment". – devansvd Aug 10 '20 at 19:31

score 3 · Answer 5 · answered Dec 04 '20 at 22:02

An updated version of @EpokK's answer for the use case of generating the text version of an HTML email:

const htmltoText = (html: string) => {
  let text = html;
  text = text.replace(/\n/gi, "");
  text = text.replace(/<style([\s\S]*?)<\/style>/gi, "");
  text = text.replace(/<script([\s\S]*?)<\/script>/gi, "");
  text = text.replace(/<a.*?href="(.*?)[\?\"].*?>(.*?)<\/a.*?>/gi, " $2 $1 ");
  text = text.replace(/<\/div>/gi, "\n\n");
  text = text.replace(/<\/li>/gi, "\n");
  text = text.replace(/<li.*?>/gi, "  *  ");
  text = text.replace(/<\/ul>/gi, "\n\n");
  text = text.replace(/<\/p>/gi, "\n\n");
  text = text.replace(/<br\s*[\/]?>/gi, "\n");
  text = text.replace(/<[^>]+>/gi, "");
  text = text.replace(/^\s*/gim, "");
  text = text.replace(/ ,/gi, ",");
  text = text.replace(/ +/gi, " ");
  text = text.replace(/\n+/gi, "\n\n");
  return text;
};

score 1 · Answer 6 · answered Feb 28 '21 at 00:30

If you want something accurate and can use npm packages, I would use html-to-text.

From the README:

const { htmlToText } = require('html-to-text');

const html = '<h1>Hello World</h1>';
const text = htmlToText(html, {
  wordwrap: 130
});
console.log(text); // Hello World

FYI, I found this on npm trends; html-to-text seemed like the best option for my use case but you can check out others here.

score -3 · Answer 7 · answered Feb 27 '16 at 19:31


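It's pretty simple; you can also add a "toText" method to the String prototype:

String.prototype.toText = function(){
    // Parse the string the method was called on (this), not an outer variable.
    return $("<div>").html(String(this)).text();
};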

//Let's test it out!
var html = "<a href=\"http://www.google.com\">link</a>&nbsp;<br /><b>TEXT</b>";
var text = html.toText();
console.log("Text: " + text); //Result will be "link TEXT"

answered Feb 27 '16 at 19:31

Alberto Di Cagno


really don't see how this answer is relevant. – sps Jul 27 '17 at 10:25
