33

I have a CouchDB view map function that generates an abstract of a stored HTML document (first x characters of text). Unfortunately I have no browser environment to convert HTML to plain text.

Currently I use this multi-stage regexp:

html.replace(/<style([\s\S]*?)<\/style>/gi, ' ')
    .replace(/<script([\s\S]*?)<\/script>/gi, ' ')
    .replace(/(<(?:.|\n)*?>)/gm, ' ')
    .replace(/\s+/gm, ' ');

While it's a very good filter, it's obviously not a perfect one, and some leftovers sometimes slip through. Is there a better way to convert to plain text without a browser environment?
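
For reference, the chain above can be wrapped into a helper that also cuts the result down to the first x characters, which is what the abstract needs. This is only a sketch: the name `makeAbstract` and its `maxLength` parameter are made up for illustration, and in a CouchDB view the helper would have to be defined inside the map function (or otherwise made available to it) before calling emit.

```javascript
// Hypothetical helper: the question's regex chain plus truncation.
function makeAbstract(html, maxLength) {
  const text = html
    .replace(/<style([\s\S]*?)<\/style>/gi, ' ')   // drop <style> blocks
    .replace(/<script([\s\S]*?)<\/script>/gi, ' ') // drop <script> blocks
    .replace(/(<(?:.|\n)*?>)/gm, ' ')              // drop remaining tags
    .replace(/\s+/gm, ' ')                         // collapse whitespace
    .trim();
  // Keep only the first maxLength characters of the plain text.
  return text.slice(0, maxLength);
}

const abstractText = makeAbstract('<style>p{}</style><p>Hello <b>world</b></p>', 8);
// abstractText is "Hello wo"
```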

  • It may come down to using the regexes you have listed for the bulk of the replacements, and then a specific list of replacements, such as :active;, to complete the cleanup. – Valamas Mar 03 '13 at 04:28
  • http://stackoverflow.com/a/29706729/3338098 preserves new-lines and strips html tags – user3338098 Apr 17 '15 at 18:38

7 Answers

37

This simple regular expression works:

text.replace(/<[^>]*>/g, '');

It removes all tags, not just anchors.

Entities such as &lt; do not contain a literal <, so they cause no problems for this regex.
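
Note that the entities stay encoded in the output. If decoded text is wanted, a second pass can translate them. The sketch below is an assumption-laden illustration, not part of the original answer: `stripTags` and `decodeBasicEntities` are hypothetical names, and only a handful of common entities are handled.

```javascript
// Hypothetical lookup table covering only a few common entities.
const ENTITIES = {
  '&lt;': '<',
  '&gt;': '>',
  '&quot;': '"',
  '&#39;': "'",
  '&nbsp;': ' ',
  '&amp;': '&'
};

// The regex from the answer: remove anything that looks like a tag.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, '');
}

// Decode in a single pass so that "&amp;lt;" becomes "&lt;" and not "<".
function decodeBasicEntities(text) {
  return text.replace(/&(?:lt|gt|quot|#39|nbsp|amp);/g, function (match) {
    return ENTITIES[match];
  });
}

const plain = decodeBasicEntities(stripTags('<p>1 &lt; 2 <b>always</b></p>'));
// plain is "1 < 2 always"
```

Decoding after stripping matters: doing it the other way round would turn &lt; into a real < and the tag regex could then eat text that was never markup.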

Gaël Barbin
18

Convert HTML to plain text the way Gmail does:

html = html.replace(/<style([\s\S]*?)<\/style>/gi, '');
html = html.replace(/<script([\s\S]*?)<\/script>/gi, '');
html = html.replace(/<\/div>/ig, '\n');
html = html.replace(/<\/li>/ig, '\n');
html = html.replace(/<li>/ig, '  *  ');
html = html.replace(/<\/ul>/ig, '\n');
html = html.replace(/<\/p>/ig, '\n');
html = html.replace(/<br\s*[\/]?>/gi, "\n");
html = html.replace(/<[^>]+>/ig, '');

If you can use jQuery :

var html = jQuery('<div>').html(html).text();
EpokK
  • The DOM conversion is problematic the way you do it. This will load all links in the HTML snippet, if the html is not sanitized. This should be done via a document fragment that's not attached to the DOM. –  Nov 19 '13 at 13:41
  • doesn't add a `\n` in `TEXT1
    TEXT2
    `, i.e. it returns `TEXT1TEXT2\n`
    – user3338098 Apr 17 '15 at 18:29
  • +1 for a good answer, but I also want to collapse multiple newline characters into one in the code above. Please help. – Satish Sharma Aug 07 '15 at 12:03
  • `var html = jQuery(html).text();` is simpler. – est Jan 12 '16 at 05:09
  • The replace method works for me. The versions using jQuery html(...) or document.createElement(...) all seem to load images and scripts that may be included in the content, which is a waste of time and potential security risk (I use this function to display sample content from user input) – Etherman Feb 28 '20 at 11:18
  • I think this answer misses quoting with `>` character – Yuri Tinyukov Jun 08 '20 at 12:11
  • it should also handle links (href from tags) – Bruno Lemos Sep 30 '20 at 06:02
  • Doesn't this miss comment tags? – actinidia Feb 04 '21 at 09:42
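
Two of the comments above ask about HTML comments and about repeated newlines. One possible way to cover both with the same regex approach is sketched below. The function name `htmlToPlainText` is hypothetical, and the closing-tag rules are merged into one pattern for brevity; this is a variation on the answer's chain, not its original code.

```javascript
// Hypothetical variant of the regex chain from the answer.
function htmlToPlainText(html) {
  return html
    .replace(/<!--[\s\S]*?-->/g, '')            // remove HTML comments first
    .replace(/<style[\s\S]*?<\/style>/gi, '')   // remove <style> blocks
    .replace(/<script[\s\S]*?<\/script>/gi, '') // remove <script> blocks
    .replace(/<\/(?:div|li|ul|p)>/gi, '\n')     // block-level closers become newlines
    .replace(/<li[^>]*>/gi, '  *  ')            // list items get a bullet
    .replace(/<br\s*\/?>/gi, '\n')              // <br> becomes a newline
    .replace(/<[^>]+>/g, '')                    // strip whatever tags remain
    .replace(/\n{2,}/g, '\n');                  // collapse runs of newlines into one
}

const result = htmlToPlainText('<!-- a > b --><p>One</p><p>Two</p><br><br>Three');
// result is "One\nTwo\nThree"
```

Comments are removed before the generic tag rule because a `>` inside a comment would otherwise end the match early and leave part of the comment in the text.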
11

With TextVersionJS (http://textversionjs.com) you can convert your HTML to plain text. It's pure javascript (with tons of RegExps) so you can use it in the browser and in node.js as well.

In node.js it looks like:

var createTextVersion = require("textversionjs");
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";

var textVersion = createTextVersion(yourHtml);

(I copied the example from the page; you will have to npm install the module first.)

gyula.nemeth
6

You can try it this way. Neither textContent nor innerText is supported by every browser, so fall back from one to the other:

function htmlToText(html) {
  var temp = document.createElement("div");
  temp.innerHTML = html;
  return temp.textContent || temp.innerText || "";
}
Stephen Rauch
Dostonbek Oripjonov
3

An updated version of @EpokK's answer, for generating the plain-text version of an HTML email:

const htmltoText = (html: string) => {
  let text = html;
  text = text.replace(/\n/gi, "");
  text = text.replace(/<style([\s\S]*?)<\/style>/gi, "");
  text = text.replace(/<script([\s\S]*?)<\/script>/gi, "");
  text = text.replace(/<a.*?href="(.*?)[\?\"].*?>(.*?)<\/a.*?>/gi, " $2 $1 ");
  text = text.replace(/<\/div>/gi, "\n\n");
  text = text.replace(/<\/li>/gi, "\n");
  text = text.replace(/<li.*?>/gi, "  *  ");
  text = text.replace(/<\/ul>/gi, "\n\n");
  text = text.replace(/<\/p>/gi, "\n\n");
  text = text.replace(/<br\s*[\/]?>/gi, "\n");
  text = text.replace(/<[^>]+>/gi, "");
  text = text.replace(/^\s*/gim, "");
  text = text.replace(/ ,/gi, ",");
  text = text.replace(/ +/gi, " ");
  text = text.replace(/\n+/gi, "\n\n");
  return text;
};
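
To see what the link rule in the function above does on its own, here is a small sketch in plain JavaScript. The sample string and the example.com URL are made up for illustration. Note that the pattern stops the captured URL at the first `?` or `"`, so any query string is dropped from the output.

```javascript
// The link rule from the function above, applied in isolation.
const linkRule = /<a.*?href="(.*?)[\?\"].*?>(.*?)<\/a.*?>/gi;

// Hypothetical input; the URL is only an example.
const sample = 'See <a class="x" href="https://example.com/docs?ref=mail">the docs</a> now';

const converted = sample
  .replace(linkRule, " $2 $1 ") // link text first, then the URL
  .replace(/ +/g, " ");         // collapse the doubled spaces, as the function does later

// converted is "See the docs https://example.com/docs now"
```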

Melounek
1

If you want something accurate and can use npm packages, I would use html-to-text.

From the README:

const { htmlToText } = require('html-to-text');

const html = '<h1>Hello World</h1>';
const text = htmlToText(html, {
  wordwrap: 130
});
console.log(text); // Hello World

FYI, I found this on npm trends; html-to-text seemed like the best option for my use case but you can check out others here.

Killian Huyghe
-3

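It's pretty simple with jQuery; you can also implement a "toText" method on the String prototype:

String.prototype.toText = function () {
    // Parse the string itself (this), not an outer variable, inside a detached <div>
    return $("<div>").html(String(this)).text();
};

// Let's test it out!
var html = "<a href=\"http://www.google.com\">link</a>&nbsp;<br /><b>TEXT</b>";
var text = html.toText();
console.log("Text: " + text); // Result will be "link TEXT" (the &nbsp; becomes a non-breaking space)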