What is the most convenient way to convert HTML to plain text while preserving line breaks (with JavaScript)?

Question

Basically, I just need the effect of copying the rendered HTML from a browser window and pasting it into a textarea element.

For example, I want this:

<p>Some</p>
<div>text<br />Some</div>
<div>text</div>

to become this:

Some
text
Some
text

The problem you're going to have is the order the text appears. How something lays out is not always related to the markup hierarchy. — AutoSponge, Sep 28 '10 at 14:14
possible duplicate of [Strip HTML from Text JavaScript](http://stackoverflow.com/questions/822452/strip-html-from-text-javascript) — bdukes, Sep 28 '10 at 14:56

Tim Down · Accepted Answer · 2017-12-01T10:13:10.500

21

If that HTML is visible within your web page, you could do it with the user selection (or just a TextRange in IE). This does preserve line breaks, if not necessarily leading and trailing white space.

UPDATE 10 December 2012

However, the toString() method of Selection objects is not yet standardized and works inconsistently between browsers, so this approach is based on shaky ground and I don't recommend using it now. I would delete this answer if it weren't accepted.

Demo: http://jsfiddle.net/wv49v/

Code:

function getInnerText(el) {
    var sel, range, innerText = "";
    if (typeof document.selection != "undefined" && typeof document.body.createTextRange != "undefined") {
        range = document.body.createTextRange();
        range.moveToElementText(el);
        innerText = range.text;
    } else if (typeof window.getSelection != "undefined" && typeof document.createRange != "undefined") {
        sel = window.getSelection();
        sel.selectAllChildren(el);
        innerText = "" + sel;
        sel.removeAllRanges();
    }
    return innerText;
}

edited Dec 01 '17 at 10:13

answered Sep 28 '10 at 13:57

Tim Down


Thanks. Interestingly, in the non-IE case (first block) it gets what would be copied to the clipboard, but in the IE case (second block) it's not the same string. – Danylo Mysak Sep 28 '10 at 14:28
What's the difference between the IE and non-IE strings? The first block uses Selection's `toString()` method to extract just the text of the selection (rather than the rich text that gets copied to the clipboard), so they should be more or less identical. – Tim Down Sep 28 '10 at 15:18
Sorry, I meant that the string you get by copying a fragment of the page to the clipboard differs from the one your function returns. This is the case with IE; for non-IE browsers the two strings are identical. The function itself is perfect for the problem I described in my question (except for the IE stuff, which is not so important). – Danylo Mysak Sep 28 '10 at 16:18
Unfortunately, it turned out that my real problem is quite different and probably can’t be solved this way. I need two paragraphs of text, both with margin: 0, to be recognized as two consecutive lines without an empty line between them. It seems like WebKit-browsers are the only browsers that take 'margin' parameter into consideration. – Danylo Mysak Sep 28 '10 at 16:19
Ah. I don't have an easy answer for that. – Tim Down Sep 28 '10 at 17:53
this can't keep line breaks – hienbt88 Dec 10 '12 at 07:32
@hienbt88: It's certainly built on shaky foundations: Selection.toString() isn't standardized, works differently between browsers and does not preserve line breaks in IE 9 (released since the original version of this answer was written). However, it still does preserve line breaks in current versions of Mozilla, WebKit and Opera, and since I tweaked it just now, IE. I wouldn't recommend this approach for the long term, to be honest. – Tim Down Dec 10 '12 at 09:43
This solution works really well compared to: http://stackoverflow.com/questions/4502673/jquery-text-function-loses-line-breaks-in-ie I am getting better results with it than with other methods, and in Safari and Chrome it seems to work OK. Since the post is about 1 year old, are there any updates on the stability of this solution? – Nearpoint Jul 20 '14 at 04:23
@nearpoint: Nothing much has changed since my last comment, as far as I'm aware. If you use this approach, you're at the mercy of browser developers. – Tim Down Jul 20 '14 at 16:44
Thanks, so far it seems to work on pretty recent versions of Firefox, Safari, and Chrome on Mac. I suppose it would be the same for windows versions. And it looks like you got IE working. Are you aware of any issues in certain browsers? As far as I can tell it works great and I want to use it, but I want to be aware of what issues are out there to watch for. – Nearpoint Jul 20 '14 at 16:48
@nearpoint: I'm not aware of specific issues, but the kind of thing to watch out for would be how different browsers handle things like table cells (possibly these will be separated by tabs in string representations), contents of ` – Tim Down Jul 20 '14 at 21:12
@TimDown calling sel.removeAllRanges() also makes the passed element to lose focus. – Saif Jan 07 '19 at 09:55
@SaifUllah: Yes. This answer was never a particularly good idea. – Tim Down Jan 07 '19 at 10:36

Kevin Wiskia · Answer 2 · 2010-09-28T14:41:32.313

I wrote some code for this a while back and tried to find it. It worked nicely. Let me outline what it did, so that hopefully you can duplicate its behavior.

Replace images with their alt or title text.
Replace links with "text [link]".
Replace tags that generally produce vertical white space (h1-h6, div, p, br, hr, etc.) with a line break. (I know, I know. These could actually be inline elements, but it works out well.)
Strip out the rest of the tags, replacing them with an empty string.

You could even expand this more to format things like ordered and unordered lists. It really just depends on how far you'll want to go.

EDIT

Found the code!

// Requires: using System.Text.RegularExpressions;
public static string Convert(string template)
{
    template = Regex.Replace(template, "<img .*?alt=[\"']?([^\"']*)[\"']?.*?/?>", "$1"); /* Use image alt text. */
    template = Regex.Replace(template, "<a .*?href=[\"']?([^\"']*)[\"']?.*?>(.*?)</a>", "$2 [$1]"); /* Convert links to "text [url]"; the lazy (.*?) keeps adjacent links separate. */
    template = Regex.Replace(template, "<(/p|/div|/h\\d|br)\\s*/?>", "\n"); /* Let's try to keep vertical whitespace intact; \s* also matches "<br />". */
    template = Regex.Replace(template, "<[A-Za-z/][^<>]*>", ""); /* Remove the rest of the tags. */

    return template;
}
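
Since the question asks for JavaScript, here is a rough port of the same four replacements. It is only a sketch: the function name convertHtmlToText is made up for illustration, the patterns are slightly adapted (whitespace is allowed before "/>", and the link text match is lazy), and like any regex-based approach it ignores CSS and malformed markup.

```javascript
// Hypothetical JavaScript port of the C# Convert method above.
// The name convertHtmlToText is an assumption, not part of the original answer.
function convertHtmlToText(html) {
    return html
        // Use image alt text.
        .replace(/<img\s[^>]*?alt=["']?([^"']*)["']?[^>]*>/gi, "$1")
        // Convert links to "text [url]".
        .replace(/<a\s[^>]*?href=["']?([^"'\s>]*)["']?[^>]*>([\s\S]*?)<\/a>/gi, "$2 [$1]")
        // Closing block-level tags and <br> become line breaks.
        .replace(/<(\/p|\/div|\/h\d|br)\s*\/?>/gi, "\n")
        // Remove the rest of the tags.
        .replace(/<[A-Za-z\/][^<>]*>/g, "");
}

// Usage with the markup from the question:
console.log(convertHtmlToText("<p>Some</p><div>text<br />Some</div><div>text</div>"));
```

For the question's example this yields the four words on separate lines, followed by a trailing line break that you may want to trim.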

Erm... that's not JavaScript, is it? It also doesn't directly answer the question, given that the question really concerns copy and paste — Yi Jiang, Sep 28 '10 at 14:00
The language really doesn't matter; it's how it goes about it. This could easily be ported to JS. I'm just showing something I had done in the past. — Kevin Wiskia, Sep 28 '10 at 14:06
Thank you. That’s quite like it. Although, unfortunately, the result is not exactly what user sees. For example, Convert('
Some
text
') and Convert('
Some
text
') give different results while browser renders those the same way. — Danylo Mysak, Sep 28 '10 at 14:13

chrmcpn · Answer 3 · 2018-06-12T17:21:14.717

I made a function based on this answer: https://stackoverflow.com/a/42254787/3626940

function htmlToText(html){
    //remove source line breaks and tabs
    html = html.replace(/\n/g, "");
    html = html.replace(/\t/g, "");

    //keep html line breaks and tabs
    html = html.replace(/<\/td>/g, "\t");
    html = html.replace(/<\/table>/g, "\n");
    html = html.replace(/<\/tr>/g, "\n");
    html = html.replace(/<\/p>/g, "\n");
    html = html.replace(/<\/div>/g, "\n");
    html = html.replace(/<\/h[1-6]>/g, "\n"); //closing heading tags are </h1> to </h6>; a literal </h> never occurs
    html = html.replace(/<br\s*\/?>/g, "\n"); //matches <br>, <br/> and <br />

    //parse html into text
    var dom = (new DOMParser()).parseFromString('<!doctype html><body>' + html, 'text/html');
    return dom.body.textContent;
}

For the `` replacement, use two line breaks instead of one: `"\n\n"`. — Moxley Stratton, Oct 08 '21 at 00:23

score 1 · Answer 4 · answered Mar 05 '19 at 13:16

Based on chrmcpn's answer: I had to convert a basic HTML email template into a plain-text version as part of a build script in Node.js. I had to use JSDOM to make it work, but here's my code:

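const { JSDOM } = require("jsdom"); // from the "jsdom" npm package

const htmlToText = (html) => {
    // Remove source line breaks and tabs.
    html = html.replace(/\n/g, "");
    html = html.replace(/\t/g, "");

    // Turn closing paragraph/heading tags and <br> into line breaks.
    html = html.replace(/<\/p>/g, "\n\n");
    html = html.replace(/<\/h1>/g, "\n\n");
    html = html.replace(/<br>/g, "\n");
    html = html.replace(/<br( )*\/>/g, "\n");

    // Parse what is left and read its text content.
    const dom = new JSDOM(html);
    let text = dom.window.document.body.textContent;

    // Drop double spaces (leftover indentation) and trim the result.
    text = text.replace(/  /g, "");
    text = text.replace(/\n /g, "\n");
    text = text.trim();
    return text;
};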

score -2 · Answer 5 · answered Sep 28 '10 at 13:37

Three steps.

First get the html as a string.
Second, replace all <BR /> and <BR> with \r\n.
Third, use the regular expression "<(.|\n)*?>" to replace all markup with "".
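
As a minimal sketch, steps two and three could look like this in JavaScript. The function name stripMarkup is hypothetical, and the HTML string from step one is assumed to be passed in by the caller.

```javascript
// Hypothetical helper illustrating steps two and three; the name stripMarkup is an assumption.
function stripMarkup(html) {
    // Step 2: turn <br> and <br /> (in any letter case) into line breaks.
    const withBreaks = html.replace(/<br\s*\/?>/gi, "\r\n");
    // Step 3: remove all remaining markup using the regular expression from the answer.
    return withBreaks.replace(/<(.|\n)*?>/g, "");
}

console.log(JSON.stringify(stripMarkup("<div>text<br />Some</div>")));  // "text\r\nSome"
console.log(JSON.stringify(stripMarkup("<p>Some</p><div>text</div>"))); // "Sometext"
```

The second call shows the limitation raised in the comments: nothing separates the text of adjacent p and div elements unless a break is inserted for their closing tags first.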

Serapth


Unfortunately, this approach ignores line breaks that emerge between two paragraphs or divs. – Danylo Mysak Sep 28 '10 at 13:44
Is that not as easily solved by inserting a hard break after each close P and DIV tag before doing the regex replace? – Serapth Sep 28 '10 at 13:47
Well, the problem is a bit deeper. I need to get text that resembles what the user sees on screen. For example, if there are two paragraphs ('p' elements) and they both have the standard margin, I want to get two line breaks between the corresponding text fragments. But when the margin is 0, it needs to be a single line break. That's how the clipboard works, at least in some browsers. – Danylo Mysak Sep 28 '10 at 13:56
