Extracting the text of an HTML document in Node.js using regex

Question

I am writing code to extract all the plain-text content from the `<body>` tag of an HTML document. I know it can be done using the document element, but I need to do this using regex only. I have written the following code, but it has some bugs that I have not been able to figure out how to fix.

function htmlToText(html) {
      return html.
        replace(/(.|\n)*<body.*>/, ''). //remove up till body
        replace(/<\/body(.|\n)*/, ''). //remove from </body
        replace(/<.+\>/, ''). //remove tags
        replace(/^\s\n*$/gm, '');  //remove empty lines
    }

Here is the solution for it

function htmlToText(html) {
  return html
    .replace(/[\s\S]*?<body[^>]*>/i, '') // remove everything up to and including the opening body tag
    .replace(/<\/body[\s\S]*/i, '')      // remove everything from </body onward
    .replace(/<[^>]+>/g, '')             // remove tags, one tag per match
    .replace(/^\s*\n/gm, '');            // remove empty and whitespace-only lines
}
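
The `g` flag on the tag-stripping pattern is what makes the second version remove every tag rather than only the first one. A minimal sketch of that behaviour, using a made-up input string (not from the question) and comparing a pattern that stops at the first `>` with the greedy `<.+>`:

```javascript
// Hypothetical sample input, used only for illustration.
const sample = '<p>Hello <b>world</b></p>';

// Without the g flag, String.prototype.replace stops after the first match.
const firstOnly = sample.replace(/<[^>]+>/, '');

// With the g flag, every tag-shaped match is replaced.
const allTags = sample.replace(/<[^>]+>/g, '');

// The greedy .+ runs from the first '<' to the last '>' on the line,
// so the text between the tags is removed as well.
const greedy = sample.replace(/<.+>/g, '');

console.log(firstOnly); // Hello <b>world</b></p>
console.log(allTags);   // Hello world
console.log(greedy);    // (empty string)
```

The greedy case is why `<[^>]+>` is usually the safer choice when several tags share a line.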

In the general case, you cannot parse HTML accurately with a regular expression. You'd be better off letting something (the browser itself, if that's where your code runs) parse the HTML for you, and then you can traverse the DOM looking for text nodes. — Pointy, Sep 20 '18 at 13:18
Just use `document.getElementsByTagName("body")[0].innerText` — Arun Kumar, Sep 20 '18 at 13:19
I am not running this on a client. I am parsing the HTML code as a normal string — Dipesh Desai, Sep 24 '18 at 04:08
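
The caveat about regular expressions and HTML is easy to reproduce even when the HTML is handled as a plain string in Node.js. A small sketch, assuming a hypothetical `stripTags` helper and made-up inputs, of two cases where pattern-based tag stripping gives misleading output:

```javascript
// Hypothetical helper for illustration: removes anything shaped like a tag.
function stripTags(html) {
  return html.replace(/<[^>]+>/g, '');
}

// Case 1: a '>' inside an attribute value ends the match too early,
// so part of the attribute leaks into the output.
const attr = '<a title="a > b">link</a>';
console.log(stripTags(attr)); // ' b">link'

// Case 2: script contents are not tags, so they survive as if they were text.
const script = '<p>Hi</p><script>var x = 1;</script>';
console.log(stripTags(script)); // 'Hivar x = 1;'
```

For input that is known to be simple and well formed, these cases may never come up; for arbitrary HTML they are the kind of failure a real parser avoids.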

score 3 · Accepted Answer · answered Sep 20 '18 at 13:21

No need to overthink it; you can just use `document.body.innerText`.

scniro

Would be glad if you could help me out with a REGEX solution – Dipesh Desai Sep 24 '18 at 04:08
@DipeshDesai that would be an unwise implementation. Smarter not harder – scniro Dec 06 '18 at 14:17
