I am writing code to extract all the plain text content from the `<body>` tag of an HTML document. I know this can be done using the document element, but I need to do it using regular expressions only. I have written the following code, but it has some bugs that I cannot figure out how to solve.

function htmlToText(html) {
      return html.
        replace(/(.|\n)*<body.*>/, ''). //remove up till body
        replace(/<\/body(.|\n)*/, ''). //remove from </body
        replace(/<.+\>/, ''). //remove tags
        replace(/^\s\n*$/gm, '');  //remove empty lines
    }
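The two likely problems in the tag-stripping step can be reproduced in isolation. The sketch below is only an illustration: the `sample` string is a made-up input, not taken from the original code. It compares a greedy pattern without the `g` flag, the same pattern with `g`, and a per-tag pattern built on a negated character class.

```javascript
// Hypothetical sample input, not from the original post.
const sample = '<p>Hello <b>world</b></p>\n<p>Bye</p>';

// Greedy, no g flag: only the first match is replaced, and that match runs
// from the first '<' to the last '>' on the line, taking the text with it.
const greedyOnce = sample.replace(/<.+>/, '');

// Greedy with g flag: every line is processed, but text between the first
// and last tag on each line is still swallowed.
const greedyAll = sample.replace(/<.+>/g, '');

// Negated character class with g flag: each tag is matched on its own,
// so the text between tags survives.
const perTag = sample.replace(/<[^>]+>/g, '');

console.log(JSON.stringify(greedyOnce)); // "\n<p>Bye</p>"
console.log(JSON.stringify(greedyAll));  // "\n"
console.log(JSON.stringify(perTag));     // "Hello world\nBye"
```

Without `g`, `String.prototype.replace` substitutes only the first match, and `.+` consumes as much of the line as it can before backtracking to the last `>`.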

Here is the solution:

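function htmlToText(html) {
    return html
        .replace(/[\s\S]*<body[^>]*>/i, '')  // remove everything up to and including the <body ...> tag
        .replace(/<\/body[\s\S]*/i, '')      // remove everything from </body onward
        .replace(/<[^>]+>/g, '')             // remove every tag; [^>] keeps each match inside a single tag
        .replace(/^\s*\n/gm, '');            // remove empty and whitespace-only lines
}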
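A quick way to check a function like this is to run it on a small document. The sketch below is an assumption-laden illustration, not the definitive version: `htmlToTextSketch` and the `page` input are hypothetical names and data, and the patterns use negated character classes (`[^>]`) rather than `.` so that each tag is matched separately.

```javascript
// Sketch of the four-step pipeline; the name and patterns are illustrative.
function htmlToTextSketch(html) {
  return html
    .replace(/[\s\S]*<body[^>]*>/i, '') // drop everything through the opening <body ...> tag
    .replace(/<\/body[\s\S]*/i, '')     // drop everything from </body onward
    .replace(/<[^>]+>/g, '')            // drop each remaining tag individually
    .replace(/^\s*\n/gm, '');           // drop empty and whitespace-only lines
}

// Hypothetical input document.
const page = [
  '<html>',
  '<head><title>Ignored</title></head>',
  '<body class="main">',
  '<h1>A Sample Document</h1>',
  '',
  '<p>Some <strong>strong</strong> and <em>emphasized</em> text</p>',
  '</body>',
  '</html>',
].join('\n');

console.log(htmlToTextSketch(page));
// A Sample Document
// Some strong and emphasized text
```

This still has the limits of any regex-only approach: character entities such as `&amp;` are not decoded, the contents of `<script>` and `<style>` elements inside the body are kept as text, and a `>` inside an attribute value ends the tag match early.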
Dipesh Desai
  • In the general case, you cannot parse HTML accurately with a regular expression. You'd be better off letting something (the browser itself, if that's where your code runs) parse the HTML for you, and then you can traverse the DOM looking for text nodes. – Pointy Sep 20 '18 at 13:18
  • Just use `document.getElementsByTagName("body")[0].innerText` – Arun Kumar Sep 20 '18 at 13:19
  • I am not running this on a client. I am parsing the HTML code as a normal string – Dipesh Desai Sep 24 '18 at 04:08

1 Answer

No need to overthink it; you can just use `document.body.innerText`.

A Sample Document
Some strong and emphasized text

JSFiddle example

scniro