
How can I convert HTML to text efficiently in Node.js, i.e. outside of the browser? I also want entities such as &auml; to be converted to ä and so on, not just have the tags removed from the HTML.

Here is a Jest unit test for a function convertHtmlToText that performs this conversion:

it('when extract from partial html should extract text', () => {
  const html = `<p>&nbsp;&auml;&uuml;
\t<img alt="" src="http://www.test.org:80/imageupload/userfiles/2/images/world med new - 2022.jpg" style="width: 2000px; height: 1047px; max-width: 100%; height: auto;" /></p>
<p>
\tAn evening of music, silence and guiding thoughts to help us experience inner peace, connect with the Divine and share loving vibrations with the world. Join millions of people throughout the world to contribute in creating a wave of peace.</p>
<div>
\t&nbsp;</div>
<div>
\t<strong>Please join ....</strong></div>
<div>
\t&nbsp;</div>
<div>
\t<strong>Watch live:&nbsp;<a href="https://test.org/watchlive" target="_blank">test.org/watchlive</a></strong></div>`
  const text = convertHtmlToText(html)
  console.log(text)
  expect(text).toContain("ä");
  expect(text).toContain("ü");
  // no tag delimiters may remain in the extracted text
  expect(text).not.toContain("<");
  expect(text).not.toContain(">");
});
gil.fernandes

2 Answers


One possible solution is to use a library such as jsdom.

The following function removes the tags and also decodes the entities in any HTML string:

const jsdom = require("jsdom");
const { JSDOM } = jsdom;

const convertHtmlToText = (html) => {
  if(!html) {
    return ""
  }
  const dom = new JSDOM(html)
  const textContent = dom.window.document.documentElement.textContent
  // collapse runs of whitespace (\s also matches non-breaking spaces) into single spaces
  return textContent.replace(/\s+/g, ' ').trim()
}

module.exports = {
  convertHtmlToText
}
gil.fernandes

let HTMLContent = `<div> my&apos; <a href="profile/lol">profile</a></div>`;

let strippedHtml = decodeHTMLEntities(HTMLContent.replace(/<[^>]+>/g, ''));
console.log(strippedHtml)

function decodeHTMLEntities(text) {
  // 'amp' must come last; otherwise '&amp;lt;' would be decoded twice and end up as '<'
  var entities = [
    ['apos', '\''],
    ['#x27', '\''],
    ['#x2F', '/'],
    ['#39', '\''],
    ['#47', '/'],
    ['lt', '<'],
    ['gt', '>'],
    ['nbsp', ' '],
    ['quot', '"'],
    ['amp', '&']
  ];

  for (var i = 0, max = entities.length; i < max; ++i) {
    text = text.replace(new RegExp('&' + entities[i][0] + ';', 'g'), entities[i][1]);
  }
  return text;
}

Try this.
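The entity table above covers only a handful of names and no arbitrary numeric references. As a minimal sketch of how the same idea could be extended without a DOM library, the snippet below decodes decimal and hexadecimal references generically and looks up named entities in a small map, all in a single pass. The names `NAMED_ENTITIES` and `decodeEntities` are hypothetical and not part of either answer, and the map is deliberately incomplete (HTML defines far more named entities than are listed here).

```javascript
// Hypothetical helper, not from the answers above.
// Small, incomplete lookup table of named entities (an assumption for this sketch).
const NAMED_ENTITIES = {
  amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00a0',
  auml: 'ä', ouml: 'ö', uuml: 'ü', Auml: 'Ä', Ouml: 'Ö', Uuml: 'Ü', szlig: 'ß'
};

function decodeEntities(text) {
  // One pass over the input: each reference is decoded exactly once,
  // so '&amp;lt;' becomes '&lt;' and not '<'.
  return text.replace(
    /&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/g,
    (match, body) => {
      if (body[0] === '#') {
        const isHex = body[1] === 'x' || body[1] === 'X';
        const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
        // Leave out-of-range references untouched instead of throwing.
        return codePoint > 0 && codePoint <= 0x10ffff
          ? String.fromCodePoint(codePoint)
          : match;
      }
      // Unknown names are kept as they are.
      return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
        ? NAMED_ENTITIES[body]
        : match;
    }
  );
}

console.log(decodeEntities('&auml;&uuml; &#228;&#xFC; &amp;lt;'));
```

This only addresses entity decoding; the tags still have to be removed separately, and a real parser such as jsdom remains the more robust choice for that part.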

Segun Adeniji
  • Hello, this is not bad, but I also want entities like `&auml;&uuml;` to be properly converted to text. – gil.fernandes Feb 18 '22 at 11:19
  • sorry syntax fixed, try this. thanks – Segun Adeniji Feb 18 '22 at 11:36
  • [That's very fragile](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) (e.g. it will break if an attribute value contains `>`), the list of supported entities is very short, and it doesn't process whitespace correctly. – Quentin Feb 18 '22 at 11:37
  • @Quentin can you provide him with a better solution? he is using node.js, not browser js that has DOM to manipulate – Segun Adeniji Feb 18 '22 at 11:39
  • @SegunAdeniji — gil.fernandes already has – Quentin Feb 18 '22 at 11:40
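
To illustrate the fragility mentioned in the comments, here is a small sketch of what happens when an attribute value contains `>`. The input string is made up for this example, and `stripTags` is a hypothetical name that simply wraps the regular expression used in the regex-based answer.

```javascript
// Hypothetical wrapper around the regex from the regex-based answer.
const stripTags = (html) => html.replace(/<[^>]+>/g, '');

// Made-up input: the alt attribute contains a '>' character.
const html = '<img alt="a > b" src="x.png">Hello';

// The match stops at the first '>' (inside the alt value),
// so the rest of the tag leaks into the result as text.
console.log(stripTags(html));
```

A DOM-based approach such as the jsdom answer parses attributes properly and is not affected by this input.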