Getting the actual human-readable text from XML/HTML?

Question

I'm trying to extract the text that is actually supposed to be read by humans from an EPUB (very similar to HTML). So far, I've managed to get rid of multiple spaces and hidden characters like line breaks. I had just started working on style tags (not sure what else might need to be solved) when I realised someone has probably already done this better than I can. Is there a library I could use?

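// Parse the EPUB chapter (XHTML) as XML and take the root element.
let dom = new DOMParser().parseFromString(string, "text/xml")
    .documentElement;
// Style nodes collected here are not handled yet, so their CSS still ends up in textContent.
let styles = dom.getElementsByTagName("style");

let text = dom.textContent
    .replace(/\s+/g, " ") // collapse line breaks, tabs and repeated spaces into one space
    .trim() // avoid empty entries at the start and end
    .split(" "); // make array

console.log(text);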
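
For what it's worth, one way the style-tag part could be approached is to walk the tree instead of reading `textContent` on the root, skipping elements whose content is never shown. This is only a rough sketch, not a library recommendation: the names `collectText` and `readableWords` are made up for illustration, and the two tag lists are my own guesses at what counts as "not rendered" and "block-level", so they are certainly incomplete.

```javascript
// Assumption: these tag lists are illustrative guesses, not an exhaustive set.
const SKIPPED_TAGS = new Set(["style", "script", "head", "title", "template"]);
const BLOCK_TAGS = new Set([
  "p", "div", "br", "li", "tr", "section", "blockquote",
  "h1", "h2", "h3", "h4", "h5", "h6",
]);

// Standard DOM nodeType values.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;
const CDATA_SECTION_NODE = 4;

// Hypothetical helper: gathers text fragments in document order,
// ignoring the subtrees of skipped elements entirely.
function collectText(node, out = []) {
  if (node.nodeType === TEXT_NODE || node.nodeType === CDATA_SECTION_NODE) {
    out.push(node.nodeValue);
  } else if (node.nodeType === ELEMENT_NODE) {
    const tag = node.localName.toLowerCase();
    if (SKIPPED_TAGS.has(tag)) return out;
    const isBlock = BLOCK_TAGS.has(tag);
    // Pad block elements with spaces so adjacent paragraphs do not fuse into one word.
    if (isBlock) out.push(" ");
    for (const child of Array.from(node.childNodes)) collectText(child, out);
    if (isBlock) out.push(" ");
  }
  return out;
}

// Hypothetical helper: returns the readable words under `root` as an array.
function readableWords(root) {
  const text = collectText(root).join("").replace(/\s+/g, " ").trim();
  return text === "" ? [] : text.split(" ");
}
```

In a browser this could be called as `readableWords(dom)` with the `dom` element from the snippet above. It only relies on `nodeType`, `localName`, `childNodes` and `nodeValue`, so it should not matter whether the document was parsed as XML or HTML.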

Have you thought of using [an Epub parser](https://github.com/futurepress/epub.js/)? It's [not advisable to parse XML or HTML with Regex](https://stackoverflow.com/a/1732454/12251171). — , Oct 26 '19 at 22:52
Yeah, I looked at Epub; it doesn't really deal with this either, as far as I can tell. If you just want to render the text, this won't be an issue. I'm not really using regex to parse the text; that's why I get the style nodes with `getElementsByTagName` and also use `textContent` — Himmators, Oct 27 '19 at 08:06


0 Answers