I'm trying to extract the text that is actually supposed to be read by humans from an EPUB (very similar to HTML). So far, I have managed to get rid of multiple spaces and hidden characters such as line breaks. I had just started working on style tags (I'm not sure what else might need handling) when I realised that someone has probably already done this better than I can. Is there a library I could use?
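// `string` is the XHTML source of one EPUB content document.
let dom = new DOMParser().parseFromString(string, "text/xml")
  .documentElement;
// Remove <style> elements so their CSS does not end up in textContent.
// Copy the live collection first, because removing nodes mutates it.
let styles = Array.from(dom.getElementsByTagName("style"));
styles.forEach((style) => style.remove());
let text = dom.textContent
  .replace(/\s+/g, " ") // collapse line breaks, tabs and repeated spaces into one space
  .trim() // avoid empty strings at the start and end of the array
  .split(" "); // make an array of words
console.log(text);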
});
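For illustration, here is a minimal sketch of the same idea done as a tree walk instead of string clean-up: visit every node, keep the text nodes, and skip elements whose content is never shown to the reader. The names `SKIPPED_ELEMENTS`, `collectText` and `readableWords` are hypothetical helpers, not part of any library, and the list of skipped elements is an assumption that probably needs extending for real EPUB files.

```javascript
// Assumption: these are the elements whose text is never displayed.
const SKIPPED_ELEMENTS = new Set(["style", "script"]);

// Standard DOM nodeType values for elements and text nodes.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Recursively gather the raw text of every text node under `node`,
// ignoring the subtrees of skipped elements (and comments etc.).
function collectText(node, parts = []) {
  if (node.nodeType === TEXT_NODE) {
    parts.push(node.nodeValue);
  } else if (
    node.nodeType === ELEMENT_NODE &&
    !SKIPPED_ELEMENTS.has(node.nodeName.toLowerCase())
  ) {
    for (const child of Array.from(node.childNodes)) {
      collectText(child, parts);
    }
  }
  return parts;
}

// Join the pieces the way textContent would (no separator), collapse all
// whitespace runs into single spaces, and split into words.
function readableWords(root) {
  const text = collectText(root).join("").replace(/\s+/g, " ").trim();
  return text === "" ? [] : text.split(" ");
}
```

In a browser this could be called as `readableWords(new DOMParser().parseFromString(string, "text/xml").documentElement)`. Because the pieces are joined without a separator, as `textContent` does, two block elements with no whitespace between them will run together, so a real implementation may want to insert a space at block boundaries.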