
I have a 2.2 MB HTML file, pure trash generated by Acrobat. I need to wrap every word in it in a span. But whenever I try, the rendered page starts showing parts of the source code.

Here is a small example:

<p class="s21" style="padding-top: 10pt;padding-left: 31pt;text-indent: 0pt;text-align: left;">CONTINGENCY TIMEL
        INES.. • • • • • •• • • • • • • • • • • •• • • • • • ••• • •• • • • • •• • • • • •• • •<span class="s25">
        </span><span class="s26"> </span>4-<span class="s27">1</span></p>
.word:hover {
    background-color: rgba(0,0,0,0.1);
}
const walkDOM = function (node, func) {
    func(node);
    node = node.firstChild;
    while(node) {
        walkDOM(node, func);
        node = node.nextSibling;

        if (node && node.nextSibling == undefined) {
            // console.log(node.innerHTML);
        }
    }
};



walkDOM(document.body, function(node) {

    if (node.nodeName == '#text') {

        let pnode = node.parentElement;
        pnode.innerHTML = pnode.innerHTML.replace(/(^|<\/?[^>]+>|\s+)([^\s<]+)/g, '$1<span class="word">$2</span>');

    }

});

https://codepen.io/clankill3r/pen/rNaNmxE

Outputs:

• • ••• • •• • • • • •• • • • • •• • •class="s25"> class="s26"> 4-1

Is there any way to wrap each individual word in a span without the pain of having to avoid the HTML tags?

clankill3r
  • Using regular expressions on HTML markup is the wrong way of doing things. https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/ – epascarello Dec 02 '19 at 14:32
  • Seems like you've also managed to wrap your spans too `class="s25">` – phuzi Dec 02 '19 at 14:35
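
To make the comments above concrete, here is a minimal sketch that runs the regex from the question on one tag from the sample, as a plain string with no DOM involved. The names `html` and `result` are only for illustration. The whitespace inside the opening tag satisfies the `\s+` alternative, so the attribute itself gets wrapped as if it were a word:

```javascript
// One tag from the sample markup, handled as a plain string.
const html = '<span class="s25"> </span>';

// The same regex and replacement used in the question.
const result = html.replace(
  /(^|<\/?[^>]+>|\s+)([^\s<]+)/g,
  '$1<span class="word">$2</span>'
);

// The space after "<span" matches \s+, so `class="s25">` is treated as a word.
console.log(result);
// → <span <span class="word">class="s25"></span> </span>
```

The opening tag is now broken, which is why the browser renders `class="s25">` as visible text.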

1 Answer

There's already a native way to walk a DOM tree: the TreeWalker API. It lets you filter on just the text nodes, which is what you're trying to do, so no elements are included:

const root = document.getElementById('root');
const treeWalker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);

let words = [];

while (treeWalker.nextNode()) {
  words = words.concat(treeWalker.currentNode.textContent.split(/(\s+)/).filter(e => e.trim().length > 0));
}

console.log(words);
<div id="root">
<p class="s21" style="padding-top: 10pt;padding-left: 31pt;text-indent: 0pt;text-align: left;">CONTINGENCY TIMEL
        INES.. • • • • • •• • • • • • • • • • • •• • • • • • ••• • •• • • • • •• • • • • •• • •<span class="s25">
        </span><span class="s26"> </span>4-<span class="s27">1</span></p>
</div>

N.B. this also drops the whitespace: the whitespace-only pieces are filtered out, so they are not included in the resulting array. You could adjust that.
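
To go from collecting the words to actually wrapping them, one possible approach is sketched below. It is not the only way to do it, and the helper names `tokenize` and `wrapWords` are made up for this example. It keeps the whitespace tokens (the adjustment mentioned above) so the spacing of the text is preserved, and it builds the spans with DOM methods instead of `innerHTML`, so tags and attributes are never touched. It assumes the same `#root` container and the `.word` class from the question:

```javascript
// Split text into alternating word / whitespace tokens.
// The capturing group in the regex keeps the whitespace runs in the result.
function tokenize(text) {
  return text.split(/(\s+)/).filter(token => token.length > 0);
}

// Wrap every word below `root` in <span class="word">, leaving markup untouched.
function wrapWords(root) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);

  // Collect the text nodes first: replacing nodes while walking
  // would disturb the traversal.
  const textNodes = [];
  while (walker.nextNode()) {
    textNodes.push(walker.currentNode);
  }

  for (const node of textNodes) {
    const fragment = document.createDocumentFragment();
    for (const token of tokenize(node.textContent)) {
      if (/^\s+$/.test(token)) {
        // Whitespace stays as plain text so the layout does not change.
        fragment.appendChild(document.createTextNode(token));
      } else {
        const span = document.createElement('span');
        span.className = 'word';
        span.textContent = token;
        fragment.appendChild(span);
      }
    }
    node.replaceWith(fragment);
  }
}

// Only touch the DOM when one exists (i.e. when running in a browser).
if (typeof document !== 'undefined') {
  const root = document.getElementById('root');
  if (root) {
    wrapWords(root);
  }
}
```

If you pass `document.body` as the root instead, text inside `<script>` and `<style>` elements would be wrapped too, so you would probably want to skip those.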

skyline3000