I'm writing a script to retrieve text nodes (and other related elements) from an HTML document. Based on this answer, I was using the following. (The definition of the acceptTextNode function is omitted for simplicity.)
    var textNodes = [];
    var treeWalker = document.createTreeWalker(
        rootNode,
        NodeFilter.SHOW_ALL,
        { acceptNode: acceptTextNode });
    while (treeWalker.nextNode())
        textNodes.push(treeWalker.currentNode);
However, I discovered that this approach fails when the document contains other documents nested within <iframe> elements, as in the "Compose" facility in Outlook.com. (Assume that the <iframe> documents are on the same domain as the parent document.)
I managed to work around the issue by retrieving the descendant documents manually, using getElementsByTagName:
    var textNodes = [];
    var rootNodes = [ rootNode ];
    for (var i = 0; i < rootNodes.length; i++)
    {
        if (rootNodes[i].getElementsByTagName)
        {
            var childFrames = rootNodes[i].getElementsByTagName("iframe");
            for (var j = 0; j < childFrames.length; j++)
                if (childFrames[j].contentDocument)
                    rootNodes.push(childFrames[j].contentDocument);
        }
    }
    for (var i = 0; i < rootNodes.length; i++)
    {
        var treeWalker = document.createTreeWalker(
            rootNodes[i],
            NodeFilter.SHOW_ALL,
            { acceptNode: acceptTextNode });
        while (treeWalker.nextNode())
            textNodes.push(treeWalker.currentNode);
    }
However, this feels like a hack, since it's combining manual traversal with the built-in TreeWalker. Is there a better approach?
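For comparison, the only single-pass alternative I can think of drops TreeWalker entirely and recurses over childNodes, stepping into each frame's contentDocument along the way. This is a rough sketch rather than a tested solution: collectNodes is a name I made up, and accept is assumed to be a plain predicate returning true or false (not a NodeFilter-style acceptNode callback like acceptTextNode).

```javascript
// Sketch only: collectNodes and the boolean `accept` predicate are
// hypothetical names, not part of the DOM API or of the script above.
function collectNodes(root, accept, out) {
    out = out || [];
    var children = root.childNodes || [];
    for (var i = 0; i < children.length; i++) {
        var node = children[i];
        if (accept(node))
            out.push(node);
        // nodeType 1 is ELEMENT_NODE. A same-origin <iframe> exposes its
        // document through contentDocument; a cross-origin one yields null.
        if (node.nodeType === 1 &&
            node.nodeName.toLowerCase() === "iframe" &&
            node.contentDocument) {
            // Descend into the framed document instead of the <iframe>
            // element's own children (its fallback content is skipped).
            collectNodes(node.contentDocument, accept, out);
        } else {
            collectNodes(node, accept, out);
        }
    }
    return out;
}
```

It would be called as collectNodes(rootNode, function (n) { return n.nodeType === 3; }) to gather text nodes in document order. Like nextNode(), it does not visit rootNode itself. It still amounts to manual traversal, though, so it doesn't really answer whether TreeWalker can cross frame boundaries on its own.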