
I'd like to check whether two HTML documents differ in their tag structure only, ignoring the text, and to pick out the branch(es) that differ.

For example :

html_1 = """
<p>i love it</p>
"""
html_2 = """ 
<p>i love it really</p>
"""

They share the same tag structure, so they should be considered the same. However:

html_1 = """
<div>
<p>i love it</p>
</div>
<p>i love it</p>
"""
html_2 = """ 
<div>
<p>i <em>love</em> it</p>
</div>
<p>i love it</p>
"""

I'd expect it to return the `<div>` branch, because that is where the tag structures differ. Could lxml, BeautifulSoup, or some other library achieve this? I'm looking for a way to actually pick out the differing branches.

Thanks

Kar

2 Answers

1

A more reliable approach would be to construct a Tree of tag names out of the document as discussed here:

Here is an example working solution based on treelib.Tree:

from bs4 import BeautifulSoup
from treelib import Tree


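def traverse(element, tree, parent_id=None):
    # Let treelib generate a unique identifier for each node, since tag names
    # repeat within a document and node identifiers must be unique.
    node = tree.create_node(element.name, parent=parent_id)

    # find_all(recursive=False) yields only the direct child tags, skipping text.
    for child in element.find_all(recursive=False):
        traverse(child, tree, node.identifier)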


def compare(html1, html2):
    tree1 = Tree()
    traverse(BeautifulSoup(html1, "html.parser"), tree1)
    tree2 = Tree()
    traverse(BeautifulSoup(html2, "html.parser"), tree2)

    return tree1.to_json() == tree2.to_json()

print(compare("<p>i love it</p>", "<p>i love it really</p>"))
print(compare("<p>i love it</p>", "<p>i <em>love</em> it</p>"))

Prints:

True
False
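
To get the differing branches rather than a single boolean, one option is to walk both documents in parallel and collect the top-level elements whose subtrees do not match. The following is only a minimal sketch: `same_structure` and `differing_branches` are hypothetical helper names, and it assumes that top-level elements can be paired up by position, which will not hold if one document has an extra element inserted near the start.

```python
from itertools import zip_longest

from bs4 import BeautifulSoup


def same_structure(a, b):
    """True if both elements have the same tag name and identically shaped subtrees."""
    if a is None or b is None:
        return False  # one document has no element at this position
    if a.name != b.name:
        return False
    kids_a = a.find_all(recursive=False)  # direct child tags only, text is skipped
    kids_b = b.find_all(recursive=False)
    return len(kids_a) == len(kids_b) and all(
        same_structure(x, y) for x, y in zip(kids_a, kids_b)
    )


def differing_branches(html1, html2):
    """Pair top-level elements by position (an assumption) and return mismatching pairs."""
    top1 = BeautifulSoup(html1, "html.parser").find_all(recursive=False)
    top2 = BeautifulSoup(html2, "html.parser").find_all(recursive=False)
    return [(a, b) for a, b in zip_longest(top1, top2) if not same_structure(a, b)]


html_1 = """
<div>
<p>i love it</p>
</div>
<p>i love it</p>
"""
html_2 = """
<div>
<p>i <em>love</em> it</p>
</div>
<p>i love it</p>
"""

for a, b in differing_branches(html_1, html_2):
    print(a.name, b.name)  # prints: div div
```

Each returned pair holds the actual BeautifulSoup elements, so the offending markup can be printed or inspected directly.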
alecxe
  • Thanks. I'm actually hoping to pick out the actual branches that contain the different tagging structures, could this be achieved with minor modifications? – Kar Jun 22 '15 at 20:13
  • @Kate you can iterate over the trees (`tree1` and `tree2`) and compare them node by node. – alecxe Jun 23 '15 at 14:21
0

Sample code to check whether the tagging structure of two HTML documents is the same or not.

Demo:

# Assumption: PARSER is lxml.html (the original snippet does not show the import).
import lxml.html as PARSER


def getTagSequence(content):
    """
    Return the tags of all elements in document order.
    """
    root = PARSER.fromstring(content)
    tag_sequence = []
    for elm in root.iter():
        tag_sequence.append(elm.tag)
    return tag_sequence

html_1_tags = getTagSequence(html_1)
html_2_tags = getTagSequence(html_2)

if html_1_tags == html_2_tags:
    print("Tagging structure is same.")
else:
    print("Tagging structure is different.")
    print("HTML 1 Tagging:", html_1_tags)
    print("HTML 2 Tagging:", html_2_tags)

Note:

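The code above checks the tag sequence only. It does not check the relationship between a parent and its children, so the following two documents are treated as the same even though their nesting differs: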

html_1 = """ <p> This <span>is <em>p</em></span> tag</p>"""
html_2 = """ <p> This <span>is </span><em>p</em> tag</p>"""
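
One way to also capture the parent/child relationship is to build a nested signature instead of a flat list. This is only a sketch, not part of the code above: `tag_tree` is a hypothetical helper, and it assumes lxml.html is the parser.

```python
import lxml.html


def tag_tree(element):
    """Return a nested (tag, children) tuple describing an element's tag structure."""
    # Iterating an lxml element yields its direct children. Comments and
    # processing instructions have a non-string .tag, so they are skipped.
    children = tuple(
        tag_tree(child) for child in element if isinstance(child.tag, str)
    )
    return (element.tag, children)


html_1 = """ <p> This <span>is <em>p</em></span> tag</p>"""
html_2 = """ <p> This <span>is </span><em>p</em> tag</p>"""

tree_1 = tag_tree(lxml.html.fromstring(html_1))
tree_2 = tag_tree(lxml.html.fromstring(html_2))

# False: the flat tag sequence is the same, but the nesting differs.
print(tree_1 == tree_2)
```

Because the signature is a plain tuple, two documents compare equal only when every element has the same tag and the same children in the same order.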
Vivek Sable
  • Thanks. I'm actually after something that identifies the branch different in tagging structure. I guess the examples I gave aren't particularly good since the structures are somewhat flat and not trees. – Kar Jun 22 '15 at 17:41
  • If I understand the code correctly, the code identifies whether the two pieces of HTML are different in tagging structure but not identifying the actual branches that contain the different tagging structure, right? I'm actually hoping to achieve the latter. – Kar Jun 22 '15 at 17:47
  • @Kate: branches means?? We have to pick any tag from the content and compare that tag, tagging structure in two different HTML's ? – Vivek Sable Jun 23 '15 at 07:25
  • Yes, that's right. For each tag in one document, compare that tag's structure with each tag's structure in another document. I think traditional iteration would be quite inefficient, so I wonder if there's something better. – Kar Jun 23 '15 at 07:31
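
On the efficiency concern in the last comment: one way to avoid comparing every tag against every tag is to compute a structural signature for each subtree once, store the signatures of the second document in a set, and then look up each subtree of the first document. The sketch below assumes lxml.html as the parser; `collect_signatures` and `unmatched_subtrees` are hypothetical names. Note that every ancestor of a differing branch is reported as well, since its signature contains the differing part.

```python
import lxml.html


def collect_signatures(element, out):
    """Return the element's (tag, children) signature and record one entry per subtree."""
    child_sigs = tuple(
        collect_signatures(child, out)
        for child in element
        if isinstance(child.tag, str)  # skip comments and processing instructions
    )
    sig = (element.tag, child_sigs)
    out.append((sig, element))  # children are recorded before their parent
    return sig


def unmatched_subtrees(html_a, html_b):
    """Elements of html_a with no structurally identical subtree anywhere in html_b."""
    subs_a, subs_b = [], []
    collect_signatures(lxml.html.fromstring(html_a), subs_a)
    collect_signatures(lxml.html.fromstring(html_b), subs_b)
    known = {sig for sig, _ in subs_b}  # set lookup replaces the all-pairs comparison
    return [element for sig, element in subs_a if sig not in known]


doc_a = "<div><p>x</p><ul><li>1</li></ul></div>"
doc_b = "<div><p>y</p><ul><li>1</li><li>2</li></ul></div>"

# The <ul> differs (one <li> versus two), and so does the <div> that contains it.
print([element.tag for element in unmatched_subtrees(doc_a, doc_b)])
```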