I am trying to convert every HTML node into an XPath. Here is a sample input; based on this HTML, I am looking for the XPath of every child node.

<html>
    <head>
        <title>
            The Dormouse's story
        </title>
    </head>
    <body>
        <p class="title">
            <b>
                The Dormouse's story
            </b>
        </p>
        <span>Hello</span>
    </body>
</html>

Output I want

html
html/head
html/head/title
html/body
html/body/p

What I have currently

{
    "name": "[document]",
    "attr": {},
    "children": [
        {
            "name": "html",
            "attr": {},
            "children": [
                {
                    "name": "head",
                    "attr": {},
                    "children": [
                        {
                            "name": "title",
                            "attr": {},
                            "children": []
                        }
                    ]
                },
                {
                    "name": "body",
                    "attr": {},
                    "children": [
                        {
                            "name": "p",
                            "attr": {
                                "class": [
                                    "title"
                                ]
                            },
                            "children": [
                                {
                                    "name": "b",
                                    "attr": {},
                                    "children": []
                                }
                            ]
                        },
                        {
                            "name": "span",
                            "attr": {},
                            "children": []
                        }
                    ]
                }
            ]
        }
    ]
}

The code

import json

from bs4 import BeautifulSoup

def traverse(soup):

    if soup.name is not None:
        dom_dictionary = {}
        dom_dictionary['name'] = soup.name
        dom_dictionary['attr'] = soup.attrs

        dom_dictionary['children'] = [
            traverse(child)
            for child in soup.children if child.name is not None
        ]

        return dom_dictionary

with open("html.txt", "r") as f:
    data = f.read()
    soup = BeautifulSoup(data, 'html.parser')
    JsonDom = traverse(soup)
    print(json.dumps(JsonDom, indent=4))


Any help, or a pointer in the right direction, would be much appreciated.

I looked into lxml, bs4, and Selenium, but had no luck.

Soumil Nitin Shah
  • Does this answer your question? [Get all child elements](https://stackoverflow.com/questions/24795198/get-all-child-elements) – Prophet Apr 21 '21 at 20:38
  • Also see https://stackoverflow.com/questions/14052368/how-to-get-all-descendants-of-an-element-using-webdriver; a few similar questions have already been asked and answered. – Prophet Apr 21 '21 at 20:39

1 Answer

from bs4 import BeautifulSoup

html_doc = """
<html>
    <head>
        <title>
            The Dormouse's story
        </title>
    </head>
    <body>
        <p class="title">
            <b>
                The Dormouse's story
            </b>
        </p>
        <span>Hello</span>
    </body>
</html>
"""


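def generate(soup, cur=""):
    # Visit only the direct child tags; the recursion handles deeper levels.
    for tag in soup.find_all(recursive=False):
        yield cur + tag.name  # path of this tag
        yield from generate(tag, cur=cur + tag.name + "/")  # paths of its descendants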


soup = BeautifulSoup(html_doc, "html.parser")  # you can also use  "lxml" or "html5lib"
for t in generate(soup):
    print(t)

Prints:

html
html/head
html/head/title
html/body
html/body/p
html/body/p/b
html/body/span
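
If you would rather keep the nested dictionary that the question's `traverse()` already produces, the same paths can probably be derived from it directly. Below is a minimal sketch, assuming the dictionary has the `name`/`children` shape printed in the question. `paths_from_dict` is a hypothetical helper name, the `dom` literal is a hand-written stand-in for that output (with `attr` omitted), and skipping the `[document]` root is my assumption about what is wanted.

```python
def paths_from_dict(node, prefix=""):
    """Yield slash-separated tag paths for every descendant of `node`.

    `node` is assumed to be a dict with a 'name' key and a 'children' list,
    as produced by the question's traverse(). The node itself is not
    emitted, so the synthetic '[document]' root is skipped.
    """
    for child in node.get("children", []):
        path = prefix + child["name"]
        yield path
        # Recurse with the child's path as the new prefix.
        yield from paths_from_dict(child, prefix=path + "/")


# Hand-written stand-in for the dictionary shown in the question.
dom = {
    "name": "[document]",
    "children": [
        {"name": "html", "children": [
            {"name": "head", "children": [
                {"name": "title", "children": []},
            ]},
            {"name": "body", "children": [
                {"name": "p", "children": [
                    {"name": "b", "children": []},
                ]},
                {"name": "span", "children": []},
            ]},
        ]},
    ],
}

paths = list(paths_from_dict(dom))
for path in paths:
    print(path)
```

Like `generate()` above, this emits plain tag paths without positional indices, so two sibling tags with the same name would produce the same path.
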
Andrej Kesely
  • You should consider using the `lxml` parser, which the bs4 [docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser) recommend for its speed. – αԋɱҽԃ αмєяιcαη Apr 21 '21 at 20:47