
I keep running into the same problem, in different forms, while web-scraping, and for some reason I can't seem to push my head through it.

The core of it is basically this:

HTML has a relatively flat organizational structure with some nesting implicit. I want to make that explicit.

To show what I mean, consider the following fictional menu snippet:

Motley Mess Menu

Breakfast

Omelets

Cheese

$7

American style omelet containing your choice of Cheddar, Swiss, Feta, Colby Jack or all four!

Sausage Mushroom

$8

American style omelet containing sausage, mushroom and Swiss cheese

Build-Your-Own

$8

American style omelet containing…you tell me!

Options (+50 cents after 3):

  • Cheddar
  • Swiss
  • Feta
  • Colby Jack
  • Bacon Bits
  • Sausage
  • Onion
  • Hamburger
  • Jalapenos
  • Hash Browns

Combos

...

When we read this menu we know that "Sausage Mushroom" is a type of "Omelet" served for "Breakfast" at the "Motley Mess." We understand the nesting just fine; however, if this were represented in HTML (or in Markdown, for that matter) all of those headers are flat, and without adding a series of divs that nesting is all implicit. If I'm web scraping I have no control over whether or not those divs are present.

I want to parse the html to make that nesting explicit. This is a problem I have come across time and time again scraping websites and I always find another way to solve the problem. I feel that this should be a relatively basic problem to solve, if not simple, but for some reason I can't get past the weird dynamic recursion that ends up necessary, and I think I'm grossly overcomplicating it.

This last snippet is an HTML/JSON pair showing what I would be happy to get from a hypothetical html_unpacker function:

html_string = """
<h1>Motley Mess Menu</h1>
<h2>Breakfast</h2>
<h3>Omelets</h3>
<h4>Cheese</h4>
<p>$7</p>
<p>American style omelet containing your choice of Cheddar, Swiss, Feta, Colby Jack or all four!</p>
<h4>Sausage Mushroom</h4>
<p>$8</p>
<p>American style omelet containing sausage, mushroom and Swiss cheese</p>
<h>Build-Your-Own</h4>
<p>$8</p>
<p>American style omelet containing…you tell me!</p>
<p>Options (+50 cents after 3):</p>
<ul>
<li>Cheddar</li>
<li>Swiss</li>
<li>Feta</li>
<li>Colby Jack</li>
<li>Bacon Bits</li>
<li>Sausage</li>
<li>Onion</li>
<li>Hamburger</li>
<li>Jalapenos</li>
<li>Hash Browns</li>
</ul>
<h3>Combos</h3>
<p>Each come with your choice of two sides</p>
<h4>Eggs and Bacon</h4>
<p>$8</p>
<p>Eggs cooked your way and crispy bacon. Sausage substitution is fine</p>
<h4>Glorious Smash</h4>
<p>$10</p>
<p>Your favorite breakfast of two pancakes, two eggs cooked your way, two sausages and two bacon, free of all trademark infringement! If you think you can finish it all then you forgot about the choice of two sides!</p>
"""

html_unpacker(html_string)

output:

{
    "Motley Mess Menu": {
        "Breakfast": {
            "Omelets": {
                "Cheese": {
                    "p1": "$7",
                    "p2": "American style omelet containing your choice of Cheddar, Swiss, Feta, Colby Jack or all four!"
                },
                "Sausage Mushroom": {
                    "p1": "$8",
                    "p2": "American style omelet containing sausage, mushroom and Swiss cheese"
                },
                "Build-Your-Own": {
                    "p1": "$8",
                    "p2": "American style omelet containing…you tell me!",
                    "p3": "Options (+50 cents after 3):",
                    "ul1": {
                        "li1": "Swiss",
                        "li2": "Feta",
                        "li3": "Colby Jack",
                        "li4": "Bacon Bits",
                        "li5": "Sausage",
                        "li6": "Onion",
                        "li7": "Hamburger",
                        "li8": "Jalapenos",
                        "li9": "Hash Browns"
                    }
                }
            },
            "Combos": {
                "p1": "Each come with your choice of two sides",
                "Eggs and Bacon": {
                    "p1": "$8",
                    "p2": "Eggs cooked your way and crispy bacon. Sausage substitution is fine"
                },
                "Glorious Smash": {
                    "p1": "$10",
                    "p2": "Your favorite breakfast of two pancakes, two eggs cooked your way, two sausages and two bacon, free of all trademark infringement! If you think you can finish it all then you forgot about the choice of two sides!"
                }
            }
        }
    }
}

I don't necessarily need that exact style of output, just something that makes the nesting explicit and maintains the type and order of non-header elements. Nesting that is already explicit (lists within lists and whatnot) needs to be preserved; I just need to add some explicit nesting based on header levels.

I'm not asking for someone to build a function from scratch, this just seems so basic I feel like something of this nature must already exist and my google-fu must just be failing me.

psychicesp
    Probably the key is to get a function that can be called recursively for every recognized html tag, increasing indentation by one for each opening tag and reducing indentation by one for each closing tag. Of course the function should call itself for each opening tag and return for each closing tag. BTW you have a typo at the line Build-Your-Own: the opening tag should be <h4>, I think.

    – Shiping Jan 22 '23 at 21:49
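
A minimal sketch of that recursive idea, adapted to nest flat siblings by header level (this assumes BeautifulSoup and borrows the hypothetical html_unpacker name from the question; it is an illustration of the approach, not the asker's or the commenter's code):

from bs4 import BeautifulSoup

HEADER_LEVELS = {f'h{i}': i for i in range(1, 7)}

def html_unpacker(html):
    soup = BeautifulSoup(html, 'html.parser')
    root = {}
    stack = [(0, root)]   # (header level, dict being filled); level 0 = document root
    counts = {}           # per-section counters for p1, p2, ul1, ...
    for el in soup.find_all(True):            # every tag, in document order
        level = HEADER_LEVELS.get(el.name)
        if level:
            # close any open section at the same or a deeper level, then nest
            while stack[-1][0] >= level:
                stack.pop()
            node = {}
            stack[-1][1][el.get_text(strip=True)] = node
            stack.append((level, node))
            counts = {}
        elif el.name == 'li':
            continue                          # picked up below as children of <ul>
        else:
            counts[el.name] = counts.get(el.name, 0) + 1
            key = f'{el.name}{counts[el.name]}'
            if el.name == 'ul':
                stack[-1][1][key] = {f'li{i}': li.get_text(strip=True)
                                     for i, li in enumerate(el.find_all('li'), 1)}
            else:
                stack[-1][1][key] = el.get_text(strip=True)
    return root

Run against the question's html_string with the <h4> typo fixed, this should reproduce the JSON shape shown above; anything beyond headings, p and ul (tables, wrapper divs, nested lists) needs more general handling, which is what the answer below aims for.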

1 Answer


You could do something like this:

# generates keys to be used in dictionary - tag name or object type
def getElemKey(pEl):
    if isinstance(pEl, str): return 'string'
    pName, pClass = getattr(pEl, 'name', None), str(type(pEl))
    pClass = pClass.replace("<class '",'',1).rstrip("'>").split('.')[-1]
    return str(pName).strip() if isinstance(pName, str) else pClass

# converts list of tuples to dictionary; adds _# to avoid duplicate keys
def dict_enumKeys(tupList, noCount=['textContent', 'contents']):
    invalidT = [t for t in tupList if not (
        isinstance(t, tuple) and len(t)==2
    )] if isinstance(tupList, list) else True
    if invalidT: return tupList
    if len(tupList)==1 and tupList[0][0] in noCount: return tupList[0][1]
    
    keyCt, toRet = {}, {}
    for k, v in tupList:
        kCt = keyCt[str(k)] = keyCt.get(str(k), 0) + 1
        if not (k in noCount and kCt==1): k = f'{k}_{kCt}'
        try: toRet[k] = dict_enumKeys(v)
        except RecursionError: toRet[k] = v
    return toRet
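
# For instance, on a flat list of (key, value) tuples the repeated keys get
# numbered; an illustrative call with dummy values, not part of the scraping flow:
# dict_enumKeys([('p', '$7'), ('p', 'with cheese'), ('h4', 'Cheese')])
# -> {'p_1': '$7', 'p_2': 'with cheese', 'h4_1': 'Cheese'}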

        

def nestHtmlChildren(pTag, chRef={}, asDict='no_str'): 
    chList, tnList = [], chRef.get(getElemKey(pTag), None)
    if tnList is not None:
        if not isinstance(tnList, list): tnList = [tnList]
        sel = ', '.join([f'{s}:not({s} {s})' for s in tnList])
        chList = [s for s in pTag.select(f':where({sel})')]
    chList = [c for c in chList if not(isinstance(c,str) and not str(c))]
    
    if chList:
        try: 
            tList = [(getElemKey(childEl), nestHtmlChildren(
                childEl, chRef=chRef, asDict=True
            )) for childEl in chList]
            return dict_enumKeys(tList) if asDict else tList
        except RecursionError: pass
    
    tCon = pTag.get_text(' ').strip() if callable(
        getattr(pTag, 'get_text', None)
    ) else str(pTag)
    return {'textContent': tCon} if asDict=='no_str' else (
        tCon if asDict else [('textContent', tCon)])
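
# An illustrative call on its own (assuming `from bs4 import BeautifulSoup`, as
# used further down); with a nested-children reference it splits a list into items:
# ul_tag = BeautifulSoup('<ul><li>Cheddar</li><li>Swiss</li></ul>', 'html.parser').ul
# nestHtmlChildren(ul_tag, chRef={'ul': 'li'})
# -> {'li_1': 'Cheddar', 'li_2': 'Swiss'}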

    
def nestHtmlSiblings(hSoup, levelsRef, cNestRef={}, recursive=False):
    sibList, isRoot = getattr(hSoup, 'contents', None), getElemKey(hSoup)
    if not hSoup: return hSoup
    if not isinstance(hSoup, list):  
        if not (sibList and isinstance(sibList, list)):
            hDict = nestHtmlChildren(hSoup, cNestRef, 'no_str') 
            return {getElemKey(hSoup): hDict} 
    else: sibList, isRoot = hSoup, False

    if not all([isinstance(s, tuple) and len(s)>1 for s in sibList]):
        sibList, retList = [
            s[:2] if isinstance(s, tuple) and len(s)>1
            else (getElemKey(s), s) for s in sibList
        ], False 
    else: retList = True 

    nestSibs, curContainer, sibContainer, curKey = [], [], [], None
    pKeys, maxLevel = list(levelsRef.keys()), max(levelsRef.values())
    for k, el in sibList + [(None, None)]:
        isLast = k is None and el is None
        invCur = curKey is None or curKey not in pKeys 
        if not (k in pKeys or isLast or invCur): 
            sibContainer.append((k, el))
            continue

        if curKey is not None:  
            try: 
                sibContainer = [s for s in sibContainer if not (
                    s[0]=='string' and not str(s[1]).strip()     )]
                for_cc = nestHtmlSiblings(
                    sibContainer, levelsRef, cNestRef, recursive) 
            except RecursionError as r:
                for_cc = [('error', f'{type(r)} {r}'), ('curEl', str(el))]
            nestSibs += [(curKey, curContainer+(for_cc if for_cc else[]))]
        
        
        curKey, curContainer, sibContainer = k, [], []
        pKeys = [
            lk for lk,l in levelsRef.items() 
            if levelsRef.get(k, maxLevel) >= l
        ]      
        if isLast: continue

        try:
            if recursive and callable(getattr(el, 'find', None)):
                if not isinstance(el,str) and el.find(pKeys):
                    curContainer.append(('contents', nestHtmlSiblings(
                        el.contents, levelsRef, cNestRef, recursive)))
                    continue
            curContainer += nestHtmlChildren(el, cNestRef, asDict=False)
        except RecursionError as r: 
            curContainer += [('error', f'{type(r)} {r}'), 
                             ('curEl', str(el))]
            
            
    if isRoot: nestSibsDict = {isRoot: dict_enumKeys(nestSibs)} 
    elif retList: nestSibsDict = nestSibs
    else: nestSibsDict = dict_enumKeys(nestSibs)
    return nestSibsDict

[ nestHtmlChildren wouldn't be necessary if you didn't also want to parse the li tags into items; they are not siblings but children of the ul tag... Also, I tried to wrap every recursive call in a try...except, but I'm not sure I got them all - it might be better to set a (constant or parametrized) limit on the recursion depth: check depth at the beginning of each function (except getElemKey, which is not recursive) and return null/error if the limit is exceeded. ]
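
A rough illustration of that depth check on a toy walker (MAX_DEPTH and walk are made-up names; in practice the same guard would sit at the top of nestHtmlSiblings and nestHtmlChildren, with depth+1 passed to every recursive call):

MAX_DEPTH = 50  # arbitrary cap; tune to the pages you scrape

def walk(node, depth=0):
    # bail out with an explicit marker instead of letting a RecursionError fire
    if depth > MAX_DEPTH:
        return ('error', f'depth limit {MAX_DEPTH} exceeded')
    children = [c for c in getattr(node, 'contents', []) if not isinstance(c, str)]
    return (getattr(node, 'name', 'string'), [walk(c, depth + 1) for c in children])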


from bs4 import BeautifulSoup

# with <h> tag fixed to <h4>
html_string = """
  <h1>Motley Mess Menu</h1>
  <h2>Breakfast</h2>
  <h3>Omelets</h3>
  <h4>Cheese</h4>
  <p>$7</p>
  <p>American style omelet containing your choice of Cheddar, Swiss, Feta, Colby Jack or all four!</p>
  <h4>Sausage Mushroom</h4>
  <p>$8</p>
  <p>American style omelet containing sausage, mushroom and Swiss cheese</p>
  <h4>Build-Your-Own</h4>
  <p>$8</p>
  <p>American style omelet containing…you tell me!</p>
  <p>Options (+50 cents after 3):</p>
  <ul>
  <li>Cheddar</li>
  <li>Swiss</li>
  <li>Feta</li>
  <li>Colby Jack</li>
  <li>Bacon Bits</li>
  <li>Sausage</li>
  <li>Onion</li>
  <li>Hamburger</li>
  <li>Jalapenos</li>
  <li>Hash Browns</li>
  </ul>
  <h3>Combos</h3>
  <p>Each come with your choice of two sides</p>
  <h4>Eggs and Bacon</h4>
  <p>$8</p>
  <p>Eggs cooked your way and crispy bacon. Sausage substitution is fine</p>
  <h4>Glorious Smash</h4>
  <p>$10</p>
  <p>Your favorite breakfast of two pancakes, two eggs cooked your way, two sausages and two bacon, free of all trademark infringement! If you think you can finish it all then you forgot about the choice of two sides!</p>
"""
soup = BeautifulSoup(html_string, 'html5lib')

tagLevels = {**{f'h{h}':h for h in range(1,7)}, 'p':8, 'ul':8}
# tagLevels = {'h1': 1, 'h2': 2, 'h3': 3, 'h4': 4, 'h5': 5, 'h6': 6, 'p': 8, 'ul': 8}
nestKids = {'ul': 'li'} # , 'table':'tr', 'tr':'td', 'dl':['dt','dd']}
obj = nestHtmlSiblings(soup.body.contents, tagLevels, nestKids)
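
# to check the result you can pretty-print it; the \u2026 escape in the expected
# output below is just what json.dumps makes of the ellipsis character
import json
print(json.dumps(obj, indent=4))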

obj should look like

{
    "h1_1": {
        "textContent": "Motley Mess Menu",
        "h2_1": {
            "textContent": "Breakfast",
            "h3_1": {
                "textContent": "Omelets",
                "h4_1": {
                    "textContent": "Cheese",
                    "p_1": "$7",
                    "p_2": "American style omelet containing your choice of Cheddar, Swiss, Feta, Colby Jack or all four!"
                },
                "h4_2": {
                    "textContent": "Sausage Mushroom",
                    "p_1": "$8",
                    "p_2": "American style omelet containing sausage, mushroom and Swiss cheese"
                },
                "h4_3": {
                    "textContent": "Build-Your-Own",
                    "p_1": "$8",
                    "p_2": "American style omelet containing\u2026you tell me!",
                    "p_3": "Options (+50 cents after 3):",
                    "ul_1": {
                        "li_1": "Cheddar",
                        "li_2": "Swiss",
                        "li_3": "Feta",
                        "li_4": "Colby Jack",
                        "li_5": "Bacon Bits",
                        "li_6": "Sausage",
                        "li_7": "Onion",
                        "li_8": "Hamburger",
                        "li_9": "Jalapenos",
                        "li_10": "Hash Browns"
                    }
                }
            },
            "h3_2": {
                "textContent": "Combos",
                "p_1": "Each come with your choice of two sides",
                "h4_1": {
                    "textContent": "Eggs and Bacon",
                    "p_1": "$8",
                    "p_2": "Eggs cooked your way and crispy bacon. Sausage substitution is fine"
                },
                "h4_2": {
                    "textContent": "Glorious Smash",
                    "p_1": "$10",
                    "p_2": "Your favorite breakfast of two pancakes, two eggs cooked your way, two sausages and two bacon, free of all trademark infringement! If you think you can finish it all then you forgot about the choice of two sides!"
                }
            }
        }
    }
}
Driftr95
  • This works and I'll take it as confirmation that I was not overthinking the problem. – psychicesp Feb 12 '23 at 00:52
  • @psychicesp glad it works for you too, and yeah this wasn't a simple solve - I've done [something a bit similar](https://stackoverflow.com/a/74036946/6146136) before [for scraping Wikipedia pages], but this had the additional complexities of nested children [like `ul>li`]; it might have been easier to just have a list of subsections [instead of using `dict_enumKeys`] with the tag names as dictionary values, but I liked your dictionary structure better – Driftr95 Feb 12 '23 at 02:11