BeautifulSoup: Classify parent and children element

Question

I have a question about BeautifulSoup in Python 3. I have spent a couple of hours on it but have not solved it yet.

This is my soup:

print(soup.prettify())
# REMEMBER THIS SOUP IS DYNAMIC
# <html>
#  <body>
#   <div class="title" itemtype="http://schema.org/FoodEstablishment">
#    <div class="address" itemtype="http://schema.org/PostalAddress">
#      <div class="address-inset">
#        <p itemprop="name">33 San Francisco</p>
#      </div>
#    </div>
#    <div class="image">
#      <img src=""/>
#      <span class="subtitle">image subtitle</span>
#    </div>
#    <a itemprop="name">The Dormouse's story</a>
#   </div>
#  </body>
# </html>

I have to extract two texts by itemprop="name": The Dormouse's story and 33 San Francisco. But I need a way to determine which parent (itemtype) each one belongs to.

Expected output:

{
   "FoodEstablishment": "The Dormouse's story",
   "PostalAddress": "33 San Francisco"
}

Remember the soup is always dynamic and has many children elements in it.

Once you have targeted the tag, you can just use the `parent` attribute to get the parent tag – Maaz, Mar 09 '20 at 16:35
@Maaz `parent` is useful for a simple soup. I am talking about a complex soup which has many parents – KitKit, Mar 09 '20 at 16:37

jose_bacoy · Answer 1 · 2020-03-09T18:26:00.440

I get the itemtype and contents of each tag, then create a dictionary using update.

from bs4 import BeautifulSoup

html = """<html>
 <body>
  <div class="title" itemtype="http://schema.org/FoodEstablishment">
     <div class="address" itemtype="http://schema.org/PostalAddress">
     <p itemprop="name">33 San Francisco</p>
   </div>
   <p itemprop="name">The Dormouse's story</p>
  </div>

 </body>
</html>
"""
d = {}
soup = BeautifulSoup(html, 'html.parser')
for item in soup.find_all("div", itemtype=True):
    # get the last string in itemtype separated by /
    item_type = item.get("itemtype").split('/')[-1]
    # keep only direct child tags, skipping whitespace text nodes
    child_tags = item.find_all(True, recursive=False)
    # create a dictionary of key: value from the last child tag
    d.update({item_type: child_tags[-1].text})

print(d)

Result: {'FoodEstablishment': "The Dormouse's story", 'PostalAddress': '33 San Francisco'}

Your approach is OK. I tried it before I posted this question on Stack Overflow. But I run into problems when the soup has a complex structure (many parent wrappers) – KitKit, Mar 10 '20 at 07:11
Many thanks, your approach is great. I am a Python learner and I learned a lot! Keep up the great work – zero, Mar 24 '20 at 22:27

αԋɱҽԃ αмєяιcαη · Answer 2 · 2020-03-09T17:23:57.497

from bs4 import BeautifulSoup


html = """<html>
 <body>
  <div class="title" itemtype="http://schema.org/FoodEstablishment">
   <div class="address" itemtype="http://schema.org/PostalAddress">
     <p itemprop="name">33 San Francisco</p>
   </div>
   <p itemprop="name">The Dormouse's story</p>
  </div>
 </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

a = [item.get("itemtype") for item in soup.findAll("div", {'itemtype': True})]
b = soup.find("div", {'itemtype': True}).get_text(
    strip=True, separator="|").split("|")

print(a)
print(b)

output:

['http://schema.org/FoodEstablishment', 'http://schema.org/PostalAddress']
['33 San Francisco', "The Dormouse's story"]

Update:

soup = BeautifulSoup(html, 'html.parser')

names = [item.text for item in soup.findAll("p", itemprop="name")]
print(names)

Output:

['33 San Francisco', "The Dormouse's story"]

Remember this soup is dynamic and has many elements in it. I updated the HTML soup – KitKit Mar 09 '20 at 17:14
@KitKit well, I don't understand your requirement. What do you want to match on? – αԋɱҽԃ αмєяιcαη Mar 09 '20 at 17:15
Can you show code which gets a list of all elements that have the attribute `itemprop="name"` and then checks the closest parent with an `itemtype`? – KitKit Mar 09 '20 at 17:19
@KitKit OK, let's walk through the process as I try to understand you. Check the update. Now we have the list. What do you want next? – αԋɱҽԃ αмєяιcαη Mar 09 '20 at 17:24
How do I know the wrapper element for each output? E.g. '33 San Francisco' => `PostalAddress` and "The Dormouse's story" => `FoodEstablishment` – KitKit Mar 09 '20 at 17:26
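
The request in this comment thread (list every `itemprop="name"` element, then look up its closest `itemtype` wrapper) can be sketched with BeautifulSoup's `find_parent`. This is only a minimal illustration, not the answerer's code: the HTML is a trimmed copy of the question's soup, and the `pairs` and `scope` names are made up for the example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's soup (assumption: same nesting as the original)
html = """
<div class="title" itemtype="http://schema.org/FoodEstablishment">
  <div class="address" itemtype="http://schema.org/PostalAddress">
    <div class="address-inset">
      <p itemprop="name">33 San Francisco</p>
    </div>
  </div>
  <a itemprop="name">The Dormouse's story</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for node in soup.find_all(attrs={"itemprop": "name"}):
    # find_parent walks up the tree and returns the nearest ancestor
    # that carries an itemtype attribute, or None if there is none
    scope = node.find_parent(attrs={"itemtype": True})
    itemtype = scope["itemtype"].split("/")[-1] if scope else "?"
    pairs.append((itemtype, node.get_text(strip=True)))

print(pairs)
# [('PostalAddress', '33 San Francisco'), ('FoodEstablishment', "The Dormouse's story")]
```

Because the lookup stops at the nearest ancestor with an `itemtype`, the extra `address-inset` wrapper does not matter here.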

JL Peyret · Accepted Answer · 2020-03-10T19:32:36.337

Why use soup.find when you can use soup.select, get help from all the CSS wiz kids and test your criteria in a browser first?

There's a performance benchmark on SO and select is faster, or at least not significantly slower, so that's not it. Habit, I guess.

(works just as well without the <p> tag qualifier, i.e. just "[itemprop=name]")

found = soup.select("p[itemprop=name]")

results = dict()

for node in found:

    itemtype = node.parent.attrs.get("itemtype", "?")
    itemtype = itemtype.split("/")[-1]
    results[itemtype] = node.text

print(results)

output:

{'PostalAddress': '33 San Francisco', 'FoodEstablishment': "The Dormouse's story"}

It is what you asked for, but if many nodes existed with FoodEstablishment, the last one would win, because you are using a dictionary. A defaultdict with a list might work better; that is for you to judge.

step 1, before Python: rock that CSS!

and if you need to check higher up ancestors for `itemtype`:

it would help if you had html with that happening:

    <div class="address" itemtype="http://schema.org/PostalAddress">
      <div>
        <p itemprop="name">33 San Francisco</p>  
      </div>

    </div>

found = soup.select("[itemprop=name]")

results = dict()

for node in found:

    itemtype = None
    parent = node.parent
    # walk up the ancestors until one carries an itemtype attribute
    while itemtype is None and parent is not None:
        itemtype = parent.attrs.get("itemtype")
        if itemtype is None:
            parent = parent.parent

    # fall back to "?" when no ancestor has an itemtype
    itemtype = itemtype or "?"
    itemtype = itemtype.split("/")[-1]
    results[itemtype] = node.text

print(results)

same output.

using a defaultdict

everything stays the same except for declaring the results and putting data into it.

from collections import defaultdict
...
results = defaultdict(list)
...

results[itemtype].append(node.text)

output (after I added a sibling to 33 San Francisco):

defaultdict(<class 'list'>, {'PostalAddress': ['33 San Francisco', '34 LA'], 'FoodEstablishment': ["The Dormouse's story"]})
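
For reference, the pieces above (CSS selection, the ancestor walk, and the `defaultdict`) can be put together into one runnable script. This is a sketch under assumptions: the HTML mirrors the question's nested structure with a `34 LA` sibling added, and `nearest_itemtype` is a hypothetical helper name, not something from BeautifulSoup.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Assumed test HTML: the question's structure plus a sibling name under PostalAddress
html = """
<div class="title" itemtype="http://schema.org/FoodEstablishment">
  <div class="address" itemtype="http://schema.org/PostalAddress">
    <div class="address-inset">
      <p itemprop="name">33 San Francisco</p>
      <p itemprop="name">34 LA</p>
    </div>
  </div>
  <a itemprop="name">The Dormouse's story</a>
</div>
"""


def nearest_itemtype(node):
    """Return the short itemtype of the closest ancestor that has one, else "?"."""
    for ancestor in node.parents:
        itemtype = ancestor.attrs.get("itemtype")
        if itemtype:
            return itemtype.split("/")[-1]
    return "?"


soup = BeautifulSoup(html, "html.parser")

# One list per itemtype, so repeated names are kept instead of overwritten
results = defaultdict(list)
for node in soup.select("[itemprop=name]"):
    results[nearest_itemtype(node)].append(node.get_text(strip=True))

print(dict(results))
# {'PostalAddress': ['33 San Francisco', '34 LA'], 'FoodEstablishment': ["The Dormouse's story"]}
```

Iterating over `node.parents` does the same job as the explicit `while` loop; which one reads better is a matter of taste.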

Thanks for your approach. It may work. But what do you think about an approach which gets the list of `itemtype` elements first, and then finds the `itemprop` in their `children` elements? I'll pick yours as the best answer if you also cover that approach. – KitKit, Mar 10 '20 at 07:23
??? I don't understand what you are saying, sorry. Why don't you try the different answers and see what works best? Don't forget to add a case where the `itemtype` is **not** found on the direct parent of `itemprop` element, because right now your test html does not have that. For me - but I am not talking about other answers - I would **not** search on `itemtype` and then find children with `itemprop`, too complicated. – JL Peyret, Mar 10 '20 at 07:43
@KitKit in fact I would add `<p itemprop="name">33 San Francisco</p><p itemprop="name">34 LA</p>` and see what happens. Only 34 LA is in the result, which is why I recommend you use a defaultdict. – JL Peyret, Mar 10 '20 at 07:50
I tried your code, fixed a few things, and now it works properly. I think this is the best solution for my issue. Thank you very much, bro – KitKit, Mar 10 '20 at 07:54
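
The alternative raised in the comments (start from the `itemtype` elements, then look for `itemprop` inside each) might look roughly like the sketch below. It is an illustration only, not code from any of the answers, and the HTML is an assumed trimmed copy of the question's soup. Note the extra check needed so that a name nested inside an inner `itemtype` is not also attributed to the outer one, which is part of why this direction is more involved.

```python
from bs4 import BeautifulSoup

# Assumed test HTML: trimmed copy of the question's soup
html = """
<div class="title" itemtype="http://schema.org/FoodEstablishment">
  <div class="address" itemtype="http://schema.org/PostalAddress">
    <div class="address-inset">
      <p itemprop="name">33 San Francisco</p>
    </div>
  </div>
  <a itemprop="name">The Dormouse's story</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

results = {}
for scope in soup.select("[itemtype]"):
    key = scope["itemtype"].split("/")[-1]
    results[key] = [
        node.get_text(strip=True)
        for node in scope.select("[itemprop=name]")
        # keep a name only if this scope is its nearest itemtype ancestor,
        # otherwise it belongs to a nested scope
        if node.find_parent(attrs={"itemtype": True}) is scope
    ]

print(results)
# {'FoodEstablishment': ["The Dormouse's story"], 'PostalAddress': ['33 San Francisco']}
```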
