4

I have a question about BeautifulSoup in Python 3. I have spent a couple of hours on it but have not solved it yet.

This is my soup:

print(soup.prettify())
# REMEMBER THIS SOUP IS DYNAMIC
# <html>
#  <body>
#   <div class="title" itemtype="http://schema.org/FoodEstablishment">
#    <div class="address" itemtype="http://schema.org/PostalAddress">
#      <div class="address-inset">
#        <p itemprop="name">33 San Francisco</p>
#      </div>
#    </div>
#    <div class="image">
#      <img src=""/>
#      <span class="subtitle">image subtitle</span>
#    </div>
#    <a itemprop="name">The Dormouse's story</a>
#   </div>
#  </body>
# </html>

I have to extract two texts by itemprop="name": The Dormouse's story and 33 San Francisco. But I also need a way to tell which parent (itemtype) each one belongs to.

Expected output:

{
   "FoodEstablishment": "The Dormouse's story",
   "PostalAddress": "33 San Francisco"
}

Remember that the soup is always dynamic and has many child elements in it.

KitKit

3 Answers

2

I get the itemtype and contents of each tag, then create a dictionary using update.

from bs4 import BeautifulSoup

html = """<html>
 <body>
  <div class="title" itemtype="http://schema.org/FoodEstablishment">
     <div class="address" itemtype="http://schema.org/PostalAddress">
     <p itemprop="name">33 San Francisco</p>
   </div>
   <p itemprop="name">The Dormouse's story</p>
  </div>

 </body>
</html>
"""
d = {}
soup = BeautifulSoup(html, 'html.parser')
for item in soup.find_all("div", itemtype=True):
    # get the last segment of the itemtype URL, e.g. "PostalAddress"
    itemType = item["itemtype"].split('/')[-1]
    # drop whitespace-only strings (newlines and indentation) from contents
    itemProp = [c for c in item.contents if str(c).strip()]
    # map the type to the text of the div's last remaining child
    d.update({itemType: itemProp[-1].text})

print(d)

Result: {'FoodEstablishment': "The Dormouse's story", 'PostalAddress': '33 San Francisco'} 
jose_bacoy
  • Your approach is OK. I tried it before posting this question on Stack Overflow, but I run into problems when the soup has a complex structure (many parent wrappers). – KitKit Mar 10 '20 at 07:11
  • many thanks - your approach is great - I am learning Python and I learned a lot!! Keep up your great work – zero Mar 24 '20 at 22:27
1
from bs4 import BeautifulSoup


html = """<html>
 <body>
  <div class="title" itemtype="http://schema.org/FoodEstablishment">
   <div class="address" itemtype="http://schema.org/PostalAddress">
     <p itemprop="name">33 San Francisco</p>
   </div>
   <p itemprop="name">The Dormouse's story</p>
  </div>
 </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

a = [item.get("itemtype") for item in soup.findAll("div", {'itemtype': True})]
b = soup.find("div", {'itemtype': True}).get_text(
    strip=True, separator="|").split("|")

print(a)
print(b)

output:

['http://schema.org/FoodEstablishment', 'http://schema.org/PostalAddress']
['33 San Francisco', "The Dormouse's story"]

Update:

soup = BeautifulSoup(html, 'html.parser')

names = [item.text for item in soup.findAll("p", itemprop="name")]
print(names)

Output:

['33 San Francisco', "The Dormouse's story"]
1

Why use soup.find when you can use soup.select, get help from all the CSS wiz kids and test your criteria in a browser first?

There's a performance benchmark on SO and select is faster, or at least not significantly slower, so that's not it. Habit, I guess.
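If you would rather measure than take my word for it, here is a minimal sketch of a local comparison. The synthetic HTML, the element count and the repeat count are arbitrary assumptions of mine, not taken from that benchmark, and the timings will vary with machine and library version.

```python
import timeit

from bs4 import BeautifulSoup

# Synthetic document (an assumption for illustration): 200 matching <p> tags.
html = "<div>" + '<p itemprop="name">x</p>' * 200 + "</div>"
soup = BeautifulSoup(html, "html.parser")

# Both calls should match the same elements.
by_find = soup.find_all("p", itemprop="name")
by_select = soup.select("p[itemprop=name]")

# Time each lookup over the same number of repetitions.
t_find = timeit.timeit(lambda: soup.find_all("p", itemprop="name"), number=20)
t_select = timeit.timeit(lambda: soup.select("p[itemprop=name]"), number=20)

print(f"find_all: {t_find:.4f}s  select: {t_select:.4f}s")
```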

(works just as well without the <p> tag qualifier, i.e. just "[itemprop=name]")

found = soup.select("p[itemprop=name]")

results = dict()

for node in found:

    itemtype = node.parent.attrs.get("itemtype", "?")
    itemtype = itemtype.split("/")[-1]
    results[itemtype] = node.text

print(results)

output:

{'PostalAddress': '33 San Francisco', 'FoodEstablishment': "The Dormouse's story"}

This is what you asked for, but if several nodes shared the same itemtype (say, FoodEstablishment), the last one would win, because a dictionary keeps only one value per key. A defaultdict with a list might work better; that is for you to judge.

step 1, before Python: rock that CSS!


And if you need to check ancestors higher up for the itemtype, it would help to have HTML where that actually happens:

    <div class="address" itemtype="http://schema.org/PostalAddress">
      <div>
        <p itemprop="name">33 San Francisco</p>
      </div>
    </div>

found = soup.select("[itemprop=name]")

results = dict()

for node in found:

    itemtype = None
    parent = node.parent
    while itemtype is None and parent is not None:
        itemtype = parent.attrs.get("itemtype")
        if itemtype is None:
            parent = parent.parent

    itemtype = itemtype or "?"
    itemtype = itemtype.split("/")[-1]
    results[itemtype] = node.text

print(results)

same output.
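As an alternative to the manual loop, here is a minimal sketch that lets Beautiful Soup do the ancestor walk through `find_parent`. The HTML is a trimmed copy of the question's soup (including the extra `address-inset` wrapper), and the `owner` variable name is just mine for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="title" itemtype="http://schema.org/FoodEstablishment">
  <div class="address" itemtype="http://schema.org/PostalAddress">
    <div class="address-inset">
      <p itemprop="name">33 San Francisco</p>
    </div>
  </div>
  <a itemprop="name">The Dormouse's story</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
results = {}

for node in soup.select("[itemprop=name]"):
    # Nearest ancestor carrying any itemtype attribute, or None if there is none.
    owner = node.find_parent(attrs={"itemtype": True})
    itemtype = owner["itemtype"] if owner is not None else "?"
    results[itemtype.split("/")[-1]] = node.get_text(strip=True)

print(results)
# {'PostalAddress': '33 San Francisco', 'FoodEstablishment': "The Dormouse's story"}
```

The behavior should match the loop above: the closest ancestor with an `itemtype` wins, however many wrapper divs sit in between.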

using a defaultdict

everything stays the same except for declaring the results and putting data into it.

from collections import defaultdict
...
results = defaultdict(list)
...

results[itemtype].append(node.text)

output (after I added a sibling to 33 San Francisco):

defaultdict(<class 'list'>, {'PostalAddress': ['33 San Francisco', '34 LA'], 'FoodEstablishment': ["The Dormouse's story"]})
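Putting the fragments together, a self-contained sketch of the defaultdict variant could look like this. The test HTML, including the `34 LA` sibling, is my reconstruction of the input that produces the output above, so treat it as an assumption.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

html = """
<div class="title" itemtype="http://schema.org/FoodEstablishment">
  <div class="address" itemtype="http://schema.org/PostalAddress">
    <p itemprop="name">33 San Francisco</p>
    <p itemprop="name">34 LA</p>
  </div>
  <p itemprop="name">The Dormouse's story</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
results = defaultdict(list)

for node in soup.select("[itemprop=name]"):
    # Walk up the tree until an ancestor with an itemtype is found.
    itemtype = None
    parent = node.parent
    while itemtype is None and parent is not None:
        itemtype = parent.attrs.get("itemtype")
        if itemtype is None:
            parent = parent.parent

    # Fall back to "?" when no ancestor declares an itemtype.
    itemtype = (itemtype or "?").split("/")[-1]
    # Append instead of overwrite, so repeated itemtypes keep every name.
    results[itemtype].append(node.text)

print(dict(results))
```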
JL Peyret
  • Thanks for your approach. It may work. But what do you think about getting the list of `itemtype` elements first, and then searching their `children` for the `itemprop`? I'll pick yours as the best answer if you support that approach as well as your own. – KitKit Mar 10 '20 at 07:23
  • ??? I don't understand what you are saying, sorry. Why don't you try the different answers and see what works best? Don't forget to add a case where the `itemtype` is **not** found on the direct parent of the `itemprop` element, because right now your test html does not have that. For me - but I am not talking about other answers - I would **not** search on `itemtype` and then find children with `itemprop`, too complicated. – JL Peyret Mar 10 '20 at 07:43
  • @KitKit in fact I would add `<p itemprop="name">33 San Francisco</p><p itemprop="name">34 LA</p>` and see what happens. Only 34 LA is in the result, which is why I recommend you use a defaultdict. – JL Peyret Mar 10 '20 at 07:50
  • I tried your code, fixed a few things, and now it works properly. I think this is the best solution for my issue. Thank you very much, bro – KitKit Mar 10 '20 at 07:54