38

I have XML files with a defined structure but a varying number of tags, like

file1.xml:

<document>
  <subDoc>
    <id>1</id>
    <myId>1</myId>
  </subDoc>
</document>

file2.xml:

<document>
  <subDoc>
    <id>2</id>
  </subDoc>
</document>

Now I would like to check whether the tag myId exists. So I did the following:

data = open("file1.xml",'r').read()
xml = BeautifulSoup(data)

hasAttrBs = xml.document.subdoc.has_attr('myID')
hasAttrPy = hasattr(xml.document.subdoc,'myID')
hasType = type(xml.document.subdoc.myid)

The result is for file1.xml:

hasAttrBs -> False
hasAttrPy -> True
hasType ->   <class 'bs4.element.Tag'>

file2.xml:

hasAttrBs -> False
hasAttrPy -> True
hasType -> <type 'NoneType'>

Okay, <myId> is not an attribute of <subdoc>.

But how can I test whether a sub-tag exists?

//Edit: By the way: I don't really want to iterate through the whole subdoc, because that will be very slow. I hope to find a way to directly address/ask for that element.

The Bndr

8 Answers

42
if tag.find('child_tag_name'):
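A minimal sketch of that check against the two sample documents from the question. The helper name has_descendant and the inline strings are made up for illustration, and it assumes the stdlib-backed "html.parser", which lowercases tag names:

```python
from bs4 import BeautifulSoup

# Inline copies of file1.xml and file2.xml from the question.
FILE1 = "<document><subDoc><id>1</id><myId>1</myId></subDoc></document>"
FILE2 = "<document><subDoc><id>2</id></subDoc></document>"


def has_descendant(markup, name):
    """Return True if a tag called `name` occurs anywhere in `markup`."""
    soup = BeautifulSoup(markup, "html.parser")
    # html.parser lowercases tag names, so search for the lowercase form.
    return soup.find(name.lower()) is not None


print(has_descendant(FILE1, "myId"))  # True
print(has_descendant(FILE2, "myId"))  # False
```

Comparing against None states the intent explicitly; find() returns None when nothing matches.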
ahuigo
16

The simplest way to check whether a child tag exists is:

childTag = xml.find('childTag')
if childTag:
    # do stuff

More specifically to OP's question:

If you don't know the structure of the XML doc, you can use the .find() method of the soup. Something like this:

with open("file1.xml",'r') as data, open("file2.xml",'r') as data2:
    xml = BeautifulSoup(data.read())
    xml2 = BeautifulSoup(data2.read())

    # The HTML parsers lowercase tag names, so search for "myid", not "myId".
    hasAttrBs = xml.find("myid")
    hasAttrBs2 = xml2.find("myid")

If you do know the structure, you can get the desired element by accessing the tag names as attributes, like this: xml.document.subdoc.myid. The whole thing would look something like this:

with open("file1.xml",'r') as data, open("file2.xml",'r') as data2:
    xml = BeautifulSoup(data.read())
    xml2 = BeautifulSoup(data2.read())

    hasAttrBs = xml.document.subdoc.myid
    hasAttrBs2 = xml2.document.subdoc.myid
    print(hasAttrBs)
    print(hasAttrBs2)

Prints

<myid>1</myid>
None
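Dotted access such as xml.document.subdoc.myid raises an AttributeError as soon as an intermediate tag is missing (None has no attributes). One possible way around that, sketched here under the assumption that the path of tag names is known in advance: walk it one level at a time with find(..., recursive=False). The helper find_by_path is hypothetical, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def find_by_path(root, *names):
    """Follow a known chain of child tag names; return the last tag or None."""
    node = root
    for name in names:
        # recursive=False looks only at direct children of the current node.
        node = node.find(name, recursive=False)
        if node is None:
            return None
    return node


# html.parser lowercases tag names, hence "subdoc" and "myid" below.
doc1 = BeautifulSoup(
    "<document><subDoc><id>1</id><myId>1</myId></subDoc></document>", "html.parser"
)
doc2 = BeautifulSoup(
    "<document><subDoc><id>2</id></subDoc></document>", "html.parser"
)

print(find_by_path(doc1, "document", "subdoc", "myid"))  # <myid>1</myid>
print(find_by_path(doc2, "document", "subdoc", "myid"))  # None
```

Each step inspects only one level of the tree, so nothing outside the known path is searched.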
wpercy
  • ...but `find()` searches through the document, right? But I know the position of the tag inside the XML tree (if it exists). So is there no easy way to directly address an element or check if that element exists? – The Bndr Oct 29 '15 at 15:00
  • Oh okay, I'm sorry I misunderstood the first time. I've updated my answer. – wpercy Oct 29 '15 at 15:38
  • Oh, I see.... "Keep it simple" is sometimes the best way. Thank you for opening my eyes... – The Bndr Nov 02 '15 at 13:25
4

Here's an example that checks whether an h2 tag exists in an Instagram page. Hope you find it useful:

import requests
from bs4 import BeautifulSoup

instagram_url = 'https://www.instagram.com/p/BHijrYFgX2v/?taken-by=findingmero'
html_source = requests.get(instagram_url).text
soup = BeautifulSoup(html_source, "lxml")

if not soup.find('h2'):
    print("didn't find h2")
GustavoIP
Mona Jalal
  • This line right here " if not soup.find('h2'):" just saved me tons of headaches. I didn't know about this. Thank you! – M4cJunk13 Jan 06 '18 at 01:28
  • within bs4 tags, use `has_attr(key)` instead, like `alt_image_text = [tag["alt"] for tag in images if tag.has_attr("alt")]`. Note that tag.src always seems to return None. – Marc Maxmeister Nov 06 '18 at 19:32
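To make the difference between attributes and child tags concrete, here is a small sketch. The markup and the attribute name "kind" are invented for illustration. It also shows why Python's built-in hasattr() reported True in the question: looking up an unknown name on a tag falls back to find() and returns None instead of raising.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<subdoc kind="a"><myid>1</myid></subdoc>', "html.parser")
tag = soup.subdoc

print(tag.has_attr("kind"))     # True: "kind" is an attribute of <subdoc>
print(tag.has_attr("myid"))     # False: <myid> is a child tag, not an attribute
print(tag.myid is not None)     # True: the child tag exists
print(hasattr(tag, "missing"))  # True, even though no such child exists
print(tag.missing is None)      # True: the lookup simply returned None
```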
1

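You can handle it like this:

def has_child(parent, name):
    # Scan only the direct children of parent for a tag with the given name.
    for child in parent.children:
        if child.name == name:
            return True
    return False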
chyoo CHENG
  • Thank you. But the thing is that I don't really want to iterate through the whole subdoc, because these are large docs and I have to walk through thousands of XML files. I hope to find a way to directly address/ask for that element. – The Bndr Oct 20 '15 at 14:02
1

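You can do it with if tag.myID: (with the HTML parsers, tag names are lowercased, so the name has to be written in lowercase there).

If you want to check that myID is a direct child rather than a deeper descendant, use if tag.find("myID", recursive=False):

If you want to check whether tag has any child tag at all, use if tag.find(True): (and if not tag.find(True): to test that it has none).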
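A short sketch of how these lookups behave. The nested markup, including the <wrapper> tag, is made up for illustration, and "html.parser" is assumed, so all names are lowercase:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<subdoc><id>1</id><wrapper><myid>1</myid></wrapper></subdoc>",
    "html.parser",
)
tag = soup.subdoc

# <myid> is a grandchild here, so only the recursive search finds it.
print(tag.find("myid") is not None)                   # True
print(tag.find("myid", recursive=False) is not None)  # False

# find(True) matches any tag, so it tells whether there are child tags at all.
print(tag.find(True) is not None)             # True: <subdoc> has child tags
print(tag.find("id").find(True) is not None)  # False: <id> holds only text
```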

LF00
1
import requests
from bs4 import BeautifulSoup

page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
soup = BeautifulSoup(page.content, 'html.parser')
testNode = list(soup.children)[1]

def hasChild(node):
    # Returns True if the node exposes a .children attribute at all.
    print(type(node))
    try:
        node.children
        return True
    except AttributeError:
        return False

if hasChild(testNode):
    firstChild = list(testNode.children)[0]
    if hasChild(firstChild):
        print('I found Grand Child ')
user2458922
0

If you are using a CSS selector:

content = soup_elm.select('.css_selector')
if len(content) == 0:
    return None
XY L
0

You could also try it this way:

response = requests.get("Your URL here")
soup = BeautifulSoup(response.text,'lxml')
RESULT = soup.select_one('CSS_SELECTOR_HERE') # for one element search 
print(RESULT)

Note that CSS selectors in Bs4 behave a little differently from the other search methods. See the Beautiful Soup documentation for how to use CSS selectors.

soup.select returns all matching elements instead of just the first one, and it also works for selectors that include attributes.
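Applied to the question's documents, a selector-based existence check could look roughly like this. It is a sketch that assumes "html.parser" (lowercased tag names) and uses the child combinator > so that only the known path document > subdoc > myid matches:

```python
from bs4 import BeautifulSoup

doc1 = BeautifulSoup(
    "<document><subDoc><id>1</id><myId>1</myId></subDoc></document>", "html.parser"
)
doc2 = BeautifulSoup(
    "<document><subDoc><id>2</id></subDoc></document>", "html.parser"
)

# Child combinators pin the tag to its expected position in the tree.
SELECTOR = "document > subdoc > myid"

print(doc1.select_one(SELECTOR) is not None)  # True
print(doc2.select_one(SELECTOR) is not None)  # False
print(len(doc2.select(SELECTOR)))             # 0: select() returns a list
```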

Stimmot