
I'm failing miserably to get an attribute value using BeautifulSoup and Python. Here is how the XML is structured:

...
</total>
<tag>
    <stat fail="0" pass="1">TR=111111 Sandbox=3000613</stat>
    <stat fail="0" pass="1">TR=121212 Sandbox=3000618</stat>
    ...
    <stat fail="0" pass="1">TR=999999 Sandbox=3000617</stat>
</tag>
<suite>
...

What I'm trying to get is the pass value, but for the life of me I just can't understand how to do it. I checked the BeautifulSoup documentation and it seems that I should be using something like stat['pass'], but that doesn't seem to work.

Here's my code:

with open('../results/output.xml') as raw_resuls:
    results = soup(raw_resuls, 'lxml')
for stat in results.find_all('tag'):
    print stat['pass']

If I do results.stat['pass'] it returns a value that is within another tag, way up in the XML blob.

If I print the stat variable I get the following:

<stat fail="0" pass="1">TR=787878 Sandbox=3000614</stat>
...
<stat fail="0" pass="1">TR=888888 Sandbox=3000610</stat>

Which seems to be ok.

I'm pretty sure that I'm missing something or doing something wrong. Where should I be looking? Am I taking the wrong approach?

Any advice or guidance will be greatly appreciated! Thanks

4 Answers


Please consider this approach:

from bs4 import BeautifulSoup

with open('test.xml') as raw_resuls:
    results = BeautifulSoup(raw_resuls, 'lxml')

for element in results.find_all("tag"):
    for stat in element.find_all("stat"):
        print(stat['pass'])

The problem with your code is that pass is an attribute of stat, not of tag, which is the element you search for.

This solution finds every tag element and, within each one, every stat element. It then reads pass from each stat.

For the XML file

<tag>
    <stat fail="0" pass="1">TR=111111 Sandbox=3000613</stat>
    <stat fail="0" pass="1">TR=121212 Sandbox=3000618</stat>
    <stat fail="0" pass="1">TR=999999 Sandbox=3000617</stat>
</tag>

the script above gets the output

1
1
1
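
As a possible shorthand for the nested loops, a CSS selector can express "every stat directly inside a tag" in one call. This is only a minimal sketch: the inline XML string and the html.parser choice are assumptions made to keep the example self-contained, and it relies on select() supporting the child combinator in your Beautiful Soup version.

```python
from bs4 import BeautifulSoup

# Inline sample data (assumption: stands in for the real file).
xml = """
<tag>
    <stat fail="0" pass="1">TR=111111 Sandbox=3000613</stat>
    <stat fail="0" pass="1">TR=121212 Sandbox=3000618</stat>
    <stat fail="0" pass="1">TR=999999 Sandbox=3000617</stat>
</tag>
"""

soup = BeautifulSoup(xml, "html.parser")

# "tag > stat" matches every <stat> that is a direct child of a <tag>.
passes = [stat["pass"] for stat in soup.select("tag > stat")]
print(passes)
```

Because the selector only matches stat elements whose parent is a tag, stat elements elsewhere in the document are skipped.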

Addition

Since some details still seemed to be unclear (see comments), consider this complete example that uses BeautifulSoup to get everything you want. Using dictionaries as list elements might not be ideal if you run into performance issues. But since you seem to have some trouble with Python and BeautifulSoup, I tried to keep this example as simple as possible by making all relevant information accessible by name rather than by index.

from bs4 import BeautifulSoup

# Parses a string of the form 'TR=abc123 Sandbox=abc123' and returns a dictionary with the
# following structure: {'TR': 'abc123', 'Sandbox': 'abc123'}.
def parseTestID(testid):
    parts = testid.split(" ")
    # build the result directly; naming it 'dict' would shadow the built-in type
    return {'TR': parts[0].split("=")[1], 'Sandbox': parts[1].split("=")[1]}

# Parses the XML content of 'rawdata' and stores the pass value, TR-ID and Sandbox-ID in a dictionary of the
# following form: {'Pass': pass value, 'TR': TR-ID, 'Sandbox': Sandbox-ID}. Each dictionary is appended to
# a list that is returned.
def getTestState(rawdata):
    # initialize parser
    soup = BeautifulSoup(rawdata,'lxml')
    parsedData= []

    # parse for tags
    for tag in soup.find_all("tag"):
        # parse tags for stat
        for stat in tag.find_all("stat"):
            # parse the element text once and store everything in a dictionary
            testid = parseTestID(stat.string)
            entry = {'Pass': stat['pass'], 'TR': testid['TR'], 'Sandbox': testid['Sandbox']}
            # append dictionary to list
            parsedData.append(entry)

    # return list
    return parsedData

You can use the functions above as follows to do whatever you want with the data (e.g. just print it):

# open file
with open('test.xml') as raw_resuls:
    # get list of parsed data 
    data = getTestState(raw_resuls)

# print parsed data
for element in data:
    print("TR = {0}\tSandbox = {1}\tPass = {2}".format(element['TR'],element['Sandbox'],element['Pass']))

The output looks like this

TR = 111111 Sandbox = 3000613   Pass = 1
TR = 121212 Sandbox = 3000618   Pass = 1
TR = 222222 Sandbox = 3000612   Pass = 1
TR = 232323 Sandbox = 3000618   Pass = 1
TR = 333333 Sandbox = 3000605   Pass = 1
TR = 343434 Sandbox = ZZZZZZ    Pass = 1
TR = 444444 Sandbox = 3000604   Pass = 1
TR = 454545 Sandbox = 3000608   Pass = 1
TR = 545454 Sandbox = XXXXXX    Pass = 1
TR = 555555 Sandbox = 3000617   Pass = 1
TR = 565656 Sandbox = 3000615   Pass = 1
TR = 626262 Sandbox = 3000602   Pass = 1
TR = 666666 Sandbox = 3000616   Pass = 1
TR = 676767 Sandbox = 3000599   Pass = 1
TR = 737373 Sandbox = 3000603   Pass = 1
TR = 777777 Sandbox = 3000611   Pass = 1
TR = 787878 Sandbox = 3000614   Pass = 1
TR = 828282 Sandbox = 3000600   Pass = 1
TR = 888888 Sandbox = 3000610   Pass = 1
TR = 999999 Sandbox = 3000617   Pass = 1

Let's summarize the core elements that are used:

Finding XML tags: To find XML tags, use soup.find("tag"), which returns the first matching tag, or soup.find_all("tag"), which finds all matching tags and returns them in a list. You can access the individual tags by iterating over the list.

Finding nested tags: To find nested tags, call find() or find_all() again on each element returned by the first find_all().

Accessing the content of a tag: To access the content of a tag, use its string attribute. For example, if tag is <tag>I love Soup!</tag>, then tag.string is "I love Soup!".

Finding values of attributes: To get the value of an attribute, use subscript notation. For example, if tag is <tag color="red">I love Soup!</tag>, then tag['color'] is "red".
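
The four points above can be tried out in a few lines. This is a minimal sketch on an inline one-element document; the html.parser choice and the variable names are assumptions for illustration, and .get() is shown as an addition to what the answer describes.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<tag color="red">I love Soup!</tag>', "html.parser")

first = soup.find("tag")       # first matching element, or None if there is none
every = soup.find_all("tag")   # list-like collection of all matching elements

text = first.string            # text content of the element
color = first["color"]         # attribute by subscript; raises KeyError if absent
missing = first.get("size")    # .get() returns None for a missing attribute

print(text, color, missing, len(every))
```

Subscript access is the stricter choice: it fails loudly when the attribute is missing, while .get() lets the caller decide what to do with None.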

For parsing strings of form "TR=abc123 Sandbox=abc123" I used common Python string splitting. You can read more about it here: How can I split and parse a string in Python?
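
The string splitting can also be written so that it does not hard-code the two key names. A minimal sketch, assuming every whitespace-separated part has the form key=value; parse_test_id is a hypothetical name, not part of the answer's code.

```python
def parse_test_id(test_id):
    """Turn 'TR=111111 Sandbox=3000613' into a dict keyed by the names before '='."""
    # split() with no argument splits on any run of whitespace;
    # split("=", 1) splits each part on the first "=" only.
    return dict(part.split("=", 1) for part in test_id.split())


ids = parse_test_id("TR=111111 Sandbox=3000613")
print(ids["TR"], ids["Sandbox"])
```

A part without an "=" makes dict() raise a ValueError, so malformed input is not silently accepted.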

dtell
  • I see, I understand now and totally makes sense! It works just fine now, appreciate it!. I have one more question if it's ok to ask: since I only have one `tag` attribute, is a for loop needed? If not, how can I go straight to that `tag` attribute? Thanks! –  Apr 03 '17 at 21:25
  • It's great that I could help you! You can show that this answer satisfies your needs by upvoting it and accepting it as the correct answer http://stackoverflow.com/help/someone-answers – dtell Apr 03 '17 at 21:28
  • If your XML file contains just one `<tag>` you can replace `for element in results.find_all("tag"):` with `element = results.find("tag")`. See this section of the BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find – dtell Apr 03 '17 at 21:35
  • Hi, thanks for your reply! I did as follows `element = results.find('tag')` and the _element_ variable contains the following `TR=111111 Sandbox=3000613`, instead of all the _stat_ tags as expected. Not sure if I'm doing something wrong or if the XML file is at fault. –  Apr 04 '17 at 13:48
  • I think I realized what's going on. There are a few `something` on the XML file. I'm sure that's messing around with the _find_. Probably will need to use some regex to get the `values` that I need. –  Apr 04 '17 at 13:57
  • I don't know the exact structure of your XML file but you don't need special regex to find your tag since BeautifulSoup definitely provides everything you need. I'm gonna add another answer in a few to show you some examples how you could use it. Just be so kind and add as much of the file structure in the comments as you can. – dtell Apr 04 '17 at 13:59
  • Probably I will need to do that. Not sure if I need BeautifulSoup after all, since the content of the XML file will always have the same structure. I pasted the XML file in here, since it's quite big: https://pastebin.com/apdBcWBj Appreciate your help, I'm completely new to this and there is so much to learn! –  Apr 04 '17 at 14:07
  • And you need all values for the `pass` tag? – dtell Apr 04 '17 at 14:15
  • I need the following data: the text inside the _pass_ tag (the value of either TR or Sandbox, which I get according a variable) and the actual _pass_ value that is within the _stat_ tag. The reason is that I need to parse whether the test passed or failed (the _pass_ value) and the test ID (what is inside the tag). –  Apr 04 '17 at 14:20
  • I just saw the addition you made to the original reply. Awesome! Thanks a lot for taking your time to explain with so much detail, I truly appreciate it! –  Apr 04 '17 at 15:41

The problem is that find_all('tag') returns the whole tag element, including all of its children:

>>> results.find_all('tag')                                                                      
[<tag>                                                                                     
<stat fail="0" pass="1">TR=111111 Sandbox=3000613</stat>                                   
<stat fail="0" pass="1">TR=121212 Sandbox=3000618</stat>                                   
<stat fail="0" pass="1">TR=999999 Sandbox=3000617</stat>                                   
</tag>]

Your intention is to collect each of the stat blocks, so you should be using results.find_all('stat'):

>>> stat_blocks = results.find_all('stat')
>>> stat_blocks
[<stat fail="0" pass="1">TR=111111 Sandbox=3000613</stat>, <stat fail="0" pass="1">TR=121212 Sandbox=3000618</stat>, <stat fail="0" pass="1">TR=999999 Sandbox=3000617</stat>]

From there, it is easy to collect the 'pass' values into a list:

>>> passes = [s['pass'] if s is not None else None for s in stat_blocks]                   
>>> passes                                                                                   
['1', '1', '1']  

Or print:

>>> for s in stat_blocks:                                                                  
...     print(s['pass'])                                                                   
...                                                                                        
1                                                                                          
1                                                                                          
1     

In Python, it's really important to test results, because the typing is too dynamic to rely on memory alone. I often include a test function in classes and modules to ensure that the return types and values are what I expect them to be.
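
One way to read this advice is a small self-check kept next to the parsing code. A minimal sketch under stated assumptions: collect_passes, _self_test and SAMPLE are hypothetical names, and html.parser is used only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Known input with a known expected result (assumption: one stat element).
SAMPLE = '<tag><stat fail="0" pass="1">TR=111111 Sandbox=3000613</stat></tag>'


def collect_passes(markup):
    """Return the 'pass' attribute of every <stat> element as a list of strings."""
    soup = BeautifulSoup(markup, "html.parser")
    return [stat["pass"] for stat in soup.find_all("stat")]


def _self_test():
    """Check both the return type and the values against the known input."""
    result = collect_passes(SAMPLE)
    assert isinstance(result, list)
    assert all(isinstance(value, str) for value in result)
    assert result == ["1"]


_self_test()
```

Note that the attribute values come back as strings, so a comparison such as value == 1 would be False; the check makes that visible early.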

Aaron3468
  • Thanks, it makes sense. I shouldn't have mentioned that there are more `stats` attributes on the XML file, but I'm only interested on the ones inside the `tag` node. Thanks for your reply, appreciate it! –  Apr 03 '17 at 21:28
  • @Xour Ah, fair enough, then you just use `results.find('tag').find_all('stat')`. Upvote any answers you find helpful and informative and double check that you've selected a best answer. Cheers! – Aaron3468 Apr 03 '17 at 21:33

Your "tag" can have multiple "stat" entries. Do you only have one "tag" entry?

If so then first find the "tag", then loop through the "stat" entries that are contained within the "tag" entry. Something like:

for stat in soup.find("tag").find_all("stat"):
    print(stat["pass"])
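
If this loop prints nothing, one possible cause is that find("tag") stopped at an earlier tag element that has no stat children. The sketch below assumes such a document; the inline XML, the html.parser choice and the variable names are all made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: an unrelated <tag> appears before the one holding the <stat> elements.
xml = """
<tag>something</tag>
<tag>
    <stat fail="0" pass="1">TR=111111 Sandbox=3000613</stat>
    <stat fail="0" pass="1">TR=121212 Sandbox=3000618</stat>
</tag>
"""

soup = BeautifulSoup(xml, "html.parser")

# find("tag") returns only the first <tag>, which contains no <stat> here.
first_tag = soup.find("tag")
print(len(first_tag.find_all("stat")))

# Keep only the <tag> elements that actually contain a <stat>.
stat_parents = [tag for tag in soup.find_all("tag") if tag.find("stat") is not None]
passes = [stat["pass"] for tag in stat_parents for stat in tag.find_all("stat")]
print(passes)
```

Filtering on tag.find("stat") keeps the nested-loop idea but ignores tag elements that only hold text.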
RobertB
  • Hi. Just one `tag` entry. However, for some reason, when I run your code it doesn't return anything. If I remove the `.find_all("stat")` part (just for debug) it returns the very first _stat_ tag. Thanks for your reply! –  Apr 03 '17 at 21:24
  • Based on @Aaron3468 and my post, you should be able to noodle it out. Doing a "find" on "tag" should return the entire contents of the "tag" which is all of the "stat"s. Not sure how to explain what you are seeing. – RobertB Apr 03 '17 at 21:36
  • I'm not sure whether is something on my XML file or what, but I tried that approach (same as suggested by datell above) but it returns nothing. If I do: `with open('../results/output.xml') as raw_resuls: results = soup(raw_resuls, 'lxml') for stat in results.find("tag").find_all("stat"): print 'test' print(stat["pass"])` Nothing is printed, even the _test_ string, not sure why. PS: Sorry, I just can't format the code properly! –  Apr 04 '17 at 13:41
  • I think I realized what's going on. There are a few `something` on the XML file. I'm sure that's messing around with the _find_. Probably will need to use some regex to get the `values` that I need. –  Apr 04 '17 at 13:57
  • regex is probably not needed. It is a nested loop. Just loop through the tags using a find_all on the soup looking for "tag" items. Within each of those items, do a separate find_all for the "stat" elements. The first find_all is on the "soup" the second find_all is on the individual elements returned by the first loop. – RobertB Apr 04 '17 at 16:36

If you are here, like me, looking for the simplest and shortest solution, try this to get an attribute from your tag.

from bs4 import BeautifulSoup

soup = BeautifulSoup('''
    <html>
        <h2 class="hello"> Heading 1 </h2>
        <h1> Heading 2 </h1>
    </html>
    ''', "lxml")

# Get the whole h2 tag
tag = soup.h2

# Get the attribute (class is multi-valued, so this is the list ['hello'])
attribute = tag['class']
muinh