I'm back with another issue. I'm using BeautifulSoup and am really new to parsing XML; I have been stuck on this problem for two weeks now and would appreciate your help. I have this structure:

<detail>
<page number="01">
    <Bloc code="AF" A="000000000002550" B="000000000002550"/>
    <Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
    <Bloc code="AR" A="000000000026935" B="000000000024503" C="000000000002431" D="000000000001669"/>
</page>
<page number="02">
    <Bloc code="DA" A="000000000038486" B="000000000038486"/>
    <Bloc code="DD" A="000000000003849" B="000000000003849"/>
    <Bloc code="EA" A="000000000001029"/>
    <Bloc code="EC" A="000000000063797" B="000000000082427"/>
</page>
    <page number="03">
    <Bloc code="FD" C="000000000574042" D="000000000610740"/>
    <Bloc code="GW" C="000000000052677" D="000000000075362"/>
</page>
</detail>

This is my code (I know it is poor and that I have to improve it):

if soup.find_all('bloc') != None:
    for element in soup.find_all('bloc'):
        code_element = element['code']
        if element.find('m1'):
            m1_element = element['m1']
        else:
            None
        if element.find('m2'):
            m2_element = element['m2']
        else:
            None
        print(code_element, m1_element, m2_element)

I get an error because 'm2' does not exist in all the pages, and I don't know how to handle this.

I would like to put the result in a DataFrame like this:

CODE  A           B             C           D           Page
AF    0000002550  00002550      NULL        NULL        01
AH    000035826   NULL          000035826   0000035826  01
AR    000026935   000000024503  0000002431  0000001669  01
... etc.

Thank you so much for your help

1 Answer

The core is a list comprehension over the bloc elements, with an embedded dict comprehension over each bloc's attributes. The page is added by merging in a second dict, whose value comes from navigating to the bloc's parent and reading its number attribute.

Column order follows the order in which the attributes are first seen.

from bs4 import BeautifulSoup
import pandas as pd
xml = """<detail>
<page number="01">
    <Bloc code="AF" A="000000000002550" B="000000000002550"/>
    <Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
    <Bloc code="AR" A="000000000026935" B="000000000024503" C="000000000002431" D="000000000001669"/>
</page>
<page number="02">
    <Bloc code="DA" A="000000000038486" B="000000000038486"/>
    <Bloc code="DD" A="000000000003849" B="000000000003849"/>
    <Bloc code="EA" A="000000000001029"/>
    <Bloc code="EC" A="000000000063797" B="000000000082427"/>
</page>
    <page number="03">
    <Bloc code="FD" C="000000000574042" D="000000000610740"/>
    <Bloc code="GW" C="000000000052677" D="000000000075362"/>
</page>
</detail>"""

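# html.parser lower-cases tag and attribute names, so search for "bloc", not "Bloc"
soup = BeautifulSoup(xml, "html.parser")
# one dict per bloc: its own attributes plus the number of the parent page
df = pd.DataFrame([{**{k:b[k] for k in b.attrs.keys()},**{"page":b.parent["number"]}}
                   for b in soup.find_all("bloc")])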


output

code               a               b page               c               d
  AF 000000000002550 000000000002550   01             NaN             NaN
  AH 000000000035826             NaN   01 000000000035826 000000000035826
  AR 000000000026935 000000000024503   01 000000000002431 000000000001669
  DA 000000000038486 000000000038486   02             NaN             NaN
  DD 000000000003849 000000000003849   02             NaN             NaN
  EA 000000000001029             NaN   02             NaN             NaN
  EC 000000000063797 000000000082427   02             NaN             NaN
  FD             NaN             NaN   03 000000000574042 000000000610740
  GW             NaN             NaN   03 000000000052677 000000000075362
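
If you would rather keep an explicit loop like the one in the question, `Tag.get()` returns `None` for a missing attribute instead of raising, which avoids the error. The following is only a sketch: the `wanted` list is my assumption about which attributes exist (taken from the sample), and the XML is a shortened copy of the question's data. Passing `columns=` also puts page last.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Shortened copy of the sample document from the question.
xml = """<detail>
<page number="01">
    <Bloc code="AF" A="000000000002550" B="000000000002550"/>
    <Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
</page>
<page number="03">
    <Bloc code="FD" C="000000000574042" D="000000000610740"/>
</page>
</detail>"""

# html.parser lower-cases tag and attribute names, hence "bloc", "a", "b", ...
soup = BeautifulSoup(xml, "html.parser")

# Assumed attribute set, read off the sample data; adjust to the real file.
wanted = ["code", "a", "b", "c", "d"]

rows = []
for bloc in soup.find_all("bloc"):
    # Tag.get() gives None when the attribute is absent, so no KeyError
    row = {name: bloc.get(name) for name in wanted}
    row["page"] = bloc.parent["number"]
    rows.append(row)

# columns= fixes the column order, with page last
df = pd.DataFrame(rows, columns=wanted + ["page"])
print(df)
```

Missing values show up as None/NaN in the frame, which is the NULL from the desired layout.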

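ElementTree

Very similar to the BeautifulSoup version. ElementTree keeps the original case, so the tag is "Bloc" and the columns come out as A, B, C, D.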

import xml.etree.ElementTree as ET
# reuses the xml string and the pandas import (pd) from the BeautifulSoup example
root = ET.fromstring(xml)
df2 = pd.DataFrame([{**b.attrib, **{"page":p.attrib["number"]}}
                    for p in root.iter("page")
                    for b in p.iter("Bloc") ])
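
Since the columns appear in the order the attributes are first seen, page lands in the middle. One way to force a fixed order is `DataFrame.reindex(columns=...)`. A minimal sketch with a shortened copy of the sample XML; the column list is an assumption based on the sample, and `xml_text` and `rows` are just names I picked for this example.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Shortened copy of the sample document from the question.
xml_text = """<detail>
<page number="01">
    <Bloc code="AF" A="000000000002550" B="000000000002550"/>
    <Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
</page>
<page number="03">
    <Bloc code="FD" C="000000000574042" D="000000000610740"/>
</page>
</detail>"""

root = ET.fromstring(xml_text)

# One dict per Bloc: its attributes plus the number of the enclosing page.
rows = [{**b.attrib, "page": p.get("number")}
        for p in root.iter("page")
        for b in p.iter("Bloc")]

df2 = pd.DataFrame(rows)

# reindex puts the columns in the requested order; a column that never
# occurred in the data would be created and filled with NaN.
df2 = df2.reindex(columns=["code", "A", "B", "C", "D", "page"])
print(df2)
```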

Rob Raymond
  • Hey Rob, you are really a super guy, thank you so much. Please just tell me: should I continue to learn BeautifulSoup to parse XML, or is this library more complicated and harder than ElementTree? If you have any advice to give me, I would appreciate it. By the way, thank you so much for your solution, this will really help me. I understood that the page column can't be last; that's okay for me, I am managing with that. Thank you again – The Swayly Jan 16 '21 at 15:58
  • I've not used `elementtree` before. Just tried it... really don't see anything to choose between the two. Personally I'd go with BeautifulSoup as it's useful to know for parsing HTML – Rob Raymond Jan 16 '21 at 16:46
  • Okay, thank you for your reply, and thank you for your second solution with ElementTree. Can you please tell me what the ** in your script means? – The Swayly Jan 16 '21 at 17:12
  • https://stackoverflow.com/questions/38987/how-do-i-merge-two-dictionaries-in-a-single-expression-in-python-taking-union-o `**` expands a dict, so I'm using it to merge two dictionaries. pls accept / upvote my answer – Rob Raymond Jan 16 '21 at 20:51
  • Okay, I've got it, thank you so much. Sorry, I am new to Stack Overflow; I have seen the vote button and clicked on it, I hope that is how I accept your answer. Just so you know, your solution works very well. Thank you so much. – The Swayly Jan 17 '21 at 17:46
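
To illustrate the `**` merge discussed in the comments above, here is a small standalone sketch. The dict contents are made-up sample values in the shape of one Bloc row, and the variable names are only for illustration.

```python
# Attributes of one Bloc and the extra page entry, as plain dicts.
attrs = {"code": "AF", "A": "000000000002550"}
extra = {"page": "01"}

# ** unpacks each dict's key/value pairs into the new dict literal,
# so this builds one merged dict without modifying attrs or extra.
merged = {**attrs, **extra}
print(merged)
# {'code': 'AF', 'A': '000000000002550', 'page': '01'}

# A literal key can be mixed in directly; the result is the same.
same = {**attrs, "page": "01"}

# If a key occurs more than once, the right-most value wins.
override = {**attrs, **{"code": "ZZ"}}
```

On Python 3.9 and later, `attrs | extra` gives the same merged dict.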