0

I'm a beginner to Python and using dataframes thorugh Pandas. I'm trying to extract a table from an XML file using xml.dom.minidom into an Excel file. This is what the original table should look like (notice the black entry under 'Bike'):

VEHICLE   BRAND
Car       Mercedes
Bike      Kawasaki
          Ducati
Truck     Ram

I am trying to extract this table from the given XML file:

<Info_Collection>
    <Info car="Car">
        <V_Collection>
            <Brand type="Mercedes"/>
        </V_Collection>
    </Info>
    <Info car="Bike">
        <V_Collection>
            <Brand type="Kawasaki"/>
            <Brand type="Ducati"/>
        </V_Collection>
    </Info>
    <Info car="Truck">
        <V_Collection>
            <Brand type="Ram"/>
        </V_Collection>
    </Info>
</Info_Collection>

This is the code that I am using:

def main():
  x1=[]
  x2=[]
  doc = xml.dom.minidom.parse('xml_file')
  t1 = doc.getElementsByTagName("Info")
  t2 = doc.getElementsByTagName("Brand")
  for a in t1:
     x1.append(tb.getAttribute("car"))
  for a in t2:
     x2.append(tb.getAttribute("type"))
  while len(x1) != len(x2): 
     x1.append("")
  boDF = pd.DataFrame({'VEHICLE': x1, 'BRAND':x2})
  boDF.to_excel(writer, sheet_name='Sheet1', index=0, startrow=1)
  writer.save()

if __name__ == "__main__":
  main()

After running it, the output table is as follows:

VEHICLE   BRAND
Car       Mercedes
Bike      Kawasaki
Truck     Ducati
          Ram

Could someone kindly help me figure out how to insert a space between 'Bike' and 'Truck'? I tried to run both for loops concurrently and compared their lengths to see if they were equal or not and when they are not, a blank space would be added to the first column. However, I cannot get it to work. I know that the while loop in my code adds a space to the end of the first column, but I cannot figure out how to add anywhere inside the column.

Alston
  • 1
  • 1
  • Is this maybe helping you https://stackoverflow.com/questions/52968877/read-xml-file-to-pandas-dataframe or this https://gokhanatil.com/2017/11/python-for-data-science-importing-xml-to-pandas-dataframe.html – SMS Dec 05 '19 at 17:49
  • Thank you for your comment, however, these links just show how to create a dataframe from an XML file. It does not explain how to add a space in between a column. – Alston Dec 05 '19 at 18:37
  • Is there any reason why you’re using minidom? Have you inspected the data at various steps in the program? https://ericlippert.com/2014/03/05/how-to-debug-small-programs/ I’m also curious as to why you want to leave a blank cell. – AMC Dec 05 '19 at 21:46
  • I am using minidom because I learnt it from an online course. I haven't tried using any other way to extract the file. Yes, I have inspected how the code runs at various parts of the method but cannot create a solution to work around this issue (not sure if I'm answering the question but I will go over the link and try to implement the tips.) And the whole point of adding the blank cell is so that, when you read the excel file, each cell align with their corresponding cell. (As per the above tables: Ducati makes bikes, not trucks. Hence, the 'Truck' cell should be moved a cell down.) – Alston Dec 06 '19 at 14:17

1 Answers1

0

See below.

The idea is to create a csv file which is usually associated with Excel. The XML parsing is done by builtin python XML parsing library 'ElementTree'. When you double click the csv file it will usually will be opened by Excel and you will get the table you are looking for.

Note that the XML you have posted is not a valid XML and had to be fixed (v_Collection Vs V_Collection)

import xml.etree.ElementTree as ET

xml = '''<Info_Collection>
    <Info car="Car">
        <V_Collection>
            <Brand type="Mercedes"/>
        </V_Collection>
    </Info>
    <Info car="Bike">
        <V_Collection>
            <Brand type="Kawasaki"/>
            <Brand type="Ducati"/>
        </V_Collection>
    </Info>
    <Info car="Truck">
        <V_Collection>
            <Brand type="Ram"/>
        </V_Collection>
    </Info>
</Info_Collection>'''

data = [['VEHICLE', 'BRAND']]
root = ET.fromstring(xml)
info_list = root.findall('.//Info')
cars = set()
for info_entry in info_list:
    car = info_entry.attrib['car']
    for brand in [brand.attrib['type'] for brand in info_entry.findall('.//Brand')]:
        data.append([car if car not in cars else '', brand])
        cars.add(car)
with open('cars.csv', 'w') as out:
    for entry in data:
        out.write(','.join(entry) + '\n')

output (cars.csv)

VEHICLE,BRAND
Car,Mercedes
Bike,Kawasaki
,Ducati
Truck,Ram
balderman
  • 22,927
  • 7
  • 34
  • 52
  • Thank you for your reply! I'm sorry for the errors in the xml file. But, I don't think I will be able to use this code because this is just a small part of the entire xml file and I know there is a way to scan the xml file directly. However, I am looking for a way to do this through minidom. I'm probably going to output each row into a separate dataframe. Thank you though! – Alston Dec 09 '19 at 14:59
  • This code can be modified and read a file instead (`tree = ET.parse('my_file.xml')`). So you can use this with no problems :-) – balderman Dec 09 '19 at 15:06