2

I want to start by saying I have no experience working with XML files. I just started poking around this data in the past couple of days and have some code that somewhat works, it just needs a couple tweaks. The code I have is here:

import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET

#parses the orders.xml file and stores in the "tree" variable
tree = ET.parse("orders (1).xml")

#gets the root of the parsed tree
root = tree.getroot()

#creating lists to store xml data in
OrderID = []
CustomerID = []
SKU = []
Price = []
Quantity = []

#goes through the xml file looking for the specified tag and stores the value into the list created above
for orderid in root.iter('OrderId'):
    OrderID.append(orderid.text)

for customerid in root.iter('CustomerId'):
   CustomerID.append(customerid.text)

for sku in root.iter('Sku'):
   SKU.append(sku.text)

for price in root.iter('PriceInclTax'):
   Price.append(price.text)

for quantity in root.iter('Quantity'):
    Quantity.append(quantity.text)






# writes the data stored in the lists above to a pandas dataframe
df = pd.DataFrame(
    list(zip(OrderID, CustomerID, SKU, Price, Quantity)), columns=['Order ID', 'Customer ID', 'SKU', 'Price', 'Quantity']
)

# prints the dataframe
print(df)

Here is the result:

  Order ID Customer ID                SKU          Price Quantity
0       518    18321022    2BF207HT750-150        83.2300        8
1       517    18265518    4BF209HT750-175       107.5200        2
2       516    18344682       HUCP211-35KX       560.0000        2
3       515    18344503        6206-2RC4KX        58.8300        1
4       514    18352425        6206-ZZC4KX        47.8900       13
5       513    18265596       HUCF206-19KX       242.0000        8
6       512    18352395    4BF209HT750-175       352.7800       10
7       511    18266176  4BF209SHT750-175P       825.7200       10
8       510    18351938         SUCF208-24  28251299.4606        4
9       509    18265518             125855       567.4500        1
10      508    18351852        6003-2RC4KX        73.4686        1
11      506    18350111    4BF212HT750-250       152.4800        4
12      505    18327149          6200-2RSH         4.4600        6
13      504    18351172        6205-2RC4KX        68.0700        1
14      503    18266179        6203-2RC4KX        62.3800        1
15      502    18266179    4BF210HT750-200       162.2300        6
16      501    18265517        6204-ZZC4KX        34.8400        1
17      500    18265548        CG3-7000-NE         0.0000        1
18      499    18348139         CG3-7000-E         0.0000        1
19      498    18321909         CG3-7000-E      8920.0000        1
20      497    18337101        CG3-7000-NE     10770.0000        1
21      495    18348878    4BF207HT750-150       281.6700       24
22      494    18348911          UCF211-35       185.4930        6
23      493    18321909        HUC207-23KX       276.3900        2
24      492    18265594  4BF316SHT100-400P      1801.3900       20
25      491    18265533  2PB316SHT750-400P      1651.4400        1
26      490    18347438         SUCF205-16       134.8800        2

The issue I am having is some of these orders have multiple items on it, and this is not reflecting that. The item data for each order is stored in the tag and then . The code I have does not pick up that there can be multiple 's for each . Here is the source code of the XML document for Order ID 515

<Order>
        <OrderId>515</OrderId>
        <OrderGuid></OrderGuid>
        <StoreId>1</StoreId>
        <CustomerId></CustomerId>
        <OrderStatusId></OrderStatusId>
        <PaymentStatusId></PaymentStatusId>
        <ShippingStatusId></ShippingStatusId>
        <CustomerLanguageId></CustomerLanguageId>
        <CustomerTaxDisplayTypeId></CustomerTaxDisplayTypeId>
        <CustomerIp></CustomerIp>
        <OrderSubtotalInclTax>681.4000</OrderSubtotalInclTax>
        <OrderSubtotalExclTax>681.4000</OrderSubtotalExclTax>
        <OrderSubTotalDiscountInclTax>0.0000</OrderSubTotalDiscountInclTax>
        <OrderSubTotalDiscountExclTax>0.0000</OrderSubTotalDiscountExclTax>
        <OrderShippingInclTax>0.0000</OrderShippingInclTax>
        <OrderShippingExclTax>0.0000</OrderShippingExclTax>
        <PaymentMethodAdditionalFeeInclTax>0.0000</PaymentMethodAdditionalFeeInclTax>
        <PaymentMethodAdditionalFeeExclTax>0.0000</PaymentMethodAdditionalFeeExclTax>
        <TaxRates>0.00:0.00;   </TaxRates>
        <OrderTax>0.0000</OrderTax>
        <OrderTotal>681.4000</OrderTotal>
        <RefundedAmount>0.0000</RefundedAmount>
        <OrderDiscount>0.0000</OrderDiscount>
        <CurrencyRate>1.0000</CurrencyRate>
        <CustomerCurrencyCode>USD</CustomerCurrencyCode>
        <AffiliateId>0</AffiliateId>
        <AllowStoringCreditCardNumber>False</AllowStoringCreditCardNumber>
        <CardType />
        <CardName />
        <CardNumber />
        <MaskedCreditCardNumber />
        <CardCvv2 />
        <CardExpirationMonth />
        <CardExpirationYear />
        <PaymentMethodSystemName>Payments.PurchaseOrder</PaymentMethodSystemName>
        <AuthorizationTransactionId />
        <AuthorizationTransactionCode />
        <AuthorizationTransactionResult />
        <CaptureTransactionId />
        <CaptureTransactionResult />
        <SubscriptionTransactionId />
        <PaidDateUtc>03/08/2023 21:29:31</PaidDateUtc>
        <ShippingMethod>UPS Next Day Air®</ShippingMethod>
    <ShippingRateComputationMethodSystemName>Shipping.Director</ShippingRateComputationMethodSystemName>
        <CustomValuesXml></CustomValuesXml>
        <VatNumber />
        <Deleted>False</Deleted>
        <CreatedOnUtc>3/8/2023 9:29:31 PM</CreatedOnUtc>
        <OrderItems>
            <OrderItem>
                <Id>633</Id>
                <OrderItemGuid>1cb035ca-73c8-4124-94d9-c14c54b669d8</OrderItemGuid>
                <Name>30mm 6206 High Temperature 220 °C Sealed C4 Bearing</Name>
                <Sku>6206-2RC4KX</Sku>
                <PriceExclTax>58.8300</PriceExclTax>
                <PriceInclTax>58.8300</PriceInclTax>
                <Quantity>1</Quantity>
                <DiscountExclTax>0.0000</DiscountExclTax>
                <DiscountInclTax>0.0000</DiscountInclTax>
                <TotalExclTax>58.8300</TotalExclTax>
                <TotalInclTax>58.8300</TotalInclTax>
            </OrderItem>
            <OrderItem>
                <Id>634</Id>
                <OrderItemGuid>1a4a005f-e669-4420-9379-611e5b3f637c</OrderItemGuid>
                <Name>30mm 6206 High Temperature 250 °C Shielded ZZ C4 Bearing</Name>
                <Sku>6206-ZZC4KX</Sku>
                <PriceExclTax>47.8900</PriceExclTax>
                <PriceInclTax>47.8900</PriceInclTax>
                <Quantity>13</Quantity>
                <DiscountExclTax>0.0000</DiscountExclTax>
                <DiscountInclTax>0.0000</DiscountInclTax>
                <TotalExclTax>622.5700</TotalExclTax>
                <TotalInclTax>622.5700</TotalInclTax>
            </OrderItem>
        </OrderItems>
        <Shipments>
            <Shipment>
                <ShipmentId>391</ShipmentId>
                <TrackingNumber>1Z4054470191053867</TrackingNumber>
                <TotalWeight>6.1600</TotalWeight>
                <ShippedDateUtc>3/8/2023 9:31:38 PM</ShippedDateUtc>
                <DeliveryDateUtc />
                <CreatedOnUtc>03/08/2023 21:31:38</CreatedOnUtc>
            </Shipment>
        </Shipments>
  </Order>

What I need help with is having the orders that have multiple line items on it, and being able to write that into the pandas dataframe. OrderID 515 should be the SKU from index 3 and index 4.

I was expecting there to be multiple lines for orders with more than one item, instead items from order 515 went to the next line with the next OrderID 516.

Here is the XML for 2 orders:

 <Order>
    <OrderId>516</OrderId>
    <OrderGuid></OrderGuid>
    <StoreId></StoreId>
    <CustomerId></CustomerId>
    <OrderStatusId>30</OrderStatusId>
    <PaymentStatusId>30</PaymentStatusId>
    <ShippingStatusId>30</ShippingStatusId>
    <CustomerLanguageId>1</CustomerLanguageId>
    <CustomerTaxDisplayTypeId>10</CustomerTaxDisplayTypeId>
    <CustomerIp></CustomerIp>
    <OrderSubtotalInclTax>1120.0000</OrderSubtotalInclTax>
    <OrderSubtotalExclTax>1120.0000</OrderSubtotalExclTax>
    <OrderSubTotalDiscountInclTax>0.0000</OrderSubTotalDiscountInclTax>
    <OrderSubTotalDiscountExclTax>0.0000</OrderSubTotalDiscountExclTax>
    <OrderShippingInclTax>0.0000</OrderShippingInclTax>
    <OrderShippingExclTax>0.0000</OrderShippingExclTax>
    <PaymentMethodAdditionalFeeInclTax>0.0000</PaymentMethodAdditionalFeeInclTax>
    <PaymentMethodAdditionalFeeExclTax>0.0000</PaymentMethodAdditionalFeeExclTax>
    <TaxRates>0.00:0.00;</TaxRates>
    <OrderTax>0.0000</OrderTax>
    <OrderTotal>1120.0000</OrderTotal>
    <RefundedAmount>0.0000</RefundedAmount>
    <OrderDiscount>0.0000</OrderDiscount>
    <CurrencyRate>1.0000</CurrencyRate>
    <CustomerCurrencyCode>USD</CustomerCurrencyCode>
    <AffiliateId>0</AffiliateId>
    <AllowStoringCreditCardNumber>False</AllowStoringCreditCardNumber>
    <CardType />
    <CardName />
    <CardNumber />
    <MaskedCreditCardNumber></MaskedCreditCardNumber>
    <CardCvv2 />
    <CardExpirationMonth />
    <CardExpirationYear />
    <PaymentMethodSystemName></PaymentMethodSystemName>
    <AuthorizationTransactionId></AuthorizationTransactionId>
    <AuthorizationTransactionCode></AuthorizationTransactionCode>
    <AuthorizationTransactionResult>Approved (1: This transaction has been approved.)</AuthorizationTransactionResult>
    <CaptureTransactionId></CaptureTransactionId>
    <CaptureTransactionResult>Approved (1: This transaction has been approved.)</CaptureTransactionResult>
    <SubscriptionTransactionId />
    <PaidDateUtc>03/09/2023 14:43:09</PaidDateUtc>
    <ShippingMethod>Other</ShippingMethod>
    <ShippingRateComputationMethodSystemName>Shipping.Director</ShippingRateComputationMethodSystemName>
    <CustomValuesXml />
    <VatNumber />
    <Deleted>False</Deleted>
    <CreatedOnUtc>3/9/2023 2:39:57 PM</CreatedOnUtc>
    <OrderItems>
        <OrderItem>
            <Id>635</Id>
            <OrderItemGuid>9dac1add-310b-41f7-a5af-36ff680b46de</OrderItemGuid>
            <Name>2-3/16" High Temperature 500 °F Mounted Pillow Block Bearing</Name>
            <Sku>HUCP211-35KX</Sku>
            <PriceExclTax>560.0000</PriceExclTax>
            <PriceInclTax>560.0000</PriceInclTax>
            <Quantity>2</Quantity>
            <DiscountExclTax>0.0000</DiscountExclTax>
            <DiscountInclTax>0.0000</DiscountInclTax>
            <TotalExclTax>1120.0000</TotalExclTax>
            <TotalInclTax>1120.0000</TotalInclTax>
        </OrderItem>
    </OrderItems>
    <Shipments>
        <Shipment>
            <ShipmentId>392</ShipmentId>
            <TrackingNumber>1ZB3B6620328442937</TrackingNumber>
            <TotalWeight>17.0000</TotalWeight>
            <ShippedDateUtc>3/9/2023 2:40:56 PM</ShippedDateUtc>
            <DeliveryDateUtc />
            <CreatedOnUtc>03/09/2023 14:40:55</CreatedOnUtc>
        </Shipment>
    </Shipments>
</Order>
<Order>
    <OrderId>515</OrderId>
    <OrderGuid></OrderGuid>
    <StoreId>1</StoreId>
    <CustomerId></CustomerId>
    <OrderStatusId>30</OrderStatusId>
    <PaymentStatusId>10</PaymentStatusId>
    <ShippingStatusId>30</ShippingStatusId>
    <CustomerLanguageId>1</CustomerLanguageId>
    <CustomerTaxDisplayTypeId>10</CustomerTaxDisplayTypeId>
    <CustomerIp></CustomerIp>
    <OrderSubtotalInclTax>681.4000</OrderSubtotalInclTax>
    <OrderSubtotalExclTax>681.4000</OrderSubtotalExclTax>
    <OrderSubTotalDiscountInclTax>0.0000</OrderSubTotalDiscountInclTax>
    <OrderSubTotalDiscountExclTax>0.0000</OrderSubTotalDiscountExclTax>
    <OrderShippingInclTax>0.0000</OrderShippingInclTax>
    <OrderShippingExclTax>0.0000</OrderShippingExclTax>
    <PaymentMethodAdditionalFeeInclTax>0.0000</PaymentMethodAdditionalFeeInclTax>
    <PaymentMethodAdditionalFeeExclTax>0.0000</PaymentMethodAdditionalFeeExclTax>
    <TaxRates>0.00:0.00;   </TaxRates>
    <OrderTax>0.0000</OrderTax>
    <OrderTotal>681.4000</OrderTotal>
    <RefundedAmount>0.0000</RefundedAmount>
    <OrderDiscount>0.0000</OrderDiscount>
    <CurrencyRate>1.0000</CurrencyRate>
    <CustomerCurrencyCode>USD</CustomerCurrencyCode>
    <AffiliateId>0</AffiliateId>
    <AllowStoringCreditCardNumber>False</AllowStoringCreditCardNumber>
    <CardType />
    <CardName />
    <CardNumber />
    <MaskedCreditCardNumber />
    <CardCvv2 />
    <CardExpirationMonth />
    <CardExpirationYear />
    <PaymentMethodSystemName>Payments.PurchaseOrder</PaymentMethodSystemName>
    <AuthorizationTransactionId />
    <AuthorizationTransactionCode />
    <AuthorizationTransactionResult />
    <CaptureTransactionId />
    <CaptureTransactionResult />
    <SubscriptionTransactionId />
    <PaidDateUtc>03/08/2023 21:29:31</PaidDateUtc>
    <ShippingMethod>UPS Next Day Air®</ShippingMethod>
    <ShippingRateComputationMethodSystemName>Shipping.Director</ShippingRateComputationMethodSystemName>
    <CustomValuesXml></CustomValuesXml>
    <VatNumber />
    <Deleted>False</Deleted>
    <CreatedOnUtc>3/8/2023 9:29:31 PM</CreatedOnUtc>
    <OrderItems>
        <OrderItem>
            <Id>633</Id>
            <OrderItemGuid>1cb035ca-73c8-4124-94d9-c14c54b669d8</OrderItemGuid>
            <Name>30mm 6206 High Temperature 220 °C Sealed C4 Bearing</Name>
            <Sku>6206-2RC4KX</Sku>
            <PriceExclTax>58.8300</PriceExclTax>
            <PriceInclTax>58.8300</PriceInclTax>
            <Quantity>1</Quantity>
            <DiscountExclTax>0.0000</DiscountExclTax>
            <DiscountInclTax>0.0000</DiscountInclTax>
            <TotalExclTax>58.8300</TotalExclTax>
            <TotalInclTax>58.8300</TotalInclTax>
        </OrderItem>
        <OrderItem>
            <Id>634</Id>
            <OrderItemGuid>1a4a005f-e669-4420-9379-611e5b3f637c</OrderItemGuid>
            <Name>30mm 6206 High Temperature 250 °C Shielded ZZ C4 Bearing</Name>
            <Sku>6206-ZZC4KX</Sku>
            <PriceExclTax>47.8900</PriceExclTax>
            <PriceInclTax>47.8900</PriceInclTax>
            <Quantity>13</Quantity>
            <DiscountExclTax>0.0000</DiscountExclTax>
            <DiscountInclTax>0.0000</DiscountInclTax>
            <TotalExclTax>622.5700</TotalExclTax>
            <TotalInclTax>622.5700</TotalInclTax>
        </OrderItem>
    </OrderItems>
    <Shipments>
        <Shipment>
            <ShipmentId>391</ShipmentId>
            <TrackingNumber>1Z4054470191053867</TrackingNumber>
            <TotalWeight>6.1600</TotalWeight>
            <ShippedDateUtc>3/8/2023 9:31:38 PM</ShippedDateUtc>
            <DeliveryDateUtc />
            <CreatedOnUtc>03/08/2023 21:31:38</CreatedOnUtc>
        </Shipment>
    </Shipments>
</Order>

2 Answers2

2

Consider relating your iterations with nested looping which you can achieve with comprehensions:

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse("Input.xml")

data = [
    {
     **{
         "OrderId": o.findtext("OrderId"),
         "CustomerId": o.findtext("CustomerId") 
     },
     **{ l.tag: l.text for l in ol.findall("*") }
     **{ l.tag: l.text for l in sh.findall("*") }
    }
    for o in tree.findall(".//Order")
    for ol in o.findall(".//OrderItem")
    for sh in o.findall(".//Shipment")
]

orders_df = pd.DataFrame(data)[
    ["OrderId", "CustomerId", "Sku", "PriceInclTax", "Quantity", "ShippedDateUtc"]
]

orders_df
#   OrderId CustomerId           Sku PriceInclTax Quantity       ShippedDateUtc
# 0     516             HUCP211-35KX     560.0000        2  3/9/2023 2:40:56 PM
# 1     515              6206-2RC4KX      58.8300        1  3/8/2023 9:31:38 PM
# 2     515              6206-ZZC4KX      47.8900       13  3/8/2023 9:31:38 PM
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • 1
    This is a very clever solution, but for those who stumble upon it with a bit less experience: Parfait is using multiple _list comprehensions_ as well as _dict unpacking_. So it is finding all orders, then all order items, then all of the tags and text, and creating a list of dictionaries. – Mike Williamson May 03 '23 at 10:30
  • 1
    Thanks @MikeWilliamson! Actually, in addition to nested list comprehensions, this solution uses [dictionary merging](https://stackoverflow.com/a/26853961/1422451) introduced in Python 3.5. – Parfait May 03 '23 at 17:54
  • @Parfait Thanks for helping with this problem back in March. I was curious if you could help also grab the tag from inside the and then tag. I figure it will be similar to how this goes into the tag and grabs info. I tried adding for ol in o.findall(".//Shipment") but that didn't work – Christian Brune Jun 02 '23 at 17:09
  • See [update](https://stackoverflow.com/posts/75692199/revisions) adding another outside list comprehension and inner dict comprehension. – Parfait Jun 02 '23 at 21:53
  • @Parfait This should be the last thing I need lol. Thank you for all of your help. The last piece of info I need is the tag that is inside the tag. There is however another tag within the tag and when I add "CreatedOnUtc" to the orders_df = pd.DataFrame(data) section it prints out the value from inside the tag – Christian Brune Jun 07 '23 at 16:05
  • Add a prefix to dict key: `'Order_' + l.tag` – Parfait Jun 07 '23 at 23:45
0

try using pandas.read_xml

import pandas as pd

# read the xml file with xpath
# we only care about the Order and OrderItem nodes so we pull
# everything from there and select the columns we want to keep
df = pd.read_xml('orders (1).xml', xpath='.//Order|.//OrderItem')[['OrderId', 'CustomerId', 'Sku', 'PriceExclTax', 'Quantity']]
# forward fill the OrderId
df['OrderId'] = df['OrderId'].ffill()
# only keep the rows that have a Sku
df = df[df['Sku'].notnull()].copy()

   OrderId  CustomerId           Sku  PriceExclTax  Quantity
1    516.0         NaN  HUCP211-35KX        560.00       2.0
3    515.0         NaN   6206-2RC4KX         58.83       1.0
4    515.0         NaN   6206-ZZC4KX         47.89      13.0

For xpath . selects the current node and // selects nodes in the document from the current node that match the selection no matter where they are. | is the or operator.

It_is_Chris
  • 13,504
  • 2
  • 23
  • 41
  • I am getting the following error when trying to run this code: ImportError: lxml not found, please install or use the etree parser. – Christian Brune Mar 10 '23 at 13:22
  • You can either install `lxml` or just add `parser='etree'` since you have that installed already --`df = pd.read_xml('orders (1).xml', xpath='.//Order|.//OrderItem', parser='etree')` – It_is_Chris Mar 10 '23 at 14:12
  • etree parser will not support this advanced xpath expression. – Parfait Mar 10 '23 at 20:40