0

Is there a way to convert an XML file (financial statements from IB API) to Pandas without knowing the exact column headers? The rows should reflect the different dates (there are 4 or more datapoints per column). Would also be great to get balance sheet, income statement and cash flow statement separately. I have tried to use beautiful soup but am getting frustrated because it seems like I need to look for each column header specifically and I don't know how to get the data for each date.

Im trying to get three separate dataframes (one for each financial statement). sorry I don't know how to add a table here but they should look something like this.

Df1 name = income statement

|---------------------|------------------|------------------|
|      Date           |     SREV         |  VDES            |
|---------------------|------------------|------------------|
|   2018-09-29        | 265595.000000    | 12.208930        |
|---------------------|------------------|------------------|
|  .....              |   ......         | .....            |
|---------------------|------------------|------------------|

the example is from this segment:

    <Statement Type="INC">
                        <FPHeader>
                            <PeriodLength>52</PeriodLength>
                            <periodType Code="W">Weeks</periodType>
                            <UpdateType Code="UPD">Updated Normal</UpdateType>
                            <AccountingStd/>
                            <StatementDate>2018-09-29</StatementDate>
                            <AuditorName Code="EY">Ernst &amp; Young LLP</AuditorName>
                            <AuditorOpinion Code="UNQ">Unqualified</AuditorOpinion>
                            <Source Date="2018-11-05">10-K</Source>
                        </FPHeader>
                        <lineItem coaCode="SREV">265595.000000</lineItem>
    (...)
                        <lineItem coaCode="VDES">12.208930</lineItem>

This is the XML file (about half due to character limit):

<?xml version="1.0" encoding="utf-8"?>
<ReportFinancialStatements Major="1" Minor="0" Revision="1">
    <CoIDs>
        <CoID Type="RepNo">05680</CoID>
        <CoID Type="CompanyName">Apple Inc.</CoID>
        <CoID Type="IRSNo">942404110</CoID>
        <CoID Type="CIKNo">0000320193</CoID>
    </CoIDs>
    <Issues>
        <Issue Desc="Common Stock" ID="1" Order="1" Type="C">
            <IssueID Type="Name">Ordinary Shares</IssueID>
            <IssueID Type="Ticker">AAPL</IssueID>
            <IssueID Type="RIC">AAPL.O</IssueID>
            <IssueID Type="DisplayRIC">AAPL.OQ</IssueID>
            <IssueID Type="InstrumentPI">331724</IssueID>
            <IssueID Type="QuotePI">7645713</IssueID>
            <Exchange Code="NASD" Country="USA">NASDAQ</Exchange>
            <MostRecentSplit Date="2014-06-09">7.0</MostRecentSplit>
        </Issue>
    </Issues>
    <CoGeneralInfo>
        <CoStatus Code="1">Active</CoStatus>
        <CoType Code="EQU">Equity Issue</CoType>
        <LastModified>2020-01-23</LastModified>
        <LatestAvailableAnnual>2019-09-28</LatestAvailableAnnual>
        <LatestAvailableInterim>2019-09-28</LatestAvailableInterim>
        <ReportingCurrency Code="USD">U.S. Dollars</ReportingCurrency>
        <MostRecentExchange Date="2020-01-22">1.0</MostRecentExchange>
    </CoGeneralInfo>
    <StatementInfo>
        <COAType Code="IND">Industry</COAType>
        <BalanceSheetDisplay Code="CUR">Differentiates</BalanceSheetDisplay>
        <CashFlowMethod Code="IND">Indirect</CashFlowMethod>
    </StatementInfo>
    <Notes>
        <CFAAvailability Code="1"/>
        <IAvailability Code="1"/>
        <ISIAvailability Code="1"/>
        <BSIAvailability Code="1"/>
        <CFIAvailability Code="1"/>
    </Notes>
    <FinancialStatements>
        <COAMap>
            <mapItem coaItem="SREV" lineID="100" precision="1" statementType="INC">Revenue</mapItem>
(...)
            <mapItem coaItem="SCTP" lineID="1050" precision="1" statementType="CAS">Cash Taxes Paid</mapItem>
        </COAMap>
        <AnnualPeriods>
            <FiscalPeriod EndDate="2019-09-28" FiscalYear="2019" Type="Annual">
                <Statement Type="INC">
                    <FPHeader>
                        <PeriodLength>52</PeriodLength>
                        <periodType Code="W">Weeks</periodType>
                        <UpdateType Code="UPD">Updated Normal</UpdateType>
                        <AccountingStd/>
                        <StatementDate>2019-09-28</StatementDate>
                        <AuditorName Code="EY">Ernst &amp; Young LLP</AuditorName>
                        <AuditorOpinion Code="UNQ">Unqualified</AuditorOpinion>
                        <Source Date="2019-10-31">10-K</Source>
                    </FPHeader>
                    <lineItem coaCode="SREV">260174.000000</lineItem>
(...)
                    <lineItem coaCode="VDES">11.885790</lineItem>
                </Statement>
                <Statement Type="BAL">
                    <FPHeader>
                        <UpdateType Code="UPD">Updated Normal</UpdateType>
                        <StatementDate>2019-09-28</StatementDate>
                        <AuditorName Code="EY">Ernst &amp; Young LLP</AuditorName>
                        <AuditorOpinion Code="UNQ">Unqualified</AuditorOpinion>
                        <Source Date="2019-10-31">10-K</Source>
                    </FPHeader>
                    <lineItem coaCode="ACSH">12204.000000</lineItem>
(...)
                    <lineItem coaCode="STBP">20.365340</lineItem>
                </Statement>
                <Statement Type="CAS">
                    <FPHeader>
                        <PeriodLength>52</PeriodLength>
                        <periodType Code="W">Weeks</periodType>
                        <UpdateType Code="UPD">Updated Normal</UpdateType>
                        <StatementDate>2019-09-28</StatementDate>
                        <AuditorName Code="EY">Ernst &amp; Young LLP</AuditorName>
                        <AuditorOpinion Code="UNQ">Unqualified</AuditorOpinion>
                        <Source Date="2019-10-31">10-K</Source>
                    </FPHeader>
                    <lineItem coaCode="ONET">55256.000000</lineItem>
(...)
                    <lineItem coaCode="SNCC">24311.000000</lineItem>
                </Statement>
            </FiscalPeriod>
            <FiscalPeriod EndDate="2018-09-29" FiscalYear="2018" Type="Annual">
                <Statement Type="INC">
                    <FPHeader>
                        <PeriodLength>52</PeriodLength>
                        <periodType Code="W">Weeks</periodType>
                        <UpdateType Code="UPD">Updated Normal</UpdateType>
                        <AccountingStd/>
                        <StatementDate>2018-09-29</StatementDate>
                        <AuditorName Code="EY">Ernst &amp; Young LLP</AuditorName>
                        <AuditorOpinion Code="UNQ">Unqualified</AuditorOpinion>
                        <Source Date="2018-11-05">10-K</Source>
                    </FPHeader>
                    <lineItem coaCode="SREV">265595.000000</lineItem>
(...)
                    <lineItem coaCode="VDES">12.208930</lineItem>
                </Statement>
                <Statement Type="BAL">
                    <FPHeader>
                        <UpdateType Code="CLA">Reclassified Normal</UpdateType>
                        <StatementDate>2018-12-29</StatementDate>
                        <Source Date="2019-01-30">10-Q</Source>
                    </FPHeader>
                    <lineItem coaCode="ACSH">11575.000000</lineItem>
(...)
                    <lineItem coaCode="STBP">22.533610</lineItem>
                </Statement>
                <Statement Type="CAS">
                    <FPHeader>
                        <PeriodLength>52</PeriodLength>
                        <periodType Code="W">Weeks</periodType>
                        <UpdateType Code="UPD">Updated Normal</UpdateType>
                        <StatementDate>2018-09-29</StatementDate>
                        <AuditorName Code="EY">Ernst &amp; Young LLP</AuditorName>
                        <AuditorOpinion Code="UNQ">Unqualified</AuditorOpinion>
                        <Source Date="2018-11-05">10-K</Source>
                    </FPHeader>
                    <lineItem coaCode="ONET">59531.000000</lineItem>
(...)
                    <lineItem coaCode="SNCC">5624.000000</lineItem>
                </Statement>
            </FiscalPeriod>
            <FiscalPeriod EndDate="2017-09-30" FiscalYear="2017" Type="Annual">
                <Statement Type="INC">
                    <FPHeader>
                        <PeriodLength>53</PeriodLength>
                        <periodType Code="W">Weeks</periodType>
                        <UpdateType Code="UPD">Updated Normal</UpdateType>
                        <AccountingStd/>
                        <StatementDate>2017-09-30</StatementDate>
                        <AuditorName Code="EY">Ernst &amp; Young LLP</AuditorName>
                        <AuditorOpinion Code="UNQ">Unqualified</AuditorOpinion>
                        <Source Date="2017-11-03">10-K</Source>
                    </FPHeader>
                    <lineItem coaCode="SREV">229234.000000</lineItem>
(...)
                    <lineItem coaCode="VDES">9.206750</lineItem>
                </Statement>
                <Statement Type="BAL">
Joan Arau
  • 151
  • 4
  • 14
  • Hi! I think it would be helpful if you'd truncate repeated entries, so that we could see the structure of your entire XML doc. E.g. replace most of the `mapItem` entries with `...`. – Jonny5 Jan 25 '20 at 18:05
  • 1
    thanks for the tip, since I have never really worked with xml I didn't know what parts were important to put here. Edit: took out some of the repeating stuff – Joan Arau Jan 25 '20 at 18:07
  • No problem, basically when tags are repeated, they do not really provide much value if they don't introduce new attributes or children. It is important though to understand whether they are *repeated* fields. Typically XSDs (XML Schema Definitions) or DTDs (Document Type Definitions) are used to describe that structure. See for example [this post](https://stackoverflow.com/a/45742493/787036). XPath might also come in handy. These XML concepts might help you in the future. ;) [This blog post](https://medium.com/@robertopreste/from-xml-to-pandas-dataframes-9292980b1c1c) might help you as well. – Jonny5 Jan 25 '20 at 18:14
  • thank you for these resources. I actually read the blog post before and I think I understand how it works with fixed column headers. My issue is with how to deal with varying column headers and multiple tables in the same xml – Joan Arau Jan 25 '20 at 18:18
  • Do you have some (cleaned up) sample code you can add to your post? That might help us. – Jonny5 Jan 25 '20 at 18:21
  • So far I only have the code to get the xml file from the API, no processing/parsing yet. From what I understand there should be a separate pandas df for each for example. then 2018-12-29 indicates the row and 9489.000000 would be the column: ACSH with value 9489.000000 in row 2018-12-29 – Joan Arau Jan 25 '20 at 18:23
  • Are you sure you want multiple data frames? It would be very useful if you could add what you expect to get out of exactly this xml to your post. I've also formatted the xml code to make it easier to read, but it is awaiting peer review. – Jonny5 Jan 25 '20 at 18:32
  • 1
    Thank you, I approved your edit. I will add a sample table/df – Joan Arau Jan 25 '20 at 18:33
  • Does the edit I have made clarify what I am looking for? – Joan Arau Jan 25 '20 at 20:15
  • Found below given useful. Hope it might help you out.[https://medium.com/@robertopreste/from-xml-to-pandas-dataframes-9292980b1c1c] – Muhammad Hamza Sabir Aug 25 '20 at 18:20

1 Answers1

1

I know its not pretty but this works:

from ib_insync import *
from bs4 import BeautifulSoup as bs
import pandas as pd

ib = IB()
ib.connect('127.0.0.1', 7497, clientId=1)


security = Stock('AAPL', 'SMART', 'USD')

# request the fundamentals
fundamentals = ib.reqFundamentalData(security, reportType='ReportsFinStatements')

soup = bs(fundamentals,'xml')

bal_l = []
inc_l = []
cas_l = []


for period in soup.find_all('FiscalPeriod'):
    if period.get('Type') != "Annual":
        for statement in period.find_all('Statement'):
            if statement.find('UpdateType').get('Code') != 'CLA':
                dic = {}


                t = statement.get('Type')
                d = statement.find('Source').get('Date')
                d1 = statement.find('StatementDate').text
                dic['date'] = d
                dic['StatementDate'] = d1


                for item in statement.find_all('lineItem'):
                    dic[item.get('coaCode')] =item.text


                if t == 'BAL':
                    bal_l.append(dic)
                    print(t, d, dic)
                elif t == 'INC':
                    inc_l.append(dic)
                elif t == 'CAS':
                    cas_l.append(dic)

balancesheet = pd.DataFrame(bal_l).sort_values('date')

with pd.option_context('display.max_rows', 1000, 'display.max_columns', None):
    print(balancesheet)
Joan Arau
  • 151
  • 4
  • 14