I would like to extract certain data points from a downloaded XML file (https://s3.amazonaws.com/irs-form-990/201542399349300614_public.xml).

import pandas as pd
import csv
import os
from os import path
from xml.dom import minidom
from xml.etree import ElementTree
import requests
from bs4 import BeautifulSoup
#from IRS_Download import *
import sys

for o in object_id:
    file_name = "" + o + ".xml"
    basepath = path.dirname(__file__)
    filepath = path.abspath(path.join(basepath, file_name))
    dom = minidom.parse(filepath)
    EmIdN = dom.getElementsByTagName('EIN')
    print(EmIdN)

This, however, only returns:

<DOM Element: EIN at 0x1132eecc0>

Any idea what I am doing wrong?

mzjn
Georg
  • You need EmIdN[0].firstChild.nodeValue, see https://stackoverflow.com/questions/317413/get-element-value-with-minidom-with-python – Tim Feb 06 '20 at 22:00
  • Welcome to StackOverflow. See [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). – Prune Feb 06 '20 at 22:12
  • _Any idea, what Im left to do?_ Are we supposed to guess what's wrong, and what you're trying to do? Also, variable and function names should follow the `lower_case_with_underscores` style. – AMC Feb 06 '20 at 23:44
  • Does this answer your question? [How do I parse XML in Python?](https://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python) – MrPickles7 Feb 07 '20 at 02:09
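
The first comment's suggestion can be sketched as follows. This is a minimal illustration, not the asker's actual script: `SAMPLE_XML` is a hypothetical, trimmed-down stand-in for the downloaded return, and `ein_values` is a name introduced here. The point is that `getElementsByTagName` returns element nodes, while the text sits in each element's child text node.

```python
from xml.dom import minidom

# Hypothetical, trimmed-down stand-in for the downloaded return (assumption).
SAMPLE_XML = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader>
    <Filer>
      <EIN>123456789</EIN>
    </Filer>
  </ReturnHeader>
</Return>"""

dom = minidom.parseString(SAMPLE_XML)

# getElementsByTagName returns a list of Element nodes, not their text,
# which is why printing it shows "DOM Element" objects.
ein_elements = dom.getElementsByTagName("EIN")

# The text lives in a child text node of each element.
ein_values = [
    element.firstChild.nodeValue
    for element in ein_elements
    if element.firstChild is not None
]
print(ein_values)
```

With a real file, `minidom.parse(filepath)` would replace `minidom.parseString(SAMPLE_XML)`; the rest should stay the same, assuming each `EIN` element holds plain text.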

1 Answer

I solved it like this now:

import xml.etree.ElementTree as ET

# xml_tree is a path or file object for the XML file; EIN is a list defined earlier.
tree = ET.parse(xml_tree)
root = tree.getroot()
# Print all tags to see the available paths:
# for element in root.iter():
#     print(element)
ein_element = tree.find('.//{http://www.irs.gov/efile}EIN')
if ein_element is not None:
    EIN.append(ein_element.text)
else:
    EIN.append('Null')
Georg
  • The only problem with my solution is that some tags have the same name under different paths. For example, I would like to retrieve tree.find('.//{http://www.irs.gov/efile}PayrollTaxesGrp/TotalAmt').text, but that does not work. Only tree.find('.//{http://www.irs.gov/efile}TotalAmt').text works, and it can return other values as well, because there are many tags named TotalAmt. What can I do? – Georg Feb 13 '20 at 18:31
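
The nested path in the comment above most likely fails because ElementTree needs the namespace on every step of the path, not only the first one: `PayrollTaxesGrp/TotalAmt` looks for a `TotalAmt` child with no namespace. A minimal sketch, assuming the elements share the `http://www.irs.gov/efile` namespace used in the answer; `SAMPLE_XML`, the `irs` prefix, the `OtherGrp` element, and the `find_text` helper are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the downloaded return (assumption).
# OtherGrp is an invented element that also contains a TotalAmt tag.
SAMPLE_XML = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnData>
    <IRS990>
      <OtherGrp>
        <TotalAmt>2000</TotalAmt>
      </OtherGrp>
      <PayrollTaxesGrp>
        <TotalAmt>1000</TotalAmt>
      </PayrollTaxesGrp>
    </IRS990>
  </ReturnData>
</Return>"""

# Map a prefix of our choosing to the namespace URI.
NS = {"irs": "http://www.irs.gov/efile"}

root = ET.fromstring(SAMPLE_XML)


def find_text(element, path, default="Null"):
    """Return the text of the first match for path, or default if absent."""
    found = element.find(path, NS)
    return found.text if found is not None else default


# Every step carries the prefix, so only the TotalAmt under PayrollTaxesGrp matches.
payroll_total = find_text(root, ".//irs:PayrollTaxesGrp/irs:TotalAmt")

# A path that matches nothing falls back to the default instead of raising.
missing = find_text(root, ".//irs:NoSuchGrp/irs:TotalAmt")

print(payroll_total, missing)
```

The same path can be written without a prefix map by repeating the braces on each step, e.g. `'.//{http://www.irs.gov/efile}PayrollTaxesGrp/{http://www.irs.gov/efile}TotalAmt'`; the prefix map just keeps it shorter.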