1

I have been trying to scrape an XML file and copy the content of only two tags, Code and Source. The XML file looks as follows:

<Series xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RunDate>2018-06-12</RunDate>
  <Instruments>
    <Instrument>
      <Code>27BA1</Code>
      <Source>YYY</Source>
    </Instrument>
    <Instrument>
      <Code>28BA1</Code>
      <Source>XXX</Source>
    </Instrument>
      <Code>29BA1</Code>
      <Source>XXX</Source>
    </Instrument>
      <Code>30BA1</Code>
      <Source>DDD</Source>
    </Instrument>
  </Instruments>
</Series>

I only manage to scrape the first Code. My code is below. Can anyone help?

import xml.etree.ElementTree as ET
import csv

tree = ET.parse("data.xml")
csv_fname = "data.csv"
root = tree.getroot()

f = open(csv_fname, 'w')
csvwriter = csv.writer(f)
count = 0
head = ['Code', 'Source']

csvwriter.writerow(head)

for time in root.findall('Instruments'):
    row = []
    job_name = time.find('Instrument').find('Code').text
    row.append(job_name)
    job_name_1 = time.find('Instrument').find('Source').text
    row.append(job_name_1)
    csvwriter.writerow(row)
f.close()
Anubhav Singh
dps

2 Answers

5

The XML in your post is not well-formed: the third and fourth instruments are missing their opening <Instrument> tags. You can check by pasting the file here: https://www.w3schools.com/xml/xml_validator.asp
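
If you would rather check this from Python than from a website, here is a minimal sketch using only the standard library. The `BROKEN_XML` string is a shortened copy of the XML from the question, and `well_formedness_error` is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question: the second <Code>/<Source>
# pair has a closing </Instrument> but no opening <Instrument>.
BROKEN_XML = """<Series>
  <Instruments>
    <Instrument>
      <Code>28BA1</Code>
      <Source>XXX</Source>
    </Instrument>
      <Code>29BA1</Code>
      <Source>XXX</Source>
    </Instrument>
  </Instruments>
</Series>"""


def well_formedness_error(xml_text):
    """Return None if xml_text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


error = well_formedness_error(BROKEN_XML)
print(error)  # the message includes the line and column of the problem
```

The parser stops at the first stray closing tag, so fix that one and run the check again to find the next.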

I assume the intended XML is:

<Series xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RunDate>2018-06-12</RunDate>
  <Instruments>
    <Instrument>
      <Code>27BA1</Code>
      <Source>YYY</Source>
    </Instrument>
    <Instrument>
      <Code>28BA1</Code>
      <Source>XXX</Source>
    </Instrument>
    <Instrument>
      <Code>29BA1</Code>
      <Source>XXX</Source>
    </Instrument>
    <Instrument>
      <Code>30BA1</Code>
      <Source>DDD</Source>
    </Instrument>
  </Instruments>
</Series>

To print the values of the Code and Source tags:

from lxml import etree

root = etree.parse('data.xml').getroot()
instruments = root.find('Instruments')
# findall() returns every matching child, not just the first one
for instrument in instruments.findall('Instrument'):
    code, source = instrument.find('Code'), instrument.find('Source')
    print(code.text, source.text)
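
As for why the code in the question writes only one row: its loop runs over the single Instruments element, and find('Instrument') returns only the first match. Below is a minimal sketch of the same idea with the standard library's ElementTree and csv modules, looping over each Instrument instead. The inline `XML_TEXT` (two instruments from the corrected XML above) and the `instruments_to_csv` helper name are assumptions for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Sample data: the first two instruments from the corrected XML.
XML_TEXT = """<Series xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RunDate>2018-06-12</RunDate>
  <Instruments>
    <Instrument><Code>27BA1</Code><Source>YYY</Source></Instrument>
    <Instrument><Code>28BA1</Code><Source>XXX</Source></Instrument>
  </Instruments>
</Series>"""


def instruments_to_csv(xml_text, out):
    """Write one CSV row per <Instrument> element to the file-like object out."""
    root = ET.fromstring(xml_text)
    writer = csv.writer(out)
    writer.writerow(["Code", "Source"])
    # Loop over each Instrument, not over the single Instruments wrapper.
    for instrument in root.findall("Instruments/Instrument"):
        writer.writerow([instrument.findtext("Code"), instrument.findtext("Source")])


buffer = io.StringIO()
instruments_to_csv(XML_TEXT, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass `open("data.csv", "w", newline="")` instead of the `StringIO` buffer; `newline=""` stops the csv module from emitting blank lines between rows on Windows.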
Anubhav Singh
0

If you are able to run XSLT against your document - I assume you can - an alternative approach would make this very straightforward:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl"
>
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:text>Code,Source</xsl:text><xsl:text>&#xa;</xsl:text>
    <xsl:apply-templates select="//Instrument"/>
  </xsl:template>
  <xsl:template match="Instrument">
<xsl:value-of select="Code"/>,<xsl:value-of select="Source"/><xsl:text>&#xa;</xsl:text>
</xsl:template>
</xsl:stylesheet>

Note the presence of the <xsl:text>&#xa;</xsl:text> element - this is to insert the line breaks which are semantically important in CSV, but not in XML.

Output:

Code,Source
27BA1,YYY
28BA1,XXX
29BA1,XXX
30BA1,DDD

To run this in Python I guess you'd need something like the approach suggested in this question:

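import lxml.etree as ET

# Example file names (the stylesheet name is made up); substitute your own paths.
xml_filename = "data.xml"
xsl_filename = "instruments.xsl"

dom = ET.parse(xml_filename)
xslt = ET.parse(xsl_filename)
transform = ET.XSLT(xslt)
result = transform(dom)
# The stylesheet uses method="text", so str() gives the CSV text directly.
print(str(result))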

I don't use Python, so I have no idea whether this is correct or not.

Whoops - I also neglected to mention that your XML document is not well-formed - the opening <Instrument> tags are missing after lines 11 and 14 of your XML. Adding these where they belong makes the document transform correctly.

Tom W
  • Hi. I have no idea how to do that. Any guidance would be appreciated. Thanks – dps Jun 14 '18 at 08:56
  • You haven't specified anything about the language or environment you're using. I don't recognise the language in your question - so by extension I also don't know what you're using to execute it. The best way to run a stylesheet against a document depends on your tools - could you please specify in the question. Thanks. – Tom W Jun 14 '18 at 09:03
  • I don't think this is what I'm looking for. I'm just looking for someone to please look at my Python code and tell me what I'm doing wrong. Not looking to use XSLT. Thanks for your help. – dps Jun 14 '18 at 12:45