I have an EXTREMELY large document following this structure (>5,000 instances of this):

<Questions>
    <QuestionID>558013</QuestionID>
    <Question>All of the following materials are categorized as &lt;chr8220&gt;fine art&lt;chr8221&gt; EXCEPT</Question>
    <Answer1>textiles</Answer1>
    <Answer2>paintings</Answer2>
    <Answer3>drawings</Answer3>
    <Answer4>sculptures</Answer4>
    <Answer5>architecture</Answer5>
    <AnswerGuide>Textile is not included in the category of fine art. Traditionally, textiles have been categorized as craft art.</AnswerGuide>
    <TypeID>1</TypeID>
    <Source>6,1,3</Source>
    <Footnote />
    <CardTypeID>0</CardTypeID>
    <Year>2016</Year>
    <SubjectID>41</SubjectID>
    <QuesNumber>4</QuesNumber>
    <AuxNum>4</AuxNum>
    <RandList>43512</RandList>
    <ResourceTypeID>382</ResourceTypeID>
    <TreeKey>01/01/01/</TreeKey>
    <TestID>41901</TestID>
    <DiffShort>N</DiffShort>
    <CardType />
</Questions>

I have no need for the fields TypeID through CardType, and removing them would make the file far easier to work with. Currently, I'm just using Notepad++ to edit this XML, and I can't figure out an easy way to remove all of those fields and their contents. Is it possible to do so? Ideally, it would simplify the above to:

<Questions>
    <QuestionID>558013</QuestionID>
    <Question>All of the following materials are categorized as &lt;chr8220&gt;fine art&lt;chr8221&gt; EXCEPT</Question>
    <Answer1>textiles</Answer1>
    <Answer2>paintings</Answer2>
    <Answer3>drawings</Answer3>
    <Answer4>sculptures</Answer4>
    <Answer5>architecture</Answer5>
    <AnswerGuide>Textile is not included in the category of fine art. Traditionally, textiles have been categorized as craft art.</AnswerGuide>
</Questions>
Liam Wilson
  • One way to achieve this is by using regex. You can select the part you need and then wrap that inside your tag to make a new XML file. One useful regex for your case (Python compatible): (\d+)\n\s+(.+)\n\s+(\w+)\n\s+(\w+)\n\s+(\w+)\n\s+(\w+)\n\s+(\w+)\n\s+(.+) – caped114 Dec 26 '16 at 21:00
  • Oh, no, @caped114! One de facto rule of modern programming (somewhere etched on stone tablets) is not to run [regex on X/HTML](http://stackoverflow.com/a/1732454/1422451) documents, as these are not regular languages. – Parfait Dec 27 '16 at 00:05

1 Answer

Consider XSLT, a declarative, special-purpose language designed to transform XML files. Below are two approaches. Save either one as an .xsl file and apply it to your .xml file. XSLT stylesheets are themselves well-formed XML and can be parsed like any other XML document.

Keep Desired Nodes (keeps only the child nodes of Questions with 'Question' or 'Answer' in their names)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- Identity transform -->
   <xsl:template match="@* | node()">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()" />
      </xsl:copy>
   </xsl:template>

   <!-- Questions template -->
   <xsl:template match="Questions">
     <xsl:copy>
      <xsl:copy-of select="*[contains(name(),'Question') or contains(name(),'Answer')]"/>
     </xsl:copy>
   </xsl:template>

</xsl:stylesheet>

Remove Undesired Nodes (removes all child nodes of Questions without 'Question' or 'Answer' in their names)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- Identity transform -->
   <xsl:template match="@* | node()">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()" />
      </xsl:copy>
   </xsl:template>

   <!-- Empty template -->
   <xsl:template match="Questions/*[not(contains(name(),'Question')) and not(contains(name(),'Answer'))]"/>

</xsl:stylesheet>

How to Run XSL Scripts?

Notepad++ is only an editor, not an XSLT processor. Most general-purpose languages offer XSLT 1.0 processors through extensions or libraries, including Java, C#, Perl, Python, PHP, and VB. Dedicated processors such as Xalan and Saxon can also run XSLT scripts, and Saxon supports the newer 2.0 and 3.0 versions. Command-line shells such as Windows PowerShell and Unix Bash can invoke these processors as well. The xsltproc command-line tool comes pre-installed on most Linux and macOS systems.
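As one illustration of the library route, here is a minimal Python sketch that applies the "Remove Undesired Nodes" stylesheet. It assumes the third-party lxml package is installed (pip install lxml). The function name apply_stylesheet, the Root wrapper element, and the shortened sample record are all hypothetical, chosen only for the demonstration; the question does not show the document's real root element.

```python
from lxml import etree  # third-party package, assumed to be installed

# The "Remove Undesired Nodes" stylesheet from this answer.
XSLT_SOURCE = b"""<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:template match="@* | node()">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()" />
      </xsl:copy>
   </xsl:template>
   <xsl:template match="Questions/*[not(contains(name(),'Question')) and not(contains(name(),'Answer'))]"/>
</xsl:stylesheet>"""

# Hypothetical, shortened sample; "Root" is an assumed wrapper element.
SAMPLE_XML = b"""<Root>
<Questions>
    <QuestionID>558013</QuestionID>
    <Answer1>textiles</Answer1>
    <TypeID>1</TypeID>
    <Footnote />
</Questions>
</Root>"""


def apply_stylesheet(xml_bytes, xslt_bytes):
    """Parse both inputs, run the transform, and return the result as text."""
    transform = etree.XSLT(etree.fromstring(xslt_bytes))
    result = transform(etree.fromstring(xml_bytes))
    return etree.tostring(result, encoding="unicode")


if __name__ == "__main__":
    print(apply_stylesheet(SAMPLE_XML, XSLT_SOURCE))
```

For a real file, etree.parse("input.xml") and etree.parse("remove.xsl") could replace the in-memory byte strings; both file names are placeholders.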

Warnings

XSLT processing tends to be memory-intensive because the entire document is typically read into and held in memory. It works well on smaller files but does not scale well to very large ones. However, if you have sufficient RAM, roughly 5X the size of the XML document (a rough estimate), you should be able to run the transformation in a reasonable amount of time. And if you split the large document into smaller pieces, XSLT will run more smoothly still.
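If memory does become the bottleneck, a streaming parser is one alternative to XSLT (a different technique, not part of the stylesheets above). The sketch below uses Python's standard-library xml.etree.ElementTree.iterparse to process one Questions record at a time and applies the same name test as the stylesheets. It assumes the records are direct children of a single root element with no namespaces or root attributes; strip_fields and the Root wrapper in the demo are hypothetical names.

```python
import io
import xml.etree.ElementTree as ET

KEEP = ("Question", "Answer")  # same substrings the stylesheets test for


def strip_fields(source, target, record_tag="Questions"):
    """Stream records from `source` (path or binary file) to `target` (text file)."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    target.write(f"<{root.tag}>\n")  # simplification: root attributes are dropped
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            for child in list(elem):
                if not any(key in child.tag for key in KEEP):
                    elem.remove(child)
            if len(elem):
                elem[-1].tail = "\n"  # tidy indentation before the closing tag
            elem.tail = "\n"
            target.write(ET.tostring(elem, encoding="unicode"))
            root.clear()  # release records that have already been written
    target.write(f"</{root.tag}>\n")


if __name__ == "__main__":
    # Hypothetical miniature input; "Root" is an assumed wrapper element.
    demo = io.BytesIO(
        b"<Root>\n<Questions>\n    <QuestionID>1</QuestionID>\n"
        b"    <TypeID>1</TypeID>\n</Questions>\n</Root>"
    )
    out = io.StringIO()
    strip_fields(demo, out)
    print(out.getvalue())
```

Because each record is discarded after it is written, memory use should stay roughly proportional to one record rather than to the whole file.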

Parfait