How to ignore comments while reading an XML file in Pyspark Databricks?

Question

I am trying to read an XML file in an Azure Databricks notebook using PySpark. The problem is that my persons.xml has some comments at the beginning. I just want to ignore them while reading the file.

df = (spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "person")
      .load("src/main/resources/persons.xml"))

My XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!--
<top>
  <t1 attr1="a1">
    <!-- t1 comment -->
    <t2>Something 1</t2>
  </t1>
  <!-- between rows comment -->
  <t1 attr1="a2">
    <t2>Something 2</t2>
  </t1>
</top>
-->
<naman>
  <t1 attr1="a1">
    <t2>Something 1</t2>
  </t1>
  <t1 attr1="a2">
    <t2>Something 2</t2>
  </t1>
</naman>

score 0 · Answer 1 · answered Nov 28 '21 at 09:14


Comments are ignored by default. If you see them in the result, then something unusual is going on. For example, if I have the following XML file:

<!-- top comment -->
<top>
  <t1 attr1="a1">
    <!-- t1 comment -->
    <t2>Something 1</t2>
  </t1>
  <!-- between rows comment -->
  <t1 attr1="a2">
    <t2>Something 2</t2>
  </t1>
</top>

then it can be read as follows, and no comments are captured:

>>> df = spark.read.format("com.databricks.spark.xml") \
  .option("rowTag", "t1").load("1.xml")
>>> df.show()
+------+-----------+
|_attr1|         t2|
+------+-----------+
|    a1|Something 1|
|    a2|Something 2|
+------+-----------+

answered Nov 28 '21 at 09:14

Alex Ott


Actually, the XML I have is a bit different from this. In the XML above, the comments do not have XML tags within them. However, the XML I have is similar to the below: Something 1 Something 2 --> Something 1 Something 2. I want only the tags outside the comment to be read, not the ones within the comment. – Naman Sinha Nov 29 '21 at 06:02
Your XML isn't valid... Provide a sample of XML that you're trying to parse - add to your question – Alex Ott Nov 29 '21 at 07:32
I have added the exact XML in the question. – Naman Sinha Nov 29 '21 at 08:17
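
The XML added to the question nests comments inside a larger commented-out block. The XML specification does not allow `--` inside a comment, so such a file is not well-formed. One possible workaround, sketched here only as an illustration, is to strip the commented-out block in plain Python before handing the file to spark-xml. `strip_nested_comments` is a hypothetical helper, not part of spark-xml or Spark, and it assumes comment markers should be treated as nestable. The sample string mirrors the XML in the question, and the result is checked with the standard-library `xml.etree.ElementTree` rather than with Spark.

```python
import xml.etree.ElementTree as ET


def strip_nested_comments(text: str) -> str:
    """Drop everything between <!-- and -->, counting nested openers.

    Hypothetical helper, not part of spark-xml. It treats comment markers
    as nestable, which the XML specification itself does not.
    """
    kept = []
    depth = 0  # how many <!-- markers are currently open
    i = 0
    while i < len(text):
        if text.startswith("<!--", i):
            depth += 1
            i += 4
        elif depth > 0 and text.startswith("-->", i):
            depth -= 1
            i += 3
        else:
            if depth == 0:
                kept.append(text[i])  # keep only text outside all comments
            i += 1
    return "".join(kept)


# Sample modelled on the XML in the question.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<!--
<top>
  <t1 attr1="a1">
    <!-- t1 comment -->
    <t2>Something 1</t2>
  </t1>
  <!-- between rows comment -->
  <t1 attr1="a2">
    <t2>Something 2</t2>
  </t1>
</top>
-->
<naman>
  <t1 attr1="a1">
    <t2>Something 1</t2>
  </t1>
  <t1 attr1="a2">
    <t2>Something 2</t2>
  </t1>
</naman>
"""

cleaned = strip_nested_comments(sample)

# Sanity check with the standard library instead of Spark.
root = ET.fromstring(cleaned.encode("utf-8"))
rows = [(t1.get("attr1"), t1.findtext("t2")) for t1 in root.iter("t1")]
print(root.tag, rows)
```

The cleaned text could then be written to a new file and read with `.option("rowTag", "t1").load(...)` as in the answer. This scanner would also strip a `<!--` sequence that appears inside a CDATA section, so it only suits files where that cannot happen.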
