I'm working on an AWS Glue ETL job that reads a PySpark DataFrame and should write the data out in XML format. I've searched extensively for a solution; the job fails at the write statement shown below:
df.write.format('com.databricks.spark.xml').options(rowTag='book', rootTag='books').save('newbooks.xml')
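For clarity, the `rowTag`/`rootTag` options in that call describe the XML shape I expect: one `<book>` element per DataFrame row, all wrapped in a single `<books>` root. A plain-Python sketch of that layout (no Spark needed; the sample rows are made up, not from my real job):

```python
# Illustration of the layout rowTag='book', rootTag='books' asks spark-xml
# to produce: one <book> element per row, inside a single <books> root.
import xml.etree.ElementTree as ET

# Placeholder rows standing in for the DataFrame contents.
rows = [
    {"title": "Dune", "author": "Frank Herbert"},
    {"title": "Emma", "author": "Jane Austen"},
]

root = ET.Element("books")               # rootTag='books'
for row in rows:
    book = ET.SubElement(root, "book")   # rowTag='book'
    for col, value in row.items():       # one child element per column
        ET.SubElement(book, col).text = value

print(ET.tostring(root, encoding="unicode"))
```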
The Glue version I'm currently using is Glue 3.0 (Spark 3.1, Scala 2, Python 3). Since I'm trying to use the spark-xml library, I've tried including the following jars as dependencies of the Glue job:
spark-xml_2.10-0.3.5,
spark-xml_2.11-0.7.0,
spark-xml_2.12-0.14.0,
spark-xml_2.13-0.14.0
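As context for how the jars are being attached: the usual mechanism in Glue is the `--extra-jars` special job parameter, which takes a comma-separated list of S3 paths. A rough sketch via the AWS CLI (the job name, role, bucket, and paths below are all placeholders, not my real setup):

```shell
# Sketch: attach a spark-xml jar to a Glue job through the --extra-jars
# job parameter. All names and S3 paths here are placeholders.
aws glue update-job \
  --job-name my-xml-job \
  --job-update '{
    "Role": "my-glue-role",
    "Command": {
      "Name": "glueetl",
      "ScriptLocation": "s3://my-bucket/scripts/job.py"
    },
    "GlueVersion": "3.0",
    "DefaultArguments": {
      "--extra-jars": "s3://my-bucket/jars/spark-xml_2.12-0.14.0.jar"
    }
  }'
```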
The errors I'm seeing with the different jar versions are as follows:
An error occurred while calling o92.save. java.lang.NoClassDefFoundError: scala/runtime/java8/JFunction0$mcD$sp
An error occurred while calling o95.save. java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.mapred.DirectOutputCommitter not found
An error occurred while calling o95.save. scala/$less$colon$less
I found a similar question posted previously and tried the approaches suggested there, but they no longer seem to work. Has anyone faced a similar issue recently? If so, can you shed some light on the resolution?