Q: Since I don't want to write my own parser, my question is, is there
some simple way of parsing this in Scala/Spark using some library?
AFAIK there is no such API. You have to map over the records and parse each one yourself (cleaning out any special characters in it), then transform the result into multiple columns.
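To make the "map and parse" idea concrete, here is a minimal sketch in plain Scala (standard library only). It is not a full SGML parser. The object and helper names (SgmlFieldSketch, cleanRecord, tagText, attr) are made up for illustration, and it assumes the special characters to clean are the DOCTYPE declaration and numeric character references such as &#3;, which may or may not be what your files contain.

```scala
import scala.util.matching.Regex

object SgmlFieldSketch {
  // Hypothetical helper: drop the DOCTYPE declaration and numeric character
  // references (e.g. &#3;). Assumption: these are the "special characters"
  // that need cleaning in the raw file.
  def cleanRecord(raw: String): String =
    raw.replaceAll("<!DOCTYPE[^>]*>", "").replaceAll("&#\\d+;", "").trim

  // Hypothetical helper: trimmed text between <tag> and </tag>, if present.
  // Nested markup is returned as-is, e.g. "<D>uk</D>" for PLACES.
  def tagText(record: String, tag: String): Option[String] = {
    val q = Regex.quote(tag)
    val pattern = new Regex("(?s)<" + q + ">(.*?)</" + q + ">")
    pattern.findFirstMatchIn(record).map(_.group(1).trim)
  }

  // Hypothetical helper: value of an attribute such as NEWID="3001".
  def attr(record: String, name: String): Option[String] = {
    val pattern = new Regex("\\b" + Regex.quote(name) + "=\"([^\"]*)\"")
    pattern.findFirstMatchIn(record).map(_.group(1))
  }
}
```

Each record string could then be turned into a tuple or case class with calls like SgmlFieldSketch.attr(rec, "NEWID") and SgmlFieldSketch.tagText(rec, "TITLE"), and the same functions could be applied inside a Spark map over the records.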
I tried it the way shown below, but your XML shows up as a corrupt record in the DataFrame.
Further pointer: https://github.com/databricks/spark-xml
import java.io.File

import org.apache.commons.io.FileUtils
import org.apache.spark.sql.SparkSession

/**
  * Created by Ram Ghadiyaram
  */
object SparkXmlWithDtd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .appName(this.getClass.getName)
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    // Sample Reuters SGML record, including its DOCTYPE declaration
    val str =
"""
|<!DOCTYPE lewis SYSTEM "lewis.dtd">
|
|<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="19419" NEWID="3001">
|<DATE> 9-MAR-1987 04:58:41.12</DATE>
|<TOPICS><D>money-fx</D></TOPICS>
|<PLACES><D>uk</D></PLACES>
|<PEOPLE></PEOPLE>
|<ORGS></ORGS>
|<EXCHANGES></EXCHANGES>
|<COMPANIES></COMPANIES>
|<UNKNOWN>
|RM
|f0416reute
|b f BC-U.K.-MONEY-MARKET-SHO 03-09 0095</UNKNOWN>
|<TEXT>
|<TITLE>U.K. MONEY MARKET SHORTAGE FORECAST AT 250 MLN STG</TITLE>
|<DATELINE> LONDON, March 9 - </DATELINE><BODY>The Bank of England said it forecast a
|shortage of around 250 mln stg in the money market today.
| Among the factors affecting liquidity, it said bills
|maturing in official hands and the treasury bill take-up would
|drain around 1.02 billion stg while below target bankers'
|balances would take out a further 140 mln.
| Against this, a fall in the note circulation would add 345
|mln stg and the net effect of exchequer transactions would be
|an inflow of some 545 mln stg, the Bank added.
| REUTER
|</BODY></TEXT>
|</REUTERS>
""".stripMargin

    // Write the sample to a local file so that spark-xml can load it
    val f = new File("sgmtest.sgm")
    FileUtils.writeStringToFile(f, str, "UTF-8")

    val xml_df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "REUTERS")
      .load(f.getAbsolutePath)
    xml_df.printSchema()

    // Show the data twice: once through a temp view, once directly
    xml_df.createOrReplaceTempView("XML_DATA")
    spark.sql("SELECT * FROM XML_DATA").show(false)
    xml_df.show(false)
  }
}
Result:
root
 |-- _corrupt_record: string (nullable = true)

+----------------------------------------------------------------+
|_corrupt_record                                                 |
+----------------------------------------------------------------+
|
9-MAR-1987 04:58:41.12
money-fx
uk
RM
f0416reute
b f BC-U.K.-MONEY-MARKET-SHO 03-09 0095
U.K. MONEY MARKET SHORTAGE FORECAST AT 250 MLN STG
LONDON, March 9 - The Bank of England said it forecast a
shortage of around 250 mln stg in the money market today.
Among the factors affecting liquidity, it said bills
maturing in official hands and the treasury bill take-up would
drain around 1.02 billion stg while below target bankers'
balances would take out a further 140 mln.
Against this, a fall in the note circulation would add 345
mln stg and the net effect of exchequer transactions would be
an inflow of some 545 mln stg, the Bank added.
REUTER
|
+----------------------------------------------------------------+

The second show() call prints the identical table again.
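Since the whole record ends up in a single corrupt column, one possible workaround is to skip the XML reader and pull the fields out with regular expressions. The sketch below is an assumption-laden illustration, not a tested recipe for the full Reuters files. It assumes Spark 2.4 or later (for the text source's lineSep option), that every record ends with </REUTERS>, and that the four fields shown are the ones you want. The names ReutersColumnsSketch and parse are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_extract, trim}

object ReutersColumnsSketch {
  // Hypothetical helper: one row per <REUTERS> record, one column per field.
  def parse(spark: SparkSession, path: String): DataFrame = {
    val records = spark.read
      .option("lineSep", "</REUTERS>") // split the file on the closing row tag
      .text(path)                      // yields a single string column "value"
      .filter(col("value").contains("<REUTERS")) // drop trailing leftovers

    // (?s) lets "." match newlines, since fields may span several lines
    records.select(
      regexp_extract(col("value"), "NEWID=\"(\\d+)\"", 1).as("newid"),
      trim(regexp_extract(col("value"), "(?s)<DATE>(.*?)</DATE>", 1)).as("date"),
      trim(regexp_extract(col("value"), "(?s)<TITLE>(.*?)</TITLE>", 1)).as("title"),
      trim(regexp_extract(col("value"), "(?s)<BODY>(.*?)</BODY>", 1)).as("body")
    )
  }
}
```

Calling ReutersColumnsSketch.parse(spark, f.getAbsolutePath).show(false) on the sample file above should give one row with newid, date, title and body columns. Fields that are missing from a record come back as empty strings, because regexp_extract returns an empty string when the pattern does not match.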