1

I am loading a xml file using com.databricks.spark.xml and i want to read a tag attribute using the sql context .

XML :

<Receipt>
<Sale>
<DepartmentID>PR</DepartmentID>
<Tax TaxExempt="false" TaxRate="10.25"/>
</Sale>
</Receipt>

Loaded the file by,

val df = sqlContext.read.format("com.databricks.spark.xml").option("rowTag","Receipt").load("/home/user/sale.xml");
df.registerTempTable("SPtable");

Printing the Schema:

root
 |-- Sale: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- DepartmentID: long (nullable = true)
 |    |    |-- Tax: string (nullable = true)

Now i want to extract the tag attribute TaxExempt from Tax.I tried the following code and it is giving me error .

val tax =sqlContext.sql("select Sale.Tax.TaxExempt from SPtable");

Error:

org.apache.spark.sql.AnalysisException: cannot resolve 'Sale.Tax[TaxExempt]' due to data type mismatch: argument 2 requires integral type, however, 'TaxExempt' is of string type.; line 1 pos 7

Any help is highly Appreciated.

ZygD
  • 22,092
  • 39
  • 79
  • 102
prasannads
  • 609
  • 2
  • 14
  • 28

1 Answers1

5

First print schema of the dataframe, in my case it is printed like below with spark xml version 0.3.3

|-- Sale: struct (nullable = true)
|    |-- DepartmentID: string (nullable = true)
|    |-- Tax: struct (nullable = true)
|    |    |-- #VALUE: string (nullable = true)
|    |    |-- @TaxExempt: boolean (nullable = true)
|    |    |-- @TaxRate: double (nullable = true)

Then use the below query to select xml attributes, after registering the temptable

sqlContext.sql("select Sale.Tax['@TaxRate'] as TaxRate from temptable").show();

Below is the Result

| TaxRate|

+-----+

|10.25|

Starting from 0.4.1, i think the attributes by default starts with underscore(_), in this case just use _ instead of @ while querying attributes.

SanthoshPrasad
  • 1,165
  • 9
  • 11
  • Thank you . I figured out the version problem and was able to print the schema as you showed here. Your select Sale.Tax['@TaxRate'] helped me solve my problem. Thanks a lot :) – prasannads Nov 23 '16 at 11:41
  • how to fetch the same if it comes under the 'root'? – prasannads Nov 24 '16 at 08:49
  • when reading xml set attribute prefix to some fixed value using option("attributePrefix", "_") and then when selecting you can directly select the root attribute like any other element, for example, select _TaxRate from temptable – SanthoshPrasad Nov 24 '16 at 10:54