0

We load data from oracle into a dataset: like this.

val dataset = sqlContext.read.format("jdbc").options(Map(
  "driver" -> applicationConfig.getString("oracle.driver"),
  "url" -> applicationConfig.getString("oracle.url"),
  "user" -> applicationConfig.getString("oracle.user"),
  "password" -> applicationConfig.getString("oracle.password"),
  "dbtable" -> query
)).load().as[CaseClass]

CaseClass looks like:

case class CaseClass (
  RELNR: Long = null,
  INS_CONTACTHIST_DATE: Timestamp = null,
  CONTACTDATETIME: Timestamp = null,
  CONTACTSTATUSID: Long = null,
  ...

I want to create a new DataSet[CaseClass]

import sqlContext.implicits._
val acc = sqlContext.createDataset[CaseClass](Seq())

and fill it in a few iterations with filtered data from dataset:

 val possibilities = dataset.filter(c => predicate(c))
 acc.union(possibilities)

This fails with an error: unresolved operator 'Union; From SO I learned that this has to do with incompatible datasets, and doing a printSchema() on both datasets confirms that some columntypes are incompatible:

Oracle:
|-- RELNR: decimal(10,0) (nullable = true)
|-- INS_CONTACTHIST_DATE: date (nullable = true)
|-- CONTACTDATETIME: timestamp (nullable = true)
|-- CONTACTSTATUSID: decimal(19,0) (nullable = true)

empty dataset:
|-- RELNR: long (nullable = true)
|-- INS_CONTACTHIST_DATE: timestamp (nullable = true)
|-- CONTACTDATETIME: timestamp (nullable = true)
|-- CONTACTSTATUSID: long (nullable = true)

How can I make the union work? or how can I force the population by the sqlcontext.read(..) to use the CaseClass' property types?

dr jerry
  • 9,768
  • 24
  • 79
  • 122
  • Why don't you change your case class instead of forcing the population by the sqlContext? – Ramesh Maharjan May 30 '17 at 11:54
  • @RameshMaharjan It has crossed my mind by I don't want to modify too much in the existing code, I was hoping for other, less intrusive solutions. – dr jerry May 30 '17 at 12:46
  • can you post your `CaseClass` as you posted your `ContactHistoryExtended`? – Ramesh Maharjan May 30 '17 at 12:49
  • The CaseClass was meant as some kind of obfuscation of our business logic. I modified the code, The case class itself is much longer but the disclosed attributes nicely and completely demonstrate the problem. – dr jerry May 30 '17 at 13:10
  • I tried to force the schema with `.schema(StructType(Array( StructField("RELNR", LongType, false), StructField("INS_CONTACTHIST_DATE", BigDecimalType, false), StructField("CONTACTDATETIME", TimestampType, false), StructField("CONTACTSTATUSID", LongType, false)))).load()` but that fails because of ` jdbc does not allow user-specified schemas.;` I have the feeling that @RameshMaharjan solution is the only one left :-( – dr jerry Jun 06 '17 at 08:39

2 Answers2

1

I tried to create the empty dataset with the case class you provided in your question

case class CaseClass (
                       RELNR: Long = null,
                       INS_CONTACTHIST_DATE: Timestamp = null,
                       CONTACTDATETIME: Timestamp = null,
                       CONTACTSTATUSID: Long = null
                     )

And I tried

import sqlContext.implicits._
val acc = sqlContext.createDataset[CaseClass](Seq())
acc.printSchema()

But unfortunately I got following error

Error:(246, 38) an expression of type Null is ineligible for implicit conversion
                       RELNR: Long = null,
Error:(246, 38) type mismatch;
 found   : Null(null)
 required: Long
                       RELNR: Long = null,
Error:(249, 48) an expression of type Null is ineligible for implicit conversion
                       CONTACTSTATUSID: Long = null
Error:(249, 48) type mismatch;
 found   : Null(null)
 required: Long
                       CONTACTSTATUSID: Long = null

Then I tried

case class CaseClass (
                       RELNR: Decimal = null,
                       INS_CONTACTHIST_DATE: java.sql.Date = null,
                       CONTACTDATETIME: Timestamp = null,
                       CONTACTSTATUSID: Decimal = null
                     )

This worked with following schema

root
 |-- RELNR: decimal(38,18) (nullable = true)
 |-- INS_CONTACTHIST_DATE: date (nullable = true)
 |-- CONTACTDATETIME: timestamp (nullable = true)
 |-- CONTACTSTATUSID: decimal(38,18) (nullable = true)

This matches with your Oracle schema

Oracle:
|-- RELNR: decimal(10,0) (nullable = true)
|-- INS_CONTACTHIST_DATE: date (nullable = true)
|-- CONTACTDATETIME: timestamp (nullable = true)
|-- CONTACTSTATUSID: decimal(19,0) (nullable = true)

So union should be possible then.

Ramesh Maharjan
  • 41,071
  • 6
  • 69
  • 97
0

This SO answer basically nails it down. I modified the creation of the empty dataset like:

sqlContext.createDataset[CaseClass](Seq()).selectExpr(
  "ROW_ID",
  "cast (RELNR as Decimal(10,0)) RELNR",
  ...
).as[CaseClass]
dr jerry
  • 9,768
  • 24
  • 79
  • 122