
I am working on connecting my Spark application to DashDB. Currently, I can load my data just fine.

However, I am unable to save a DataFrame to DashDB.

Any insight will be helpful.

  var jdbcSets = sqlContext.read
    .format("jdbc")
    .options(Map("url" -> url, "driver" -> driver, "dbtable" -> "setsrankval"))
    .load()
  jdbcSets.registerTempTable("setsOpponentRanked")
  jdbcSets = jdbcSets.coalesce(10)
  sqlContext.cacheTable("setsOpponentRanked")

However, when I try to save large DataFrames, I get the error:

DB2 SQL Error: SQLCODE=-1666, SQLSTATE=42613, SQLERRMC=CLOB, DRIVER=4.19.26

The code I use to save the data is as follows:

  import java.util.Properties
  import org.apache.spark.sql.SaveMode

  val writeproperties = new Properties()
  writeproperties.setProperty("user", "dashXXXX")
  writeproperties.setProperty("password", "XXXXXX")
  writeproperties.setProperty("rowId", "false")
  writeproperties.setProperty("driver", "com.ibm.db2.jcc.DB2Driver")
  results.write.mode(SaveMode.Overwrite).jdbc(writeurl, "players_stat_temp", writeproperties)

An example test data set can be seen here:

println("Test set: "+results.first()) 
Test set: ['Damir DZUMHUR','test','test','test','test','test','test','test','test','test','test','test','test','test','test','test','test','test','test','test','test','test',null,null,null,null,null,null,null]

The DataFrame schema is as follows:

root
 |-- PLAYER: string (nullable = true)
 |-- set01: string (nullable = true)
 |-- set02: string (nullable = true)
 |-- set12: string (nullable = true)
 |-- set01weakseed: string (nullable = true)
 |-- set01medseed: string (nullable = true)
 |-- set01strongseed: string (nullable = true)
 |-- set02weakseed: string (nullable = true)
 |-- set02medseed: string (nullable = true)
 |-- set02strongseed: string (nullable = true)
 |-- set12weakseed: string (nullable = true)
 |-- set12medseed: string (nullable = true)
 |-- set12strongseed: string (nullable = true)
 |-- set01weakrank: string (nullable = true)
 |-- set01medrank: string (nullable = true)
 |-- set01strongrank: string (nullable = true)
 |-- set02weakrank: string (nullable = true)
 |-- set02medrank: string (nullable = true)
 |-- set02strongrank: string (nullable = true)
 |-- set12weakrank: string (nullable = true)
 |-- set12medrank: string (nullable = true)
 |-- set12strongrank: string (nullable = true)
 |-- minibreak: string (nullable = true)
 |-- minibreakweakseed: string (nullable = true)
 |-- minibreakmedseed: string (nullable = true)
 |-- minibreakstrongseed: string (nullable = true)
 |-- minibreakweakrank: string (nullable = true)
 |-- minibreakmedrank: string (nullable = true)
 |-- minibreakstrongrank: string (nullable = true)

I have looked at the JDBC DB2Dialect and see that StringType gets mapped to CLOB. I wonder if the following will help:

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types._

class DB2CustomDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:db2")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Option(JdbcType("VARCHAR(10000)", java.sql.Types.VARCHAR))
    case BooleanType => Option(JdbcType("CHAR(1)", java.sql.Types.CHAR))
    case _           => None
  }
}
  • Override keywords don't work; moreover, after removing this, I still face the same issue. I am using Spark 3 and DB2. – Glarixon Sep 27 '21 at 17:45

3 Answers


This works well after adding a custom dialect:

JdbcDialects.registerDialect(new DB2CustomDialect())
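
For reference, a minimal end-to-end sketch of how the pieces fit together (this assumes the DB2CustomDialect class from the question, as well as the results DataFrame and writeurl, are already in scope; the dialect has to be registered before the write is issued):

  import java.util.Properties
  import org.apache.spark.sql.SaveMode
  import org.apache.spark.sql.jdbc.JdbcDialects

  // Register the dialect once, before any JDBC write, so that string
  // columns are created as VARCHAR(10000) instead of CLOB.
  JdbcDialects.registerDialect(new DB2CustomDialect())

  val writeproperties = new Properties()
  writeproperties.setProperty("user", "dashXXXX")
  writeproperties.setProperty("password", "XXXXXX")
  writeproperties.setProperty("driver", "com.ibm.db2.jcc.DB2Driver")

  results.write.mode(SaveMode.Overwrite).jdbc(writeurl, "players_stat_temp", writeproperties)

With SaveMode.Overwrite the target table is dropped and recreated, so the registered dialect's type mapping determines the column types of the new table.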

Yes, indeed, the current implementation of DB2Dialect is not optimal. We will take a look and will probably create a pull request to change the StringType mapping to VARCHAR.

I guess the thought behind using CLOB in the first place was that it is safer when you want to be able to store strings of any length, whereas VARCHAR is limited at most by the DB2/dashDB page size. But it is unlikely that one will put very long strings into the columns of a DataFrame, and CLOB causes all kinds of operational trouble: for example, it is not supported in column-organized tables, which are the default table type in dashDB, and that is why you hit this problem when trying to write your DataFrame to dashDB. CLOB is also problematic for I/O performance in DB2, since it is not necessarily cached in database memory like other table data.

What you can do as a workaround for now is indeed to register your own custom dialect, as you propose above, with JdbcDialects.registerDialect(), until the mentioned pull request is accepted.

  • I am having exactly the same issue, but I am using PySpark. How can I fix this? – and_apo Jan 19 '17 at 09:29
  • You can apply this fix in PySpark notebooks as well, using the Scala bridge feature of PixieDust. I have written a blog article about the whole problem and solution, with links to sample notebooks, here: http://datascience.ibm.com/blog/working-with-dashdb-in-data-science-experience/ – Torsten Steinbach Jan 20 '17 at 12:03
  • I have seen this article before, I am actually using IBM's spark-submit and not Notebooks/DSX. Are you saying I need to fix my script locally and then submit it in the Spark Cluster? Will the Spark Cluster have all those dependencies installed since it is a managed service? – and_apo Jan 20 '17 at 14:48
  • I haven't verified the fix with spark-submit, but my assumption is that it will work there too, i.e. integrate the JDBC dialect registration via the PixieDust Scala bridge in your Python script. In addition, we are working on updating Spark in Bluemix directly with the dashDB dialect update, so that you won't have to do this workaround in your scripts and notebooks anymore. That fix should go out shortly, so stay tuned. – Torsten Steinbach Jan 22 '17 at 07:13
  • @TorstenSteinbach any update? I just found a bunch of CLOBs using Spark 2.4. Was patch rejected? The blog post you refer to is gone – pauljohn32 Jan 13 '21 at 04:56
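
For the PySpark case discussed in the comments above, here is a rough, unverified sketch of what the PixieDust Scala-bridge approach could look like: a %%scala cell in a Python notebook (this assumes the pixiedust package is installed, that its %%scala cell magic is available, and that the cell runs against the same Spark JVM, so the registration affects JDBC writes issued from Python afterwards):

  %%scala
  // Scala cell executed via PixieDust's Scala bridge in a PySpark notebook
  // (assumption: the %%scala magic is available in this environment).
  // It registers the custom DB2/dashDB dialect so that subsequent
  // df.write.jdbc(...) calls create VARCHAR columns instead of CLOB.
  import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
  import org.apache.spark.sql.types._

  class DB2CustomDialect extends JdbcDialect {
    override def canHandle(url: String): Boolean = url.startsWith("jdbc:db2")
    override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
      case StringType  => Option(JdbcType("VARCHAR(10000)", java.sql.Types.VARCHAR))
      case BooleanType => Option(JdbcType("CHAR(1)", java.sql.Types.CHAR))
      case _           => None
    }
  }
  JdbcDialects.registerDialect(new DB2CustomDialect())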

Note, DSX (i.e. datascience.ibm.com) has now fixed this issue out of the box, so deploying the custom dialect is not required anymore when using notebooks in DSX with dashDB.