
I was using UcanAccess 5.0.1 in Databricks 9.1 LTS (Spark 3.1.2, Scala 2.12), and for whatever reason, when I use the following code to read a single-record Access DB table, it keeps treating the column names as the record itself (I've tried adding more records and got the same result).

The Access db table looks like this (2 records):

ID  Field1  Field2
2   key1    column2
3   key2    column2-1

The python code looks like this:

connectionProperties = {
  "driver" : "net.ucanaccess.jdbc.UcanaccessDriver"
}
 
url = "jdbc:ucanaccess:///dbfs/mnt/internal/Temporal/Database1.accdb"
df = spark.read.jdbc(url=url, table="Table1", properties=connectionProperties)

And the result looks like this:

df.printSchema()
df.count()

root
 |-- Field1: string (nullable = true)
 |-- Field2: string (nullable = true)
 |-- ID: string (nullable = true)
Out[21]: 2

df.show()
+------+------+---+
|Field1|Field2| ID|
+------+------+---+
|Field1|Field2| ID|
|Field1|Field2| ID|
+------+------+---+

Any ideas/suggestions?


2 Answers


If your data has column names in the first row, you can try header=True to set the first row as column headers.

Sample code –

df = spark.read.jdbc(url=url, table="Table1", header=True, properties=connectionProperties)

But if your data does not have column headers, you need to define the column names explicitly and then assign them as column headers.

Sample code –

columns = ["column_name_1", "column_name_2", "column_name_3"]
df = spark.read.jdbc(url=url, table="Table1", properties=connectionProperties).toDF(*columns)

You can also refer to this answer by Alberto Bonsanto.

Reference - https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/sql-databases#read-data-from-jdbc


Turns out that there was a bug in the JDBC dialect handling (https://stackoverflow.com/questions/63177736/spark-read-as-jdbc-return-all-rows-as-columns-name): Spark's default dialect quotes column names with double quotes, and UcanAccess interprets a double-quoted identifier like "Field1" as a string literal, so every row comes back as the column names themselves.
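
To see why, you can ask Spark which dialect it resolves for a UcanAccess URL and how that dialect quotes identifiers (a minimal sketch; the URL is the one from the question):

%scala
import org.apache.spark.sql.jdbc.JdbcDialects

// No built-in dialect matches "jdbc:ucanaccess", so Spark falls back to the
// default quoting, which wraps identifiers in double quotes. The generated
// query then looks like SELECT "Field1","Field2","ID" FROM Table1, which
// UcanAccess evaluates as three string literals per row.
val dialect = JdbcDialects.get("jdbc:ucanaccess:///dbfs/mnt/internal/Temporal/Database1.accdb")
println(dialect.quoteIdentifier("Field1"))  // prints "Field1" (double-quoted)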

I added the following code and now the UcanAccess driver works fine:

%scala
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.jdbc.JdbcDialects

// Dialect for UcanAccess URLs: quote identifiers with backticks instead of
// Spark's default double quotes, which UcanAccess reads as string literals.
private case object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:ucanaccess")
  override def quoteIdentifier(colName: String): String = {
    // Quote each dot-separated part on its own, e.g. a.b -> `a`.`b`
    colName.split('.').map(part => s"`$part`").mkString(".")
  }
}

// Register the dialect before calling spark.read.jdbc
JdbcDialects.registerDialect(HiveDialect)
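
For completeness, a Scala equivalent of the original Python read, run after the dialect is registered (a sketch using the same URL and table from the question):

%scala
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:ucanaccess:///dbfs/mnt/internal/Temporal/Database1.accdb")
  .option("dbtable", "Table1")
  .option("driver", "net.ucanaccess.jdbc.UcanaccessDriver")
  .load()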

Then display(df) would show

|Field1|Field2   |ID |
|:-----|:--------|:--|
|key1  |column2  |2  |
|key2  |column2-1|3  |