Questions tagged [spark-jdbc]
78 questions
11
votes
1 answer
Spark: Difference between numPartitions in read.jdbc(..numPartitions..) and repartition(..numPartitions..)
I'm confused about the behaviour of the numPartitions parameter in the following methods:
DataFrameReader.jdbc
Dataset.repartition
The official docs for DataFrameReader.jdbc say the following about the numPartitions parameter:
numPartitions:
the…

y2k-shubham
- 10,183
- 11
- 55
- 131
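
A minimal sketch of the two calls in question, assuming a placeholder URL, table, and credentials: numPartitions in read.jdbc controls how many parallel JDBC queries Spark issues while loading, whereas repartition only reshuffles a DataFrame that is already loaded.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-partitions").getOrCreate()

# Partitioned read: Spark issues numPartitions parallel queries, each covering
# a slice of [lowerBound, upperBound) on the partition column.
df = spark.read.jdbc(
    url="jdbc:postgresql://dbhost:5432/shop",    # placeholder URL
    table="orders",                              # placeholder table
    column="order_id",                           # numeric partition column
    lowerBound=1,
    upperBound=1000000,
    numPartitions=8,
    properties={"user": "user", "password": "secret"},
)
print(df.rdd.getNumPartitions())    # typically 8

# repartition: the data is already in Spark; this triggers a shuffle and has no
# effect on how many JDBC queries were used to load it.
df16 = df.repartition(16)
print(df16.rdd.getNumPartitions())  # 16
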
8
votes
1 answer
How to use azure-sqldb-spark connector in pyspark
I want to write around 10 GB of data every day to an Azure SQL Server DB using PySpark. Currently I am using the JDBC driver, which takes hours making insert statements one by one.
I am planning to use the azure-sqldb-spark connector, which claims to turbo boost the…

Ajay Kumar
- 81
- 1
- 3
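
For comparison, a minimal sketch of a plain Spark JDBC write (not the azure-sqldb-spark connector), assuming placeholder URL, table, and credentials: batchsize makes the writer send batched INSERTs instead of one statement per row, and the number of partitions controls how many connections write in parallel.

# Plain JDBC write with batched inserts; df is the DataFrame to be written.
(df.repartition(8)
   .write
   .format("jdbc")
   .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")  # placeholder
   .option("dbtable", "dbo.events")   # placeholder
   .option("user", "user")
   .option("password", "secret")
   .option("batchsize", "10000")      # rows per JDBC batch instead of row-by-row inserts
   .mode("append")
   .save())
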
5
votes
0 answers
How to get the base type of an array type in portable JDBC
If you have a table with a column whose type is SQL ARRAY, how do you find the base type of the array type, aka the type of the individual elements of the array type?
How do you do this in vendor-agnostic pure JDBC?
How do you do this without…

Joshua Maurice
- 51
- 2
5
votes
1 answer
Prepared statement in spark-jdbc
I am trying to read data from an MSSQL database using Spark JDBC with a specified offset, so the data should be loaded only after a specified timestamp, which serves as the offset. I tried to implement it by providing a query in jdbc…

Cassie
- 2,941
- 8
- 44
- 92
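
Spark's JDBC source does not expose prepared statements directly; one common workaround is to inline the offset filter in a subquery passed as dbtable, so the WHERE clause is pushed down to MSSQL. A sketch with placeholder names (note the timestamp is interpolated, not bound as a parameter):

offset_ts = "2021-06-01 00:00:00"   # the stored offset

pushdown_query = f"(SELECT * FROM dbo.events WHERE updated_at > '{offset_ts}') AS src"

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;database=mydb")  # placeholder
      .option("dbtable", pushdown_query)
      .option("user", "user")
      .option("password", "secret")
      .load())
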
4
votes
1 answer
How Spark reads from JDBC and distributes the data
I need clarity about how Spark works under the hood when it comes to fetching data from external databases.
What I understood from the Spark documentation is that if I do not mention attributes like "numPartitions", "lowerBound" and "upperBound", then the read via…

Sukanta Nath
- 41
- 3
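
A quick way to see the default behaviour: without partitionColumn, lowerBound, upperBound and numPartitions, Spark fetches the whole table through a single JDBC connection into one partition (connection details below are placeholders).

df = spark.read.jdbc(
    url="jdbc:mysql://dbhost:3306/retail_db",   # placeholder
    table="orders",
    properties={"user": "user", "password": "secret"},
)
print(df.rdd.getNumPartitions())   # 1: a single task pulls the entire table
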
4
votes
1 answer
Spark JDBC: DataFrameReader fails to read Oracle table with datatype as ROWID
I am trying to read an Oracle table using spark.read.format, and it works great for all tables except a few that have a column with datatype ROWID.
Below is my code:
var df = spark.read.format("jdbc").
option("url", url).
…

Arghya Saha
- 227
- 1
- 4
- 17
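
One workaround that is often suggested is to convert the ROWID inside the query itself, e.g. with Oracle's ROWIDTOCHAR, so Spark only ever sees a plain string column. A sketch with placeholder schema/table names:

rowid_query = "(SELECT ROWIDTOCHAR(ROWID) AS row_id, t.* FROM my_schema.my_table t) q"

df = (spark.read
      .format("jdbc")
      .option("url", url)            # same JDBC URL as above
      .option("dbtable", rowid_query)
      .option("user", "user")
      .option("password", "secret")
      .load())
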
4
votes
2 answers
Pseudocolumn in Spark JDBC
I am using a query to fetch data from MySQL as follows:
var df = spark.read.format("jdbc")
.option("url", "jdbc:mysql://10.0.0.192:3306/retail_db")
.option("driver" ,"com.mysql.jdbc.Driver")
.option("user",…

clear sky
- 43
- 4
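
If the underlying problem is that MySQL has no ROWNUM-style pseudocolumn to partition on, one hedged alternative is the predicates argument of read.jdbc: each WHERE fragment becomes one partition and one JDBC query. The table name and date ranges below are made up:

predicates = [
    "order_date >= '2021-01-01' AND order_date < '2021-07-01'",
    "order_date >= '2021-07-01' AND order_date < '2022-01-01'",
]
df = spark.read.jdbc(
    url="jdbc:mysql://10.0.0.192:3306/retail_db",
    table="orders",                               # placeholder table
    predicates=predicates,                        # one partition per predicate
    properties={"user": "user", "password": "secret",
                "driver": "com.mysql.jdbc.Driver"},
)
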
3
votes
0 answers
Apache Spark write to MySQL with JDBC connector (Write Mode: Ignore) is not performing as expected
I have my tables stored in MySQL with ID as the primary key.
I want to write to MySQL using Spark in such a way that it ignores the rows in the dataframe that already exist in MySQL (based on the primary key) and only writes the new set of rows.
ID (PK) | Name |…

freezthinker
- 51
- 5
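
Part of the mismatch may be that SaveMode.Ignore works at table granularity, not row granularity: if the target table already exists, Spark skips the write entirely rather than filtering out duplicate primary keys. A sketch with placeholder names; row-level de-duplication usually needs a staging table plus INSERT IGNORE or an upsert on the MySQL side.

# mode("ignore"): if `customers` already exists, nothing is written at all;
# Spark does not compare rows against the primary key.
(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://dbhost:3306/retail_db")   # placeholder
   .option("dbtable", "customers")                        # placeholder
   .option("user", "user")
   .option("password", "secret")
   .mode("ignore")
   .save())
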
3
votes
0 answers
Can't join on jdbc tables with common column names in spark 2.3
In an earlier version of Spark I had two SQL tables:
t1: (id, body)
t2: (id, name)
I could query them like:
spark.read.jdbc("t1 inner join t2 on t1.id = t2.id")
.selectExpr("name", "body")
Which would generate the following query:
…

Fletcher Stump Smith
- 107
- 8
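
For reference, a sketch of the usual workaround when the two JDBC tables are loaded separately: join by column name so the duplicate id does not become ambiguous (connection details are placeholders).

props = {"user": "user", "password": "secret"}
t1 = spark.read.jdbc("jdbc:postgresql://dbhost:5432/db", "t1", properties=props)
t2 = spark.read.jdbc("jdbc:postgresql://dbhost:5432/db", "t2", properties=props)

# Passing the join key by name keeps a single `id` column in the result,
# so selecting "name" and "body" is unambiguous.
joined = t1.join(t2, on="id").select("name", "body")
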
2
votes
0 answers
How to properly use foreachBatch() method in PySpark?
I am trying to sink results processed by the Structured Streaming API in Spark to PostgreSQL. I tried the following approach (somewhat simplified, but I hope it's clear):
class Processor:
def __init__(self, args):
self.spark_session =…

papi
- 23
- 4
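
A minimal sketch of the foreachBatch pattern with a JDBC sink; stream_df stands in for the streaming DataFrame from the question, and the URL and table are placeholders. Each micro-batch DataFrame is written with the ordinary batch writer:

def write_to_postgres(batch_df, epoch_id):
    # Invoked once per micro-batch; the JDBC write itself runs like any batch write.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/analytics")  # placeholder
        .option("dbtable", "events")                               # placeholder
        .option("user", "user")
        .option("password", "secret")
        .mode("append")
        .save())

query = (stream_df.writeStream
         .foreachBatch(write_to_postgres)
         .outputMode("update")
         .start())
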
2
votes
1 answer
Apache Spark - passing jdbc connection object to executors
I am creating a JDBC connection object in the Spark driver and I am using it in the executors to access the DB. My concern is: is it the same connection object, or do the executors get a copy of the connection object so that there would be a separate connection per…

Suparn Lele
- 23
- 3
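
Driver-side objects are serialized and shipped to the executors (or fail to serialize at all), so a connection created on the driver is never shared. The usual pattern is to open a connection inside each task, e.g. with foreachPartition; connect() below is a hypothetical helper wrapping whatever database driver is in use.

def save_partition(rows):
    # Runs on an executor: each partition opens and closes its own connection,
    # so nothing tries to reuse the driver's connection object.
    conn = connect()   # hypothetical helper around your DB driver
    try:
        for row in rows:
            conn.execute("INSERT INTO events VALUES (%s, %s)", (row.id, row.value))
        conn.commit()
    finally:
        conn.close()

df.foreachPartition(save_partition)
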
2
votes
1 answer
Spark SQL table read error 'Caused by: org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'unresolvedextractvalue''
I have written sample Java Spark SQL code locally in Eclipse to read data from a remote Databricks database table, as shown below. I have set HADOOP_HOME and included the Spark JDBC driver too, but I still get the error below on every run.
static…

Sai Karthik N
- 21
- 1
- 4
2
votes
1 answer
PySpark pyspark.sql.DataFrameReader.jdbc() doesn't accept datetime type upperbound argument as document says
I found the documentation for the jdbc function in PySpark 3.0.1 at
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader, and it says:
column – the name of a column of numeric, date, or timestamp type that
will be used…

syan
- 165
- 1
- 10
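
The keyword form of DataFrameReader.jdbc() ends up coercing lowerBound/upperBound to integers, which is where datetime arguments fail. Since Spark 2.4 the option-based reader accepts date/timestamp partition columns with string bounds, so one hedged workaround looks like this (connection details are placeholders):

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/shop")   # placeholder
      .option("dbtable", "orders")                           # placeholder
      .option("user", "user")
      .option("password", "secret")
      .option("partitionColumn", "created_at")               # timestamp column
      .option("lowerBound", "2020-01-01 00:00:00")
      .option("upperBound", "2021-01-01 00:00:00")
      .option("numPartitions", "12")
      .load())
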
2
votes
1 answer
Loading data from Oracle table using spark JDBC is extremely slow
I am trying to read 500 million records from a table using Spark JDBC and then perform a join on those tables.
When I execute the SQL from SQL Developer it takes 25 minutes.
But when I load this using Spark JDBC it takes forever; the last time it ran…

Atharv Thakur
- 671
- 3
- 21
- 39
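
Two knobs that often matter here, sketched with placeholder connection details and bounds: parallelise the scan with a partition column, and raise fetchsize, since the Oracle JDBC driver fetches only a handful of rows per round trip by default.

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")  # placeholder
      .option("dbtable", "sales.transactions")                   # placeholder
      .option("user", "user")
      .option("password", "secret")
      .option("partitionColumn", "transaction_id")               # numeric, roughly uniform column
      .option("lowerBound", "1")
      .option("upperBound", "500000000")
      .option("numPartitions", "32")
      .option("fetchsize", "10000")                              # rows per round trip
      .load())
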
2
votes
1 answer
Loading data using sparkJDBCDataset with jars not working
When using a sparkJDBCDataset to load a table using a JDBC connection, I keep running into an error that Spark cannot find my driver. The driver definitely exists on the machine and its directory is specified inside the spark.yml file under…

Weiyi Yin
- 70
- 5
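
Whatever kedro does with spark.yml, the driver jar ultimately has to be on Spark's classpath. A hedged sketch of the equivalent SparkSession configuration with a placeholder jar path; note that spark.driver.extraClassPath only takes effect if it is set before the driver JVM starts (e.g. in spark-defaults.conf, spark.yml, or on spark-submit), not on an already-running session.

from pyspark.sql import SparkSession

jar = "/opt/jars/mssql-jdbc-8.4.1.jre8.jar"   # placeholder path to the JDBC driver

spark = (SparkSession.builder
         .appName("jdbc-load")
         .config("spark.jars", jar)                    # ship the jar to driver and executors
         .config("spark.driver.extraClassPath", jar)   # must be set before the driver JVM starts
         .getOrCreate())
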