Questions tagged [spark-jdbc]

78 questions
11
votes
1 answer

Spark: Difference between numPartitions in read.jdbc(..numPartitions..) and repartition(..numPartitions..)

I'm puzzled by the difference in behaviour of the numPartitions parameter between the following methods: DataFrameReader.jdbc and Dataset.repartition. The official docs of DataFrameReader.jdbc say the following about the numPartitions parameter: numPartitions: the…
y2k-shubham
  • 10,183
  • 11
  • 55
  • 131
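
A minimal Scala sketch of the contrast, assuming a table employees with a numeric id column (URL, credentials, and names are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("jdbc-partitions").getOrCreate()

    // In read.jdbc, numPartitions controls the scan itself: Spark splits the
    // [lowerBound, upperBound) range of partitionColumn into numPartitions
    // WHERE clauses and runs that many parallel JDBC queries.
    val viaRead = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")
      .option("dbtable", "employees")
      .option("user", "user").option("password", "****")
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "100000")
      .option("numPartitions", "8")
      .load()

    // repartition only reshuffles rows that are already in Spark; the source
    // was still read over a single connection.
    val viaShuffle = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")
      .option("dbtable", "employees")
      .option("user", "user").option("password", "****")
      .load()            // one JDBC query, one partition
      .repartition(8)    // shuffle after the fact

The first parallelizes the database scan; the second only redistributes rows after a single-threaded read.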
8
votes
1 answer

How to use azure-sqldb-spark connector in pyspark

I want to write around 10 GB of data every day to an Azure SQL Server DB using PySpark. Currently I am using the JDBC driver, which takes hours making insert statements one by one. I am planning to use the azure-sqldb-spark connector, which claims to turbo boost the…
Ajay Kumar
  • 81
  • 1
  • 3
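
The connector exposes a Scala/Java API, so the write step is often done in Scala (calling it from PySpark means going through the JVM gateway). A sketch of its bulk-copy path, with config keys as I recall them from the connector's README; treat the names and values as assumptions to verify against the project docs:

    import org.apache.spark.sql.SparkSession
    import com.microsoft.azure.sqldb.spark.config.Config
    import com.microsoft.azure.sqldb.spark.connect._

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")  // stand-in for the 10 GB frame

    // Bulk copy in batches instead of row-by-row INSERTs.
    val config = Config(Map(
      "url"               -> "myserver.database.windows.net",  // illustrative
      "databaseName"      -> "mydb",
      "dbTable"           -> "dbo.target_table",
      "user"              -> "user",
      "password"          -> "****",
      "bulkCopyBatchSize" -> "100000",
      "bulkCopyTableLock" -> "true",
      "bulkCopyTimeout"   -> "600"
    ))

    df.bulkCopyToSqlDB(config)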
5
votes
0 answers

How to get the base type of an array type in portable JDBC

If you have a table with a column whose type is SQL ARRAY, how do you find the base type of the array type, aka the type of the individual elements of the array type? How do you do this in vendor-agnostic pure JDBC? How do you do this without…
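
There may be no metadata-only portable answer, but if at least one non-NULL value exists, the java.sql.Array object itself reports its base type through spec-defined methods. A sketch (URL and names illustrative):

    import java.sql.DriverManager

    val conn = DriverManager.getConnection("jdbc:postgresql://host:5432/db", "user", "****")
    try {
      val st = conn.createStatement()
      st.setMaxRows(1)  // portable way to cap the fetch at one row
      val rs = st.executeQuery("SELECT array_col FROM my_table")
      if (rs.next()) {
        val arr = rs.getArray(1)  // java.sql.Array
        if (arr != null) {
          println(s"base type code: ${arr.getBaseType}")      // a java.sql.Types constant
          println(s"base type name: ${arr.getBaseTypeName}")  // driver-reported name
        }
      }
    } finally conn.close()

The need for an actual non-NULL row is exactly the gap that makes a pure-metadata answer elusive.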
5
votes
1 answer

Prepared statement in spark-jdbc

I am trying to read data from an MSSQL database using Spark JDBC with a specified offset, so that only data after a specified timestamp (the offset) is loaded. I tried to implement it by providing a query in jdbc…
Cassie
  • 2,941
  • 8
  • 44
  • 92
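
Spark's JDBC source has no bind-parameter mechanism, so the usual workaround is to inline the offset literal into a pushdown subquery passed as dbtable (names and the URL are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.getOrCreate()

    val offset = "2021-06-01 00:00:00"  // the stored offset (illustrative value)

    // Only rows after the offset are fetched; the filter runs inside MSSQL.
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://host:1433;databaseName=db")
      .option("dbtable", s"(SELECT * FROM events WHERE updated_at > '$offset') AS t")
      .option("user", "user")
      .option("password", "****")
      .load()

Because the value is interpolated rather than bound, it should come from trusted state (e.g. a checkpoint), never from user input.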
4
votes
1 answer

How Spark reads from JDBC and distributes the data

I need clarity about how Spark works under the hood when it comes to fetching data from external databases. What I understood from the Spark documentation is that, if I do not mention attributes like "numPartitions", "lowerBound" and "upperBound", then the read via…
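
A sketch of the default behaviour the question asks about, with illustrative names; the partition count can be checked directly:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.getOrCreate()

    // Without partitionColumn/lowerBound/upperBound/numPartitions, Spark opens
    // a single connection, issues a single SELECT, and the whole table lands
    // in one partition:
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://host:3306/db")
      .option("dbtable", "orders")
      .option("user", "user").option("password", "****")
      .load()

    println(df.rdd.getNumPartitions)  // 1

    // When the four options ARE set, lowerBound/upperBound only decide how the
    // partitionColumn range is split across tasks; rows outside the bounds are
    // still read, via the open-ended first and last WHERE clauses.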
4
votes
1 answer

Spark JDBC: DataFrameReader fails to read Oracle table with datatype as ROWID

I am trying to read an Oracle table using spark.read.format, and it works great for all tables except a few that have a column with datatype ROWID. Below is my code: var df = spark.read.format("jdbc"). option("url", url). …
Arghya Saha
  • 227
  • 1
  • 4
  • 17
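
One workaround (an assumption about the fix, not something the question confirms) is to convert the ROWID to a character type inside a pushdown subquery, so Spark's JDBC type mapping never sees the ROWID type. ROWIDTOCHAR is Oracle SQL; names are illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.getOrCreate()

    val df = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//host:1521/service")
      .option("dbtable", "(SELECT ROWIDTOCHAR(t.ROWID) AS row_id, t.* FROM my_table t)")
      .option("user", "user")
      .option("password", "****")
      .load()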
4
votes
2 answers

Pseudocolumn in Spark JDBC

I am using a query to fetch data from MySQL as follows: var df = spark.read.format("jdbc") .option("url", "jdbc:mysql://10.0.0.192:3306/retail_db") .option("driver" ,"com.mysql.jdbc.Driver") .option("user",…
clear sky
  • 43
  • 4
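
If the underlying goal is to partition a read on a column the table doesn't naturally expose (my assumption about the question's intent), one trick is to derive a numeric pseudocolumn in the pushdown subquery and partition on that:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.getOrCreate()

    // bucket is a derived pseudocolumn; table and column names are illustrative.
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://10.0.0.192:3306/retail_db")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", "(SELECT o.*, MOD(o.order_id, 8) AS bucket FROM orders o) t")
      .option("user", "user").option("password", "****")
      .option("partitionColumn", "bucket")
      .option("lowerBound", "0")
      .option("upperBound", "8")
      .option("numPartitions", "8")
      .load()
      .drop("bucket")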
3
votes
0 answers

Apache Spark write to MySQL with JDBC connector (Write Mode: Ignore) is not performing as expected

I have my tables stored in MySQL with ID as the primary key. I want to write from Spark to MySQL in a way that ignores the rows in the DataFrame that already exist in MySQL (based on the primary key) and only writes the new rows. ID (PK) | Name |…
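
The catch is that SaveMode.Ignore is table-level (skip the whole write if the table already exists), not row-level. For per-row dedup on the primary key, one workaround is a manual INSERT IGNORE per partition; a sketch with illustrative names:

    import java.sql.DriverManager
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.getOrCreate()
    val df = spark.table("staged_rows")  // stand-in for the DataFrame to write

    df.rdd.foreachPartition { rows =>
      val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "****")
      // MySQL's INSERT IGNORE skips rows whose primary key already exists.
      val ps = conn.prepareStatement("INSERT IGNORE INTO target (id, name) VALUES (?, ?)")
      try {
        rows.foreach { r =>
          ps.setLong(1, r.getAs[Long]("id"))
          ps.setString(2, r.getAs[String]("name"))
          ps.addBatch()
        }
        ps.executeBatch()
      } finally { ps.close(); conn.close() }
    }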
3
votes
0 answers

Can't join on JDBC tables with common column names in Spark 2.3

In an earlier version of Spark I had two SQL tables, t1: (id, body) and t2: (id, name), and I could query them like: spark.read.jdbc("t1 inner join t2 on t1.id = t2.id") .selectExpr("name", "body") which would generate the following query: …
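
One way to keep this working (a workaround sketch, not the accepted fix) is to disambiguate the duplicate column inside the pushdown query itself, so the schema Spark sees has no clashing names:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.getOrCreate()

    val joined = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")  // illustrative
      .option("dbtable",
        """(SELECT t1.id AS t1_id, t1.body, t2.name
            FROM t1 INNER JOIN t2 ON t1.id = t2.id) AS j""")
      .option("user", "user").option("password", "****")
      .load()
      .selectExpr("name", "body")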
2
votes
0 answers

How to properly use foreachBatch() method in PySpark?

I am trying to sink results processed by the Structured Streaming API in Spark to PostgreSQL. I tried the following approach (somewhat simplified, but I hope it's clear): class Processor: def __init__(self, args): self.spark_session =…
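
For reference, the Scala shape of the same idea; binding the handler to a typed val also sidesteps a foreachBatch overload ambiguity some Scala 2.12 builds hit (names and URL illustrative):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder.getOrCreate()
    val stream = spark.readStream.format("rate").load()  // stand-in streaming source

    // foreachBatch hands each micro-batch to ordinary batch code, so the
    // plain JDBC writer works inside it.
    val writeBatch: (DataFrame, Long) => Unit = (batch, _) =>
      batch.write.format("jdbc")
        .option("url", "jdbc:postgresql://host:5432/db")
        .option("dbtable", "sink_table")
        .option("user", "user").option("password", "****")
        .mode("append")
        .save()

    val query = stream.writeStream.foreachBatch(writeBatch).start()
    query.awaitTermination()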
2
votes
1 answer

Apache Spark - passing jdbc connection object to executors

I am creating a JDBC object in the Spark driver and using it in the executors to access the DB. My concern is: is it the same connection object, or does each executor get a copy of the connection object, so there would be a separate connection per…
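
For context: a java.sql.Connection is not serializable, so a driver-side connection can't be shipped to executors at all; a closure that captures one fails with NotSerializableException. The standard pattern is one connection per partition, sketched here with illustrative names:

    import java.sql.DriverManager
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.getOrCreate()
    val df = spark.table("some_table")  // stand-in for the data being processed

    df.rdd.foreachPartition { rows =>
      // Created on the executor, inside the task: one connection per partition.
      val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "****")
      try rows.foreach { row => /* use conn for this row */ }
      finally conn.close()
    }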
2
votes
1 answer

Spark SQL table read error 'Caused by: org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'unresolvedextractvalue''

I have written sample Java Spark SQL code locally in Eclipse to read data from a remote Databricks database table, like below. I have set HADOOP_HOME and included the Spark JDBC driver too, but I still get the error below on every run. static…
2
votes
1 answer

PySpark pyspark.sql.DataFrameReader.jdbc() doesn't accept a datetime-type upperBound argument as the documentation says

I found in the documentation for the jdbc function in PySpark 3.0.1 at https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader that it says: column – the name of a column of numeric, date, or timestamp type that will be used…
syan
  • 165
  • 1
  • 10
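
Since Spark 2.4, the option-string form of the reader accepts date/timestamp bounds directly, which sidesteps the typed jdbc(...) signature the question runs into (names and URL illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.getOrCreate()

    val df = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")
      .option("dbtable", "events")
      .option("user", "user").option("password", "****")
      .option("partitionColumn", "created_at")  // timestamp column
      .option("lowerBound", "2020-01-01 00:00:00")
      .option("upperBound", "2021-01-01 00:00:00")
      .option("numPartitions", "12")
      .load()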
2
votes
1 answer

Loading data from Oracle table using spark JDBC is extremely slow

I am trying to read 500 million records from a table using Spark JDBC and then perform a join on those tables. When I execute the SQL from SQL Developer it takes 25 minutes, but when I load it using Spark JDBC it takes forever; the last time it ran…
Atharv Thakur
  • 671
  • 3
  • 21
  • 39
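
Two levers usually dominate JDBC read throughput from Oracle: parallelizing the scan and raising the fetch size (the Oracle driver's default is only 10 rows per round trip). A sketch with illustrative values:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.getOrCreate()

    val df = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//host:1521/service")
      .option("dbtable", "big_table")
      .option("user", "user").option("password", "****")
      .option("partitionColumn", "id")          // a roughly uniform numeric key
      .option("lowerBound", "1")
      .option("upperBound", "500000000")
      .option("numPartitions", "64")            // 64 parallel range queries
      .option("fetchsize", "10000")             // rows per network round trip
      .load()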
2
votes
1 answer

Loading data using sparkJDBCDataset with jars not working

When using a sparkJDBCDataset to load a table over a JDBC connection, I keep running into the error that Spark cannot find my driver. The driver definitely exists on the machine, and its directory is specified inside the spark.yml file under…
Weiyi Yin
  • 70
  • 5
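
Independent of the tool-level config (spark.yml), the driver jar has to reach both the driver and executor classpaths at the Spark level. A generic sketch (paths and class names are illustrative); note that the driver-side classpath generally must be set before the driver JVM starts (spark-defaults.conf or spark-submit), which is a common reason an in-code setting appears to be ignored:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("jdbc-load")
      .config("spark.jars", "/opt/jars/mssql-jdbc-8.4.1.jre8.jar")  // shipped to executors
      .getOrCreate()

    val df = spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://host:1433;databaseName=db")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .option("dbtable", "my_table")
      .option("user", "user").option("password", "****")
      .load()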