
I can read data from an Oracle database on the master node using this code:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder
      .master("local[4]")
      .config("spark.executor.memory", "8g")
      .config("spark.executor.cores", 4)
      .config("spark.task.cpus", 1)
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    val jdbcDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcldb")
      .option("dbtable", "table")
      .option("user", "orcl")
      .option("password", "********")
      .load()

Then I can repartition the DataFrame across the workers:

    import org.apache.spark.sql.functions.col

    val test = jdbcDF.repartition(8, col("ID_Col"))
    test.explain

My issue is that the data is huge and cannot fit in the master node's RAM. Because of that, I want each node to read its own share of the data separately. Is there any way to read data from the database on every worker and load it into a Spark DataFrame? In other words, I want to load data into a Spark DataFrame on each worker node separately, using Scala or Python.

Would you please guide me on how I can do that?

Any help is really appreciated.

M_Gh

1 Answer


With local mode you do not have a resource manager like YARN. You have no workers, but you can still run tasks in parallel on a single machine, provided local[n] is set suitably for a machine with N cores.

You will not be loading everything onto the master if you follow the advice of Alex Ott and read up on partitioned JDBC reads.

You can improve loading speed by passing the partitionColumn, lowerBound, upperBound, and numPartitions options when reading with spark.read.jdbc, using the cores on the one machine rather than executors on workers. That is what local means and how Spark works.
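As a minimal sketch of such a partitioned read, reusing the connection details from the question (the assumption that ID_Col is numeric and spans roughly 1 to 1,000,000 is mine, not the asker's):

    // Hedged sketch of a partitioned JDBC read. The bounds below are assumed;
    // in practice, query MIN(ID_Col)/MAX(ID_Col) first and plug those in.
    val partitionedDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcldb")
      .option("dbtable", "table")
      .option("user", "orcl")
      .option("password", "********")
      .option("partitionColumn", "ID_Col") // must be numeric, date, or timestamp
      .option("lowerBound", "1")           // assumed MIN(ID_Col)
      .option("upperBound", "1000000")     // assumed MAX(ID_Col)
      .option("numPartitions", "8")        // 8 concurrent JDBC connections
      .load()

Spark turns this into numPartitions concurrent queries, each covering one stride of the [lowerBound, upperBound] range, so no single process has to pull the whole table through one connection.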

If you need a different partitioning, you then have to do a subsequent re-partition.
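Continuing the sketch above, that subsequent re-partition might look like this (the target column name is hypothetical):

    // Re-partition the loaded DataFrame by whatever key the job actually needs.
    val byKey = partitionedDF.repartition(8, col("Other_Col"))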

If you have enough memory and disk, processing will be slower, but it will complete.

thebluephantom
  • Dear @thebluephantom, thank you for your answer. I read the link from Alex Ott's advice. Would you please guide me on whether I can use `lowerBound` and `upperBound` in each worker node with different values? I mean, each worker would have its own `lowerBound` and `upperBound`. – M_Gh Oct 17 '21 at 13:30
  • See https://medium.com/@radek.strnad/tips-for-using-jdbc-in-apache-spark-sql-396ea7b2e3d3 – thebluephantom Oct 17 '21 at 13:54
  • How did it all go? – thebluephantom Oct 18 '21 at 13:58
  • the column type is String. I must distribute the data based on that String column. I should convert the String column to Integer in a way that does not change the meaning of the data. – M_Gh Oct 18 '21 at 14:07
  • That may not be possible. I would select another column that is an int, date, or timestamp and then repartition. – thebluephantom Oct 18 '21 at 14:29
  • the String column is very important for distributing the data. So, I must find a way to convert it to Integer. – M_Gh Oct 18 '21 at 14:40
  • No, not really, as you can use something else and then re-partition. That seems to be a hard thing to follow, but it ain't that bad (one way to derive a numeric key from the string column is sketched below). – thebluephantom Oct 18 '21 at 14:48
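One possible approach following up on the comment thread, sketched under assumptions (the table and column names are illustrative, and ORA_HASH bucketing is just one way to derive a numeric key from a string): push a subquery into dbtable that adds a numeric hash of the string column, then partition the read on that derived column.

    // Sketch: derive a numeric partition key from the string column on the
    // Oracle side. ORA_HASH(expr, 7) yields buckets 0..7 (8 buckets).
    // "String_Col" and "table" are placeholders for the real schema.
    val hashedQuery =
      "(SELECT t.*, ORA_HASH(String_Col, 7) AS part_key FROM table t) tmp"

    val hashedDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcldb")
      .option("dbtable", hashedQuery)
      .option("user", "orcl")
      .option("password", "********")
      .option("partitionColumn", "part_key")
      .option("lowerBound", "0")   // smallest hash bucket
      .option("upperBound", "8")   // one past the largest bucket (7)
      .option("numPartitions", "8")
      .load()

Rows with the same String_Col value always hash to the same bucket, so the read stays grouped by the string key even though the JDBC partition column itself is numeric.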