
I can read data from an Oracle database on the master node using this code:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder
      .master("local[4]")
      .config("spark.executor.memory", "8g")
      .config("spark.executor.cores", 4)
      .config("spark.task.cpus", 1)
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    val jdbcDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcldb")
      .option("dbtable", "table")
      .option("user", "orcl")
      .option("password", "********")
      .load()

Then I can repartition the DataFrame across the workers:

    import org.apache.spark.sql.functions.col

    val test = jdbcDF.repartition(8, col("ID_Col"))
    test.explain

My issue is that the data is huge and cannot fit in the master node's RAM. Because of that, I want each node to read its own share of the data separately. Is there any way to read data from the database on every worker and load it into a Spark DataFrame? In other words, I want to load data into a Spark DataFrame on each worker node separately, using Scala or Python.

Would you please guide me on how I can do that?

Any help is really appreciated.

M_Gh

1 Answer


With local mode you do not have a resource manager like YARN. You have no workers, but you can still run tasks in parallel on a single machine, provided local[n] is set suitably for a machine with N cores.

You will not be loading everything onto the master if you follow the advice of Alex Ott and read up on partitioned JDBC reads.

You can improve loading speed by passing the partitionColumn, lowerBound, upperBound, and numPartitions options when reading with spark.read.jdbc, using the cores on the one machine rather than executors on workers. That is what local means and how Spark works.
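As a minimal sketch of such a partitioned read, reusing the connection details from the question (the assumption that ID_Col is numeric and spans roughly 1 to 1,000,000 is mine, not the asker's):

    // Hedged sketch of a partitioned JDBC read. The bounds below are assumed;
    // in practice, query MIN(ID_Col)/MAX(ID_Col) first and plug those in.
    val partitionedDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcldb")
      .option("dbtable", "table")
      .option("user", "orcl")
      .option("password", "********")
      .option("partitionColumn", "ID_Col") // must be numeric, date, or timestamp
      .option("lowerBound", "1")           // assumed MIN(ID_Col)
      .option("upperBound", "1000000")     // assumed MAX(ID_Col)
      .option("numPartitions", "8")        // 8 concurrent JDBC connections
      .load()

Spark turns this into numPartitions concurrent queries, each covering one stride of the [lowerBound, upperBound] range, so no single process has to pull the whole table through one connection.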

If you need a different partitioning, you then have to do a subsequent re-partition.
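Continuing the sketch above, that subsequent re-partition might look like this (the target column name is hypothetical):

    // Re-partition the loaded DataFrame by whatever key the job actually needs.
    val byKey = partitionedDF.repartition(8, col("Other_Col"))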

If you have enough memory and disk, processing will be slower, but it will complete.

thebluephantom
  • Dear @thebluephantom, thank you for your answer. I read the link from Alex Ott's advice. Would you please guide me on whether I can use `lowerBound` and `upperBound` in each worker node with different values? I mean, each worker would have its own `lowerBound` and `upperBound`. – M_Gh Oct 17 '21 at 13:30
  • See https://medium.com/@radek.strnad/tips-for-using-jdbc-in-apache-spark-sql-396ea7b2e3d3 – thebluephantom Oct 17 '21 at 13:54
  • How did it all go? – thebluephantom Oct 18 '21 at 13:58
  • the column type is String. I must distribute the data based on that String column. I should convert the String column to Integer in a way that does not change the meaning of the data. – M_Gh Oct 18 '21 at 14:07
  • That may not be possible. I would select another column that is an int, date, or timestamp and then repartition. – thebluephantom Oct 18 '21 at 14:29
  • the String column is very important for distributing the data. So, I must find a way to convert it to Integer. – M_Gh Oct 18 '21 at 14:40
  • No, not really, as you can use something else and then re-partition. That seems to be a hard thing to follow, but it ain't that bad (one way to derive a numeric key from the string column is sketched below). – thebluephantom Oct 18 '21 at 14:48
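One possible approach following up on the comment thread, sketched under assumptions (the table and column names are illustrative, and ORA_HASH bucketing is just one way to derive a numeric key from a string): push a subquery into dbtable that adds a numeric hash of the string column, then partition the read on that derived column.

    // Sketch: derive a numeric partition key from the string column on the
    // Oracle side. ORA_HASH(expr, 7) yields buckets 0..7 (8 buckets).
    // "String_Col" and "table" are placeholders for the real schema.
    val hashedQuery =
      "(SELECT t.*, ORA_HASH(String_Col, 7) AS part_key FROM table t) tmp"

    val hashedDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcldb")
      .option("dbtable", hashedQuery)
      .option("user", "orcl")
      .option("password", "********")
      .option("partitionColumn", "part_key")
      .option("lowerBound", "0")   // smallest hash bucket
      .option("upperBound", "8")   // one past the largest bucket (7)
      .option("numPartitions", "8")
      .load()

Rows with the same String_Col value always hash to the same bucket, so the read stays grouped by the string key even though the JDBC partition column itself is numeric.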