
I have a large table (around 300 GB) in PostgreSQL, about 50 GB of RAM, and 8 CPUs.
I want to move my psql table into Google Cloud Storage using Spark and a JDBC connection, very similar to: How to convert a 500GB SQL table into Apache Parquet?

I know my connections work, because I was able to move a small table. But with the large table I run into memory issues. How can I optimize it?

import pyspark
from pyspark.sql import SQLContext
from pyspark import SparkContext
from pyspark.sql import DataFrameReader

conf = pyspark.SparkConf().setAll([
    ("spark.driver.extraClassPath", "/usr/local/bin/postgresql-42.2.5.jar:/usr/local/jar/gcs-connector-hadoop2-latest.jar"),
    ("spark.executor.instances", "8"),
    ("spark.executor.cores", "4"),
    ("spark.executor.memory", "1g"),
    ("spark.driver.memory", "6g"),
    ("spark.memory.offHeap.enabled", "true"),
    ("spark.memory.offHeap.size", "40g"),
])

sc = pyspark.SparkContext(conf=conf)
sc.getConf().getAll()

sc._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile","/home/user/analytics/gcloud_key_name.json")
sqlContext = SQLContext(sc)

url = 'postgresql://address:port/db_name'
properties = {'user': 'user', 'password': 'password'}
df_users = sqlContext.read.jdbc(
    url='jdbc:%s' % url, table='users', properties=properties
)

gcloud_path= "gs://BUCKET/users"
df_users.write.mode('overwrite').parquet(gcloud_path)
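
Is the missing piece partitioning the JDBC read itself? A minimal sketch of what I have in mind (assuming the table has a numeric id column; the bounds and partition count are placeholders, not my real values):

# Sketch: parallel JDBC read split on a numeric column.
# 'id', the bounds and numPartitions are assumptions about my schema/cluster.
df_users = sqlContext.read.jdbc(
    url='jdbc:%s' % url,
    table='users',
    column='id',            # numeric column to split the read on
    lowerBound=1,           # min(id) in the table (placeholder)
    upperBound=300000000,   # max(id) in the table (placeholder)
    numPartitions=64,       # number of parallel JDBC partitions
    properties=properties
)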

Bonus question: can I partition now, or should I first save it as Parquet, then read it back and repartition it?
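
To make that concrete, by "partition now" I mean something like this sketch ('country' is a hypothetical column in my users table, just for illustration):

# Sketch: partition the Parquet output by a column at write time.
# 'country' is a hypothetical column.
df_users.write.mode('overwrite').partitionBy('country').parquet(gcloud_path)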

Bonus question 2: if the answer to bonus question 1 is yes, can I also sort it now, or should I first save it as Parquet, then read it back and sort it?
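
That is, something along these lines (again, 'country' and 'user_id' are hypothetical columns):

# Sketch: repartition and sort within partitions before writing, in one pass.
# 'country' and 'user_id' are hypothetical columns, just for illustration.
df_users.repartition('country') \
    .sortWithinPartitions('user_id') \
    .write.mode('overwrite') \
    .partitionBy('country') \
    .parquet(gcloud_path)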

user1871528
  • Possible duplicate of [How to optimize partitioning when migrating data from JDBC source?](https://stackoverflow.com/questions/52603131/how-to-optimize-partitioning-when-migrating-data-from-jdbc-source) – 10465355 Jan 11 '19 at 17:33
  • Also [Partitioning in spark while reading from RDBMS via JDBC](https://stackoverflow.com/q/43150694/10465355), [How to improve performance for slow Spark jobs using DataFrame and JDBC connection?](https://stackoverflow.com/q/32188295/10465355), [Converting mysql table to spark dataset is very slow compared to same from csv file](https://stackoverflow.com/q/42696455/10465355) and other similar questions. – 10465355 Jan 11 '19 at 17:35

0 Answers