
I am trying to transfer a table from Postgres to SQL Server. My database is located on a remote server, so I have used the following approach:

import pandas as pd
from pyspark.sql import SparkSession
from sqlalchemy import create_engine

appName = "PySpark PostgreSQL Example - via psycopg2"
# Spark master URL; 7077 is the standalone default port (use "local[*]" to run locally)
master = "spark://11.220.202.11:7077"

spark = SparkSession.builder.master(master).appName(appName).getOrCreate()

# pull the whole table into a pandas dataframe over psycopg2/SQLAlchemy
engine = create_engine(
    "postgresql+psycopg2://postgres:password@11.220.202.11/testdb?client_encoding=utf8")
pdf = pd.read_sql('select * from test_table', engine)

# convert the pandas dataframe to a Spark DataFrame
df = spark.createDataFrame(pdf)

After this step I was planning to dump the data into SQL Server.
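
Something like the following is what I had in mind for the write step, assuming the Microsoft SQL Server JDBC driver is on the Spark classpath; the host, database name, and credentials are placeholders:

```python
# rough sketch of the planned write, not tested; connection details are placeholders
df.write \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://<sqlserver-host>:1433;databaseName=targetdb") \
    .option("dbtable", "dbo.test_table") \
    .option("user", "<user>") \
    .option("password", "<password>") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .mode("append") \
    .save()
```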

But the problem I am facing is probably a memory issue: my system hangs and has shut down twice. I have 8 GB of RAM and the table is around 3 GB. Is there a way to load the data directly into SQL Server instead of loading it into a PySpark dataframe first? Please suggest improvements and alternatives.
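
One alternative I was wondering about is skipping Spark entirely and streaming the table in chunks with pandas, so the whole 3 GB never sits in memory at once. A minimal sketch, assuming pyodbc and the ODBC Driver 17 for SQL Server are installed; the SQL Server connection details are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

pg_engine = create_engine(
    "postgresql+psycopg2://postgres:password@11.220.202.11/testdb?client_encoding=utf8")
# placeholder connection string; fast_executemany speeds up the pyodbc inserts
mssql_engine = create_engine(
    "mssql+pyodbc://<user>:<password>@<sqlserver-host>/targetdb"
    "?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True)

# stream_results uses a server-side cursor, so psycopg2 does not pull the
# whole result set to the client; read_sql then yields one chunk at a time
with pg_engine.connect().execution_options(stream_results=True) as conn:
    for chunk in pd.read_sql("select * from test_table", conn, chunksize=50000):
        chunk.to_sql("test_table", mssql_engine, if_exists="append", index=False)
```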

TIA

Mark
  • you can use the pandas dataframe `to_sql()` as mentioned [here](https://stackoverflow.com/questions/25661754/get-data-from-pandas-into-a-sql-server-with-pyodbc) – samkart Jul 21 '22 at 14:41
  • Avoid loading the data into a pandas data frame; instead, load it into a PySpark data frame using the PostgreSQL JDBC driver and, depending on the source data volume and available memory, implement batching (sketched below) – devesh Jul 21 '22 at 15:19
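
A rough sketch of what that comment describes: reading from PostgreSQL with Spark's JDBC source so the data never goes through pandas. The driver jar path, port, and fetch size below are assumptions:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Postgres to SQL Server")
         # the PostgreSQL JDBC driver jar must be available; path is a placeholder
         .config("spark.jars", "/path/to/postgresql-42.x.x.jar")
         .getOrCreate())

# Spark reads straight from PostgreSQL, pulling rows in batches of `fetchsize`
# instead of materialising the whole table in pandas first
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://11.220.202.11:5432/testdb")
      .option("dbtable", "test_table")
      .option("user", "postgres")
      .option("password", "password")
      .option("driver", "org.postgresql.Driver")
      .option("fetchsize", "10000")
      .load())
```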

0 Answers