
I have a PostgreSQL database with tables containing millions of records. I'm working with AWS Glue, and I want to filter the data at the SQL server (database) level, but when I execute toDF() the DynamicFrame fetches all the records.

DataSource0 = glueContext.create_dynamic_frame.from_catalog(
    database=dataSourceCatalogDataBase,
    table_name=dataSourceCatalogTableName,
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="DataSource0",
)

DataSource0.toDF()

Is there any way to filter the data at the SQL server level?

Thanks

I tried additional_options.
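
For example, an attempt along these lines (a sketch; sampleQuery is the Glue JDBC connection option that accepts a custom SQL statement, and the table name and WHERE clause here are illustrative, not from my catalog):

# Sketch: passing a custom SQL statement through additional_options.
# "sampleQuery" is the Glue JDBC option for custom SQL; the table name
# and filter below are illustrative placeholders.
DataSource0 = glueContext.create_dynamic_frame.from_catalog(
    database=dataSourceCatalogDataBase,
    table_name=dataSourceCatalogTableName,
    redshift_tmp_dir=args["TempDir"],
    additional_options={"sampleQuery": "SELECT * FROM my_table WHERE id > 1000"},
    transformation_ctx="DataSource0",
)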

– ms12
1 Answer


Here's one way to convert a Glue DynamicFrame to a Spark DataFrame containing the result of a SQL query.

DataSource0 = glueContext.create_dynamic_frame.from_catalog(
    database=dataSourceCatalogDataBase,
    table_name=dataSourceCatalogTableName,
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="DataSource0",
)

DataSource0.toDF().createOrReplaceTempView("TempViewName")

df = spark.sql("select * from TempViewName where <condition>")

Related SO: How does createOrReplaceTempView work in Spark?
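
Note that this approach filters in Spark after the table has been read. As the comments below point out, if the goal is to have PostgreSQL itself execute the filter, a plain Spark JDBC read with a query option is an alternative. A minimal sketch (host, port, database, credentials, and the pushed_at column are placeholders taken from the comment):

# Sketch: push the WHERE clause down to PostgreSQL via the Spark JDBC reader.
# Connection details and the pushed_at column are placeholders.
df1 = (spark.read.format("jdbc")
       .option("url", "jdbc:postgresql://<host>:<port>/<database>")
       .option("query", "SELECT * FROM my_table WHERE pushed_at > '2023-01-03'")
       .option("user", "<user>")
       .option("password", "<password>")
       .load())

This bypasses the Glue catalog, so DynamicFrame-specific features such as job bookmarks do not apply.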

– lsc
  • Hi, what worked for us is: df1 = spark.read.format("jdbc").option("url", "jdbc:postgresql://xxxxx/yyyyy").option("query", "SELECT * FROM table WHERE pushed_at > '2023-01-03'").option("user", "xxx").option("password", "xxx").load() – Roberto Jan 12 '23 at 14:32
  • This reads directly into a Spark DataFrame (different from the question's DynamicFrame approach). In addition, reading DataSource0 as a DynamicFrame enables Glue features such as [job bookmarks](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html). – lsc Jan 12 '23 at 16:10