
customer - c_id, c_name, c_address
product - p_id, p_name, price
supplier - s_id, s_name, s_address
orders - o_id, c_id, p_id, quantity, time

SELECT o.o_id,
       c.c_id,
       c.c_name,
       p.p_id,
       p.p_name,
       p.price * o.quantity AS amount
FROM customer c
JOIN orders o ON o.c_id = c.c_id
JOIN product p ON p.p_id = o.p_id;

I want to execute the above query in PySpark without fetching the three tables as individual DataFrames and then joining the DataFrames.

desaiankitb

1 Answer


You can use a query in place of a table name, as described below.

Reference: PySpark Documentation

df = spark.read.jdbc(
        "url", "(query) AS alias",
        properties={"user": "username", "password": "password"})

In your case it will be:

df = spark.read.jdbc("url", """
    (
        SELECT o.o_id,
               c.c_id,
               c.c_name,
               p.p_id,
               p.p_name,
               p.price * o.quantity AS amount
        FROM customer c
        JOIN orders o ON o.c_id = c.c_id
        JOIN product p ON p.p_id = o.p_id
    ) AS subq""", properties={"user": "username", "password": "password"})
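On Spark 2.4 and later, the same pushdown can also be expressed with the JDBC reader's `query` option, which takes the bare SELECT with no parenthesized alias. A sketch, with `"url"`, `"username"`, and `"password"` as placeholders for your own connection details, and the actual read wrapped in a function so nothing connects until you call it:

```python
# Sketch for Spark 2.4+: the JDBC source's "query" option accepts the SELECT
# directly, so no "(...) AS alias" wrapping is needed. The join still runs
# inside the database; Spark receives a single result set.
query = """
SELECT o.o_id,
       c.c_id,
       c.c_name,
       p.p_id,
       p.p_name,
       p.price * o.quantity AS amount
FROM customer c
JOIN orders o ON o.c_id = c.c_id
JOIN product p ON p.p_id = o.p_id
"""

def read_joined(spark):
    # "url", "username", "password" are placeholders for real connection details.
    return (spark.read.format("jdbc")
            .option("url", "url")
            .option("query", query)
            .option("user", "username")
            .option("password", "password")
            .load())
```

Note that `query` and `dbtable` are mutually exclusive options: pass one or the other, not both.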

This answer uses the same pattern, a query in place of a table name. This question is also relevant in your case.
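As a side note, the `(query) AS alias` wrapping is easy to get wrong, and `table` is a reserved word in many databases, so it is safer to use a different alias. A tiny hypothetical helper (not part of the PySpark API) keeps the wrapping in one place:

```python
def as_dbtable(sql: str, alias: str = "subq") -> str:
    """Wrap a SELECT so it can be passed where spark.read.jdbc expects a table name.

    This is a hypothetical convenience function, not part of the PySpark API;
    the alias defaults to "subq" to avoid reserved words like "table".
    """
    return f"({sql.strip()}) AS {alias}"
```

Usage would then be `spark.read.jdbc("url", as_dbtable(my_query), properties=props)`.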

darshil