
customer - c_id, c_name, c_address
product - p_id, p_name, price
supplier - s_id, s_name, s_address
orders - o_id, c_id, p_id, quantity, time

SELECT o.o_id,
       c.c_id,
       c.c_name,
       p.p_id,
       p.p_name,
       p.price * o.quantity AS amount
FROM customer c
JOIN orders o ON o.c_id = c.c_id
JOIN product p ON p.p_id = o.p_id;

I want to execute the above query in PySpark without fetching the three tables as individual DataFrames and then joining the DataFrames.

desaiankitb

1 Answer


You can use a query in place of a table name, as described below.

Reference: PySpark Documentation

df = spark.read.jdbc(
        "url", "(query) AS alias",
        properties={"user": "username", "password": "password"})

In your case it will be:

df = spark.read.jdbc("url", """
    (
        SELECT o.o_id,
               c.c_id,
               c.c_name,
               p.p_id,
               p.p_name,
               p.price * o.quantity AS amount
        FROM customer c
        JOIN orders o ON o.c_id = c.c_id
        JOIN product p ON p.p_id = o.p_id
    ) AS subq""", properties={"user": "username", "password": "password"})
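On Spark 2.4 and later, the same pushdown can also be expressed with the JDBC reader's `query` option, which takes the bare SELECT with no parenthesized alias. A sketch, with `"url"`, `"username"`, and `"password"` as placeholders for your own connection details, and the actual read wrapped in a function so nothing connects until you call it:

```python
# Sketch for Spark 2.4+: the JDBC source's "query" option accepts the SELECT
# directly, so no "(...) AS alias" wrapping is needed. The join still runs
# inside the database; Spark receives a single result set.
query = """
SELECT o.o_id,
       c.c_id,
       c.c_name,
       p.p_id,
       p.p_name,
       p.price * o.quantity AS amount
FROM customer c
JOIN orders o ON o.c_id = c.c_id
JOIN product p ON p.p_id = o.p_id
"""

def read_joined(spark):
    # "url", "username", "password" are placeholders for real connection details.
    return (spark.read.format("jdbc")
            .option("url", "url")
            .option("query", query)
            .option("user", "username")
            .option("password", "password")
            .load())
```

Note that `query` and `dbtable` are mutually exclusive options: pass one or the other, not both.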

This answer uses the same pattern, a query in place of a table name. This question is also relevant in your case.
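As a side note, the `(query) AS alias` wrapping is easy to get wrong, and `table` is a reserved word in many databases, so it is safer to use a different alias. A tiny hypothetical helper (not part of the PySpark API) keeps the wrapping in one place:

```python
def as_dbtable(sql: str, alias: str = "subq") -> str:
    """Wrap a SELECT so it can be passed where spark.read.jdbc expects a table name.

    This is a hypothetical convenience function, not part of the PySpark API;
    the alias defaults to "subq" to avoid reserved words like "table".
    """
    return f"({sql.strip()}) AS {alias}"
```

Usage would then be `spark.read.jdbc("url", as_dbtable(my_query), properties=props)`.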

darshil