0

I am trying to calculate the processRate from the total count of two temp tables but I'm getting the error "Detected implicit cartesian product for INNER join between logical plans" where I am not even performing joins. I am sure this error can be resolved by restructuring the query in correct format and I need your help on it. Below is the query,

spark.sql("""
CREATE OR REPLACE TEMPORARY VIEW final_processRate AS
SELECT
      ((a.total - b.total)/a.total))* 100 AS processRate
FROM
    (select count (*) as total from sales) a,
    (select count (*) as total from sales where status = 'PENDING') b
""")

I'm getting this error while trying to view the data using,

spark.sql("select * from processRate limit 10").show(false)

Can you please help on formatting the above query to resolve this issue and view the data of final_processRate?

mck
  • 40,932
  • 13
  • 35
  • 50
  • *"where I am not even performing joins"* - why do you think this is **not** a join? I mean, you have `FROM a,b` - what else if not a join is this? – UninformedUser Apr 05 '21 at 10:41
  • also, you can enable cross join for Spark, though a cartesian product is supposed to be slow in general:https://stackoverflow.com/a/39000050/4744359 - for Spark 3 it would be enabled by default, so you're using an older version of Spark anyways. Why? – UninformedUser Apr 05 '21 at 10:43

2 Answers2

0

You don't need subquery for this. Just use a conditional aggregation:

spark.sql("""
CREATE OR REPLACE TEMPORARY VIEW final_processRate AS
SELECT
    ((count(*) - count(case when status='PENDING' then 1 end)) / count(*)) * 100 AS processRate
FROM sales
""")

Then you can query the temp view using:

spark.sql("select * from final_processRate")

which should give you a single number/percentage calculated above.

mck
  • 40,932
  • 13
  • 35
  • 50
  • Thanks a lot!! It worked!! Can you please explain the difference between the two queries ? It will be really helpful for me to understand the logic on deeper level. – ShyamKRaman Apr 05 '21 at 10:28
  • Your query involves an implicit cross join between a and b because you have two tables in `from`, which caused the error. – mck Apr 05 '21 at 10:32
0

I would write this as:

select avg(case when status = 'PENDING' then 0.0 else 1 end)
from sales;

This returns the proportion of rows that are not pending.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786