I have a table with about 1 billion records. I run a query on it to essentially find duplicates: if the query returns 0 rows, there are no duplicates; otherwise there are. If there are duplicates, I want to append that table's name to a text file. So what I am doing is:
    df = spark.sql("SELECT count(*) FROM table GROUP BY key1, key2, key3 HAVING count(*) > 1 LIMIT 1")
    if df.count() > 0:
        with open('duplicate_tables.txt', 'a') as file:
            file.write('\n' + table)
On the df.count() line, I get an error like java.io.IOException: No space left on device. Is this because count() is inefficient? I also get the same error when I try using

    if len(df.head(1)) != 0:
In my query, I thought (hoped) that adding LIMIT 1 would help, so that it wouldn't have to go through hundreds and hundreds of rows; it only needs to check whether the result is empty or not. If I take out the count part, it works fine.
I have seen a few ways to rewrite the count statement (I have gone through How to check if spark dataframe is empty?), but so far I haven't had any luck.