
I have a table that has about 1 billion records. I run a query on it to essentially find duplicates. If the result of the query is 0 rows, there are no duplicates, otherwise there are. If there are duplicates, I want to write that table name to a text file. So what I am doing is

df = spark.sql("SELECT count(*) FROM table GROUP BY key1,key2,key3 HAVING count(*) > 1 LIMIT 1)
if df.count() > 0:
    with open('duplicate_tables.txt','a') as file:
        file.write('\n' + table)

On the df.count() line, I get an error like java.io.IOException: No space left on device. Is this because count() is inefficient? I also get the same error when I try using

if len(df.head(1)) != 0:

In my query, I thought (hoped) that adding the LIMIT 1 would help, so it wouldn't have to go through hundreds and hundreds of rows but just check whether the result is empty or not. If I take out the count part, it works fine.

I have seen a few ways to rewrite the count statement (I have gone through How to check if spark dataframe is empty?), but so far I haven't had any luck.
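
For reference, here is the full version with the head(1) check mentioned above (same query and table as before):

df = spark.sql("SELECT count(*) FROM table GROUP BY key1,key2,key3 HAVING count(*) > 1 LIMIT 1")
# only fetch at most one row instead of counting every group
if len(df.head(1)) != 0:
    with open('duplicate_tables.txt','a') as file:
        file.write('\n' + table)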


2 Answers


Spark is lazy. This means that when you run spark.sql() nothing actually happens. You can see this by noticing that spark.sql() "executes" immediately, no matter how complex the SQL is. The actual processing is done when an action is required; in your case, when .count() comes into play. That action is probably what causes the memory issue, due to the complexity of the SQL and the size of the table.
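
A rough way to see this, using the query from your question (the timings are only illustrative):

import time

start = time.time()
df = spark.sql("SELECT count(*) FROM table GROUP BY key1,key2,key3 HAVING count(*) > 1 LIMIT 1")
print(time.time() - start)   # near-instant: Spark only builds the query plan here

start = time.time()
n = df.count()               # the action: the aggregation over the full table actually runs now
print(time.time() - start)   # this is where the time goes and where the job can fail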

Perhaps one other thing you could try is reading the entire table and having Spark check whether there are duplicates. However, given the raw size of your table, this could also lead to memory issues.

df = spark.sql("SELECT * FROM table") # or select particular column(s)
if df.count() != df.dropDuplicates().count():
    with open('duplicate_tables.txt','a') as file:
        file.write('\n' + table)
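
If reading every column is too heavy, a lighter variant of the same idea (assuming the key columns from your query) is to compare counts on just those columns:

keys = spark.sql("SELECT key1, key2, key3 FROM table")  # only the columns that define a duplicate
if keys.count() != keys.dropDuplicates().count():
    with open('duplicate_tables.txt','a') as file:
        file.write('\n' + table)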
  • So I just ran the query by itself in the impala-shell and checked the stats: in total the peak memory seems to be ~4 GB. Where I'm running it, it shows as having 13 GB available. – formicaman Jun 11 '20 at 18:31

I found this to be the most efficient way to determine if a Spark DataFrame is indeed empty:

df.first() is None

after testing it against .limit(1), .head(), .count(), etc.
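
Applied to the duplicate check from the question (same table and key columns as there), that would look something like:

df = spark.sql("SELECT count(*) FROM table GROUP BY key1,key2,key3 HAVING count(*) > 1 LIMIT 1")
if df.first() is not None:   # at least one group has count > 1, i.e. duplicates exist
    with open('duplicate_tables.txt','a') as file:
        file.write('\n' + table)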