
I have a table that has about 1 billion records. I run a query on it to essentially find duplicates. If the result of the query is 0 rows, there are no duplicates, otherwise there are. If there are duplicates, I want to write that table name to a text file. So what I am doing is

df = spark.sql("SELECT count(*) FROM table GROUP BY key1,key2,key3 HAVING count(*) > 1 LIMIT 1)
if df.count() > 0:
    with open('duplicate_tables.txt','a') as file:
        file.write('\n' + table)

On the df.count() line, I get an error like java.io.IOException: No space left on device. Is this because count() is inefficient? I also get the same error when I try using

if len(df.head(1)) != 0:

In my query, I thought (hoped) that adding the LIMIT 1 would help, so it wouldn't have to go through hundreds and hundreds of rows but just check whether the result is empty or not. If I take out the count part, it works fine.

I have seen a few ways to rewrite the count statement (I have gone through How to check if spark dataframe is empty?), but so far I haven't had any luck.
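
For reference, here is the full version with the head(1) check mentioned above (same query and table as before):

df = spark.sql("SELECT count(*) FROM table GROUP BY key1,key2,key3 HAVING count(*) > 1 LIMIT 1")
# only fetch at most one row instead of counting every group
if len(df.head(1)) != 0:
    with open('duplicate_tables.txt','a') as file:
        file.write('\n' + table)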


2 Answers


Spark is lazy. This means that when you run spark.sql() nothing actually happens. You can see this by noticing that spark.sql() "executes" immediately, no matter how complex the SQL is. The actual processing is done when an action is required; in your case, when .count() comes into play. That action is probably what causes the memory issue, due to the complexity of the SQL and the size of the table.
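
A rough way to see this, using the query from your question (the timings are only illustrative):

import time

start = time.time()
df = spark.sql("SELECT count(*) FROM table GROUP BY key1,key2,key3 HAVING count(*) > 1 LIMIT 1")
print(time.time() - start)   # near-instant: Spark only builds the query plan here

start = time.time()
n = df.count()               # the action: the aggregation over the full table actually runs now
print(time.time() - start)   # this is where the time goes and where the job can fail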

Perhaps one other thing you could try is reading the entire table and having Spark check whether there are duplicates. However, given the raw size of your table, this could also lead to memory issues.

df = spark.sql("SELECT * FROM table") # or select particular column(s)
if df.count() != df.dropDuplicates().count():
    with open('duplicate_tables.txt','a') as file:
        file.write('\n' + table)
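
If reading every column is too heavy, a lighter variant of the same idea (assuming the key columns from your query) is to compare counts on just those columns:

keys = spark.sql("SELECT key1, key2, key3 FROM table")  # only the columns that define a duplicate
if keys.count() != keys.dropDuplicates().count():
    with open('duplicate_tables.txt','a') as file:
        file.write('\n' + table)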
  • So I just ran the query by itself in the impala-shell and checked the stats: in total the peak memory seems to be ~4 GB. Where I'm running it, it shows as having 13 GB available. – formicaman Jun 11 '20 at 18:31

I found this to be the most efficient way to determine if a Spark DataFrame is indeed empty:

df.first() is None

after testing it against .limit(1), .head(), .count(), etc.
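
Applied to the duplicate check from the question (same table and key columns as there), that would look something like:

df = spark.sql("SELECT count(*) FROM table GROUP BY key1,key2,key3 HAVING count(*) > 1 LIMIT 1")
if df.first() is not None:   # at least one group has count > 1, i.e. duplicates exist
    with open('duplicate_tables.txt','a') as file:
        file.write('\n' + table)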