Is this somehow related to any of these questions?
- Why are Spark Parquet files for an aggregate larger than the original?
- Why does the repartition() method increase file size on disk?
- Why can't I reduce the total file size by merging small ORC files?
I have a folder of ORC files, about 3.9 GB on disk as reported by du -sh. All files have the same schema and are very small (roughly 128 KB each). I believe they contain a lot of duplicate rows, so I'd like to clean them up.
$ du -sh /pool/in
3.9G /pool/in
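To double-check the "lots of small files" part, I also looked at the per-file sizes on the driver host. This is just a local-filesystem sketch and assumes the ORC files sit directly in /pool/in:

import glob, os

# sizes of the individual input files, skipping _SUCCESS-style markers and hidden files
sizes = [os.path.getsize(p) for p in glob.glob("/pool/in/*")
         if not os.path.basename(p).startswith(("_", "."))]
print(len(sizes), "files, average", sum(sizes) / len(sizes) / 1024, "KB")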
I read the entire folder using this script:
# script.py
import pyspark.sql
import pyspark.sql.functions
instance_conf = pyspark.SparkConf()
instance_conf.set('spark.local.dir', '/pool/spark_temp')
instance_conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
spark = pyspark.sql.SparkSession.builder.config(conf=instance_conf).getOrCreate() # create a new session on this host
dirname = "/pool/in"
dirname_out = "/pool/out"
dataframe = spark.read.orc(dirname)
dataframe_out = dataframe.dropDuplicates() # across all columns
dataframe_out.count() # forces the deduplication to run; the result itself isn't used
dataframe_out.write.mode("overwrite").orc(dirname_out)
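One thing I have not ruled out is the compression codec: as far as I know Spark writes ORC with snappy unless spark.sql.orc.compression.codec says otherwise, and I don't know what codec the input files were written with. A minimal check, run in a pyspark shell against the same install; the "zlib" in the commented line is only an illustration of forcing a codec, not what my input actually uses:

# what codec will this session use when writing ORC? (default should be snappy)
print(spark.conf.get("spark.sql.orc.compression.codec", "snappy"))

# forcing a specific codec on the write for comparison, e.g. zlib
# dataframe_out.write.mode("overwrite").option("compression", "zlib").orc(dirname_out)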
I ran spark-submit with these settings:
/usr/local/spark-3.4.1-bin-hadoop3/bin/spark-submit --driver-memory 4g --executor-memory 2g --executor-cores 2 script.py
And the output folder is now 12GB!
$ du -sh /pool/out
12G /pool/out
I was not expecting the dropDuplicates() operation to roughly triple the size of the data on disk. Are the executor cores reading the data in twice? Why is this happening?
I noticed something was off when even /pool/out/_temporary was growing in size.
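To check whether the data really gets read twice, I looked at the physical plan in the same session. As I understand it, dropDuplicates() over all columns becomes a hash aggregate with one shuffle (Exchange) rather than a second scan, but I may be reading the plan wrong:

# print the physical plan for the deduplication; I expect a single file scan
# followed by HashAggregate / Exchange (shuffle) stages
spark.read.orc("/pool/in").dropDuplicates().explain()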
Just to add more to the confusion...
The number of files in each directory also differs:
$ ls /pool/in | wc -l
10626
$ ls /pool/out | wc -l
201
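The 201 entries in /pool/out look suspiciously like the default 200 shuffle partitions plus a _SUCCESS marker, which would fit a shuffle introduced by dropDuplicates(). A quick check (200 is the documented default of spark.sql.shuffle.partitions; I haven't changed it anywhere):

# number of partitions used after a shuffle, and presumably the number of
# part files the deduplicated write produces
print(spark.conf.get("spark.sql.shuffle.partitions", "200"))

# 12 GB over ~200 part files is about 60 MB per output file, versus a few hundred KB
# at most per input file, so the growth is in bytes per row, not in row count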
The deduplication itself did work: each id is its own group of data, and the number of distinct ids is the same before and after.
>>> x = spark.read.orc("/pool/out")
>>> x.count()
401149568
>>> x = spark.read.orc("/pool/in")
>>> x.count()
467271623
>>> x = spark.read.orc("/pool/out")
>>> x.select("id").distinct().count()
21174972
>>> x = spark.read.orc("/pool/in")
>>> x.select("id").distinct().count()
21174972