Though I'm answering late, I have a working solution.
Concept
First of all, you need to already have the exported files in a directory (an Azure Blob Storage container). These files are not yet split by file size (in my case, I used a temp folder). We need to read these files, apply the logic in the code below, and regenerate the files with the expected file size.
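For reference, the initial export to the temp folder could look something like the sketch below. This is only an illustration: it assumes the blob container is already mounted at /mnt/exported_files and that a dataframe named df holds the records (both are assumptions, not part of the original setup).
# Sketch of the prerequisite step (assumption: container mounted at /mnt/exported_files, dataframe df holds the data)
# This produces the plain exported CSV files in the temp folder, without any file-size logic applied yet
df.write.option('header', 'true').option("delimiter", ",").option("quoteAll", "true").option("quote", "\"").mode("overwrite").csv('/mnt/exported_files/temp')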
Code logic
The 1st parameter to the method is the number of records in the files we are going to read. The 2nd parameter is the expected size per file in MB. temp_path is where I have kept the exported CSV files, and target_path is where I will export the files with the file-size logic applied. I then read the file sizes using dbutils and work out how many records a file should hold to reach the expected size per file in MB.

After that, I read the files into a dataframe, df_temp. In the last line, make sure you keep the repartition value at 1. The maxRecordsPerFile option tells Spark how many records to put in each file to reach the expected size per file in MB (which we calculated earlier). The reason we use maxRecordsPerFile is that we don't know up front how many records produce a file of the size we expect. The method below determines that number and exports the files with that many records per file.
from functools import reduce
from operator import add

def split_files_by_size(total_records, file_size_in_mb):
    temp_path = '/mnt/exported_files/temp'
    target_path = '/mnt/exported_files'
    # Sum the sizes (in bytes) of the CSV files already exported to the temp folder
    csv_files = dbutils.fs.ls(temp_path)
    files_size = [file.size for file in csv_files if file.name.endswith('.csv')]
    # Records per file = total records / (total size in MB / expected size per file in MB)
    records_per_file = int(total_records / ((reduce(add, files_size) / 1024 / 1024) / file_size_in_mb))
    # Read the temp files and rewrite them, capping each output file at records_per_file records
    df_temp = spark.read.option('header', 'true').option("delimiter", ",").option("quoteAll", "true").option("quote", "\"").format("csv").load(temp_path)
    df_temp.repartition(1).write.option("maxRecordsPerFile", records_per_file).option('header', 'true').option("delimiter", ",").option("quoteAll", "true").option("quote", "\"").mode("overwrite").csv(target_path)
To execute this, call the method as in the example below.
split_files_by_size(50000,50)
The files are generated like below. In my case, I had 50k records in total, the actual export was 87 MB, and I had to generate 2 files out of it: a 50 MB file and a 37 MB file.

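To make the numbers concrete, this is roughly the calculation the method performs for this example (a sketch; it assumes the temp export totals about 87 MB, as in my case).
# Worked example of the records_per_file calculation (assumption: temp export totals ~87 MB)
total_records = 50000
total_size_mb = 87            # sum of the temp CSV sizes, converted from bytes to MB
file_size_in_mb = 50          # expected size per output file
records_per_file = int(total_records / (total_size_mb / file_size_in_mb))
print(records_per_file)       # 28735 -> used as maxRecordsPerFile, yielding a ~50 MB file and a ~37 MB file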
In case you need two files of equal size, you can change the repartition value to 2.
df_temp.repartition(2).write.option("maxRecordsPerFile", records_per_file).option('header','true').option("delimiter",",").option("quoteAll", "true").option("quote", "\"").mode("overwrite").csv(target_path)

The drawback of this solution is that it has to write the files an extra time, but there is no other way as of now.