Please help me with this case: I want to read a Spark DataFrame based on size (MB/GB), not on row count. Suppose I have 500 MB of space left for the user in my database and the user wants to insert 700 MB more data. How can I identify the table size from the JDBC driver, and how can I read only 500 MB of data from my 700 MB Spark DataFrame?
- You cannot do that, and it is not correct to do. Instead, calculate the average row size, then calculate the count of rows that can fit in 500 MB of space and take() only those records. – kavetiraviteja Sep 02 '20 at 06:10
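A minimal sketch of that suggestion, assuming a 500 MB budget and using Spark's SizeEstimator on a small collected sample to get an average row size. The helper name, sample size, and budget are illustrative; the estimate is rough, and an in-memory size is not the same as the size the rows occupy in the database.

import org.apache.spark.sql.DataFrame
import org.apache.spark.util.SizeEstimator

// Estimate the average in-memory row size from a small sample, then keep
// only as many rows as roughly fit in the given byte budget.
// Assumes the DataFrame is non-empty; limit() is used instead of take()
// so the result stays a DataFrame.
def limitBySize(df: DataFrame, budgetBytes: Long, sampleSize: Int = 1000): DataFrame = {
  val sample = df.limit(sampleSize).collect()
  val avgRowBytes = SizeEstimator.estimate(sample).toDouble / sample.length
  val rowsThatFit = math.min((budgetBytes / avgRowBytes).toLong, Int.MaxValue.toLong)
  df.limit(rowsThatFit.toInt)
}

// e.g. keep roughly 500 MB of rows out of the 700 MB DataFrame:
// val trimmed = limitBySize(df, 500L * 1024 * 1024)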
1 Answer
It is incorrect to limit the data size in a program. You should catch the exception and show it to the user. It's up to the user to decide whether to increase the database size or remove unwanted data from the database.
For the above question, Spark has something called SizeEstimator. I have not used it before, but chances are you won't get the exact data size, as it is an estimator.
import org.apache.spark.util.SizeEstimator
// Returns a rough estimate, in bytes, of the in-memory size of the object passed in
SizeEstimator.estimate(df)
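Note that SizeEstimator reports the estimated JVM heap footprint of the object passed to it, so estimating a collected sample of rows will usually say more about the data than estimating the lazy DataFrame reference itself.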
Please refer to this for more information.

Raviteja Sutrave
- Thanks for your suggestion, but unfortunately it did not work in my case. – Aaryan Roy Sep 02 '20 at 10:13