
To create a Spark DataFrame, we can read directly from raw data, pass an RDD, or pass a pandas DataFrame.

I experimented with these three methods.

Spark: standalone mode
Using the pyspark.sql module

Method 1: Read the text/CSV file with pandas and pass the pandas DataFrame to create the Spark DataFrame.

 df3=spark.createDataFrame(pandas_df)
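
For context, a minimal runnable sketch of Method 1 (the SparkSession setup and the line-by-line pandas read are assumed here; only the final `createDataFrame` call is from the experiment):

```python
import pandas as pd
from pyspark.sql import SparkSession

# Local standalone session (assumed setup; any existing SparkSession works)
spark = SparkSession.builder.master("local[*]").appName("df-creation").getOrCreate()

# Read the raw text into pandas first, one row per line, then hand it to Spark
with open("Data/bookpage.txt") as f:
    pandas_df = pd.DataFrame({"value": [line.rstrip("\n") for line in f]})

df3 = spark.createDataFrame(pandas_df)
print(df3.rdd.getNumPartitions())  # observed: 8 (local data is split across all 8 cores)
```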

Method 2: Create an RDD by passing the text file to `sc.textFile`, then use this RDD to create the Spark DataFrame.

df3=spark.createDataFrame(RDD_list, StringType())
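
A similar sketch for Method 2 (same assumed `spark` session as above; passing an atomic `StringType()` schema produces a single `value` column):

```python
from pyspark.sql.types import StringType

# Build an RDD of lines first, then turn it into a one-column DataFrame of strings
RDD_list = spark.sparkContext.textFile("Data/bookpage.txt")
df3 = spark.createDataFrame(RDD_list, StringType())

print(RDD_list.getNumPartitions())  # observed: 2 (sc.textFile's minPartitions defaults to min(defaultParallelism, 2))
print(df3.rdd.getNumPartitions())   # typically inherits the RDD's partitioning
```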

Method 3: Read directly from the raw data to create the Spark DataFrame.

df3=spark.read.text("Data/bookpage.txt")
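
And for Method 3, where Spark reads the file directly (same assumed session):

```python
# Spark reads the file itself and exposes one 'value' column per line
df3 = spark.read.text("Data/bookpage.txt")
print(df3.rdd.getNumPartitions())  # observed: 1 (for a small file, partitions follow the file splits)
```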

What I have observed:

  1. The number of default partitions differs in the three cases:
     Method 1 (pandas): 8 (I have 8 cores)
     Method 2 (RDD): 2
     Method 3 (direct raw read): 1
 
  2. Conversion path:
     Method 1: raw data => pandas DataFrame => Spark DataFrame
     Method 2: raw data => RDD => Spark DataFrame
     Method 3: raw data => Spark DataFrame

Questions:

  1. Which method is the most efficient?
  2. Since everything in Spark is implemented at the RDD level, does creating the RDD explicitly in Method 2 make it more efficient?
  3. For the same data, why are the default numbers of partitions different?
  • I suppose method 3 is the most efficient due to fewer conversions (meaning less overhead). The default number of partitions is different for different methods of df creation. You can always set the number of partitions manually when reading the files. – mck Jan 01 '21 at 07:14
  • Yes! @mck. I have observed the DAGs for the three methods. Method 3 is the most efficient and involves fewer operations (Scan text => MapPartitions internal). I understood that default partitions in Spark are done in the same way as input splits in Hadoop. Correct me if I am wrong: in Spark, default partitions are also managed based on the operations (the DAG shows quite a number of operations for methods 1 and 2) from reading the data to converting it to the final partitioned RDDs. – BeginnerRP Jan 01 '21 at 09:28
  • @mck, regarding setting the number of partitions: I read this answer on Stack Overflow: **DataFrame is a distributed data structure. It is neither required nor possible to parallelize it** [link](https://stackoverflow.com/questions/37557014/should-we-parallelize-a-dataframe-like-we-parallelize-a-seq-before-training#:~:text=DataFrame%20is%20a%20distributed%20data%20structure.%20It%20is,data%20structures%20which%20reside%20in%20the%20driver%20memory.) My understanding based on it is that its partitions are already optimized. Is it good practice to change the partitions? – BeginnerRP Jan 01 '21 at 09:37
  • I think you have misunderstood the answer there. It means a dataframe is *already* parallelized, but you can set the number of partitions using `df.repartition(x)` where x is the number of partitions – mck Jan 01 '21 at 09:38
  • And re. your first comment, you're right, the default number of partitions depends on the operations – mck Jan 01 '21 at 09:39
  • 1
  • re. changing partitions, it depends on what you want to do with the dataframe. Sometimes the partitioning isn't optimized (e.g. default num partitions after join operations = 200, but you don't want to do this for tiny dataframes) – mck Jan 01 '21 at 09:41
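
For reference, a small sketch of the repartitioning discussed in the comments (illustrative only; the 200 comes from Spark's default `spark.sql.shuffle.partitions`, used by wide operations such as joins):

```python
df3 = spark.read.text("Data/bookpage.txt")
print(df3.rdd.getNumPartitions())      # default depends on how the DataFrame was created

# Force a specific number of partitions (triggers a full shuffle)
df3 = df3.repartition(8)
print(df3.rdd.getNumPartitions())      # 8

# Joins and other wide operations use spark.sql.shuffle.partitions (200 by default),
# which is usually too many for tiny DataFrames
print(spark.conf.get("spark.sql.shuffle.partitions"))  # '200' unless overridden
```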

0 Answers