To create a Spark DataFrame, we can read directly from raw data, pass an RDD, or pass a pandas DataFrame.
I experimented with these three methods.
Spark: standalone mode, using the pyspark.sql module
Method 1: Read the text/CSV file with pandas and pass the resulting pandas DataFrame to create the Spark DataFrame.
df3 = spark.createDataFrame(pandas_df)
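A minimal sketch of how Method 1 can be set up (assumptions: the same bookpage.txt file, the file read line by line into a single pandas column, and the column name "value" chosen by me):

import pandas as pd
from pyspark.sql import SparkSession

# In the pyspark shell this returns the already-running session;
# the master (standalone here) comes from the launch configuration
spark = SparkSession.builder.getOrCreate()

# Read the raw text file line by line into a one-column pandas DataFrame
with open("Data/bookpage.txt") as f:
    pandas_df = pd.DataFrame({"value": f.read().splitlines()})

# Convert the local pandas DataFrame into a distributed Spark DataFrame
df3 = spark.createDataFrame(pandas_df)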
Method 2: Create an RDD by passing the text file to sc.textFile, then use that RDD to create the Spark DataFrame.
df3 = spark.createDataFrame(RDD_list, StringType())
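The corresponding sketch for Method 2 (assuming one string per line from sc.textFile; with an atomic StringType schema each string becomes one row in a single "value" column):

from pyspark.sql.types import StringType

# sc.textFile returns an RDD with one string element per line of the file
RDD_list = spark.sparkContext.textFile("Data/bookpage.txt")

# Each RDD element becomes one row of the resulting DataFrame
df3 = spark.createDataFrame(RDD_list, StringType())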
Method 3: Read the raw data directly to create the Spark DataFrame.
df3 = spark.read.text("Data/bookpage.txt")
What I observed:
- The number of default partitions differs across the three cases (checked with the sketch after this list):
Method 1 (pandas): 8 (I have 8 cores)
Method 2 (RDD): 2
Method 3 (direct raw read): 1
- Conversion path
Method 1: raw data => pandas DataFrame => Spark DataFrame
Method 2: raw data => RDD => Spark DataFrame
Method 3: raw data => Spark DataFrame
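For reference, this is roughly how the partition counts can be checked (a minimal sketch, assuming df3 is the DataFrame produced by the method under test):

# Number of partitions backing the DataFrame
print(df3.rdd.getNumPartitions())

# Default parallelism of this setup (8 on my 8-core machine)
print(spark.sparkContext.defaultParallelism)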
Questions:
- Which method is more efficient?
- Since everything in Spark is ultimately implemented at the RDD level, does explicitly creating the RDD in Method 2 make it more efficient?
- Why are the default partitions different for the same data?