Use of DataFrame vs. RDD in PySpark

Asked Oct 06 '22 at 15:49

Active Oct 06 '22 at 15:49

I am relatively new to PySpark and am using the framework for a big data analysis exercise. I believe I understand how Spark itself works, but I still struggle to apply that knowledge in practice.

I am reading a CSV dataset and am now trying to perform simple analyses (sum, count, groupBy, ...) on it. However, I do not know whether I should convert it into an RDD first or just work with the DataFrame directly. This matters because I want the analysis to scale well. (Currently I am using an RDD because, as far as I know, only then is the data distributed on my local machine.)

from pyspark.sql.functions import col
df.orderBy(col("favorites").desc()).show(5)  # DataFrame

vs.

df.rdd.sortBy(lambda data: data["favorites"], ascending=False).take(5)  # RDD

That is why I would appreciate your advice: what is the difference between RDDs and DataFrames, and which is the better choice in my case? Thanks for your help!

Martin Müller
