I am relatively new to PySpark and am using the framework for a big-data analysis exercise. I believe I understand how Spark itself works, but applying that knowledge in a practical context is still a challenge...
I am reading a CSV dataset and now want to perform simple analyses (sum, count, groupBy, ...) on it. However, I do not know whether to convert it to an RDD first or just work with the DataFrame directly. This matters because I want the job to scale well (currently I am using an RDD because, as far as I know, only then is the data actually distributed on my local machine).
df.orderBy(col("favorites").desc()).show(5) #Dataframe
vs.
df.rdd.sortBy(lambda data: data["favorites"], ascending=False).take(5) #RDD
That's why I would really appreciate your advice: what is the difference between RDDs and DataFrames, and which is the better choice in my case? Thanks for your help!