
How can I import a CSV file into PySpark as a Dataset? Note that I am NOT asking how to import it into a DataFrame.

While reading this page from Databricks, I learned some benefits of datasets over dataframes.

https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

I want to learn how to work with them instead of RDDs and dataframes.

Iterator516
  • In [this document](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes) you can read: *The Dataset API is available in Scala and Java. Python does not have the support for the Dataset API.* – furas Oct 15 '19 at 02:19

1 Answer


The linked blog post itself explains that this is not possible in Python:

Note: Since Python and R have no compile-time type-safety, we only have untyped APIs, namely DataFrames.
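
To make the typed versus untyped distinction concrete in Python terms, here is a minimal sketch that uses only the standard library, not the Spark API. It assumes a hypothetical `Person` record and hypothetical inline CSV content. Plain dict rows play the role of untyped DataFrame rows, while the dataclass imitates what a typed record would look like. Because Python has no compile-time type checking, the conversion can only fail at run time, which is the limitation the quote describes.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical CSV content, inlined so the sketch runs without any files.
CSV_TEXT = "name,age\nAlice,34\nBob,45\n"


@dataclass(frozen=True)
class Person:
    """Hypothetical record type, standing in for a typed row."""
    name: str
    age: int


def read_untyped(text):
    """Return rows as plain dicts; every value is a string."""
    return list(csv.DictReader(io.StringIO(text)))


def read_typed(text):
    """Convert each row to a Person; bad values raise only at run time."""
    return [Person(name=r["name"], age=int(r["age"])) for r in read_untyped(text)]


untyped_rows = read_untyped(CSV_TEXT)
typed_rows = read_typed(CSV_TEXT)

# Untyped access: the value is still a string and the key is looked up at run time.
print(untyped_rows[0]["age"])
# Typed access: the field is an attribute holding an int.
print(typed_rows[0].age + 1)
```

In PySpark itself, `spark.read.csv` returns a DataFrame, so a conversion like `read_typed` would have to be applied on the Python side to rows that have already been collected.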

cronoik
  • Hello @cronoik, aren't DataFrames and Datasets quite similar performance-wise? It is the RDD that differs from DataFrames/Datasets. The OP is comparing Dataset with DataFrame and wants to use Dataset instead, which seems strange. That is what I understand from the link. – PIG Oct 15 '19 at 06:30
  • They are quite similar (to be precise, since Spark 2.0 DataFrame is an alias for Dataset[Row]). The main difference is that Datasets are strongly typed. You might want to look at [this](https://stackoverflow.com/a/39033308/6664872) for further explanation. – cronoik Oct 16 '19 at 21:30