Import CSV file as PySpark Dataset (NOT Dataframes)

Question

How can I import a CSV file into PySpark as a Dataset? Note that I am NOT asking how to import it into a DataFrame.

While reading this page from Databricks, I learned some benefits of datasets over dataframes.

I want to learn how to work with them instead of RDDs and dataframes.

In [this document](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes) you can read: *The Dataset API is available in Scala and Java. Python does not have the support for the Dataset API.* – furas, Oct 15 '19 at 02:19

score 4 · Accepted Answer · answered Oct 15 '19 at 02:21


The linked blog post already gives the answer: this is not possible in Python, because of the language itself:

Note: Since Python and R have no compile-time type-safety, we only have untyped APIs, namely DataFrames.


cronoik

Hello @cronoik, aren't DataFrames and Datasets quite similar performance-wise? It is the RDD that differs from DataFrames/Datasets. The OP is comparing Dataset with DataFrame and wants to use Dataset over DataFrame, which is strange. That is what I understand from the link. – PIG Oct 15 '19 at 06:30
They are quite similar (to be precise, since Spark 2.0 DataFrame is an alias for Dataset[Row]). The main difference is that Datasets are strongly typed. You might want to look at [this](https://stackoverflow.com/a/39033308/6664872) for further explanation. – cronoik Oct 16 '19 at 21:30
