
I'm using pySpark 2.3 and trying to read a CSV file that looks like this:

0,0.000476517230863068,0.0008178378961061477
1,0.0008506156837329876,0.0008467260987257776

But it doesn't work:

from pyspark import sql, SparkConf, SparkContext
print (sc.applicationId)
>> <property at 0x7f47583a5548>
data_rdd = spark.textFile(name=tsv_data_path).filter(x.split(",")[0] != 1)

And I get an error:

AttributeError: 'SparkSession' object has no attribute 'textFile'

Any idea how I should read it in pySpark 2.3?

Cranjis

1 Answer


First, textFile exists on the SparkContext (called sc in the REPL), not on the SparkSession object (called spark in the REPL).
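
For reference, the snippet from the question works against sc once the filter is wrapped in a lambda; note that split() yields strings, so the comparison should be against "1" rather than the integer 1:

# Read through the SparkContext, not the SparkSession,
# and drop lines whose first field is "1".
data_rdd = (sc.textFile(tsv_data_path)
            .filter(lambda line: line.split(",")[0] != "1"))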

Second, for CSV data, I would recommend using the CSV DataFrame loading code, like this:

df = spark.read.format("csv").load("file:///path/to/file.csv")
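
Since the sample file has no header row, Spark will assign default column names (_c0, _c1, ...) and read every column as a string. If you want typed columns, a sketch along these lines should work:

# inferSchema asks Spark to sample the data and detect numeric types
# instead of leaving every column as a string.
df = (spark.read
      .format("csv")
      .option("inferSchema", "true")
      .load("file:///path/to/file.csv"))
df.printSchema()  # e.g. _c0: int, _c1: double, _c2: double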

You mentioned in the comments that you need the data as an RDD. You will get significantly better performance if you can keep all of your operations on DataFrames instead of RDDs. However, if you need to fall back to RDDs for some reason, you can do it like this:

rdd = df.rdd.map(lambda row: row.asDict())

This approach is better than trying to load the file with textFile and parsing the CSV yourself. The DataFrame CSV reader correctly handles the CSV edge cases for you, such as quoted fields. Also, if you only need some of the columns, you can filter on the DataFrame before converting it to an RDD, as sketched below, to avoid bringing all that extra data over into the Python interpreter.
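
As a sketch of that last point, assuming the default _c0/_c1/... column names from the headerless load above:

# Hypothetical example: drop rows whose first column is 1 and keep only
# the second column before handing the data to Python as an RDD.
filtered_rdd = (df.filter(df["_c0"] != 1)
                  .select("_c1")
                  .rdd
                  .map(lambda row: row["_c1"]))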

Ryan Widmaier