
I'm new to PySpark and would like to read a CSV file into a DataFrame. I can't seem to get it to read the file. Any help?

from pyspark.sql import SQLContext
import pyspark
from pyspark.sql import Row
import csv


sql_c = SQLContext(sc)

rdd = sc.textFile('data.csv').map(lambda line: line.split(","))

rdd.count()

Py4JJavaError                 Traceback (most recent call last)
in ()
----> 1 rdd.count()
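(As an aside, not part of the original post: even when the job runs, splitting lines on commas by hand breaks on quoted fields, which is one reason to prefer a real CSV reader over `line.split(",")`. A minimal illustration with Python's standard `csv` module:)

```python
import csv
from io import StringIO

line = 'a,"b,c",d'

# Naive splitting cuts through the quoted field
naive = line.split(",")                      # ['a', '"b', 'c"', 'd']

# csv.reader (like Spark's csv data source) respects the quoting
proper = next(csv.reader(StringIO(line)))    # ['a', 'b,c', 'd']
```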

Ajaxcbcb
2 Answers


If you use Spark 2, the preferred way is:

df = sql_c.read.csv('data.csv')
Vitalii Kotliarenko

To read a CSV file independent of the Spark version:

if sc.version.startswith("2"):
    # Spark 2+ ships a built-in csv data source
    csv_plugin = "csv"
else:
    # Spark 1.x needs the external Databricks spark-csv package
    csv_plugin = "com.databricks.spark.csv"

dataframe = sql_c.read.format(csv_plugin).options(header='true', inferSchema='true').load('data.csv')

Remove header='true' if you don't have a header.
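(The version check above can also be packaged as a small helper; the function name `csv_plugin_for` is mine, for illustration only:)

```python
def csv_plugin_for(version):
    """Pick the CSV reader format string for a given Spark version.

    Spark 2+ has a built-in csv data source; Spark 1.x needs the
    external com.databricks.spark.csv package instead.
    """
    return "csv" if version.startswith("2") else "com.databricks.spark.csv"
```

You would then call it as `sql_c.read.format(csv_plugin_for(sc.version))`.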

vds619