
I'm new to PySpark and would like to read a CSV file into a DataFrame. I can't seem to get it to read the file. Any help?

from pyspark.sql import SQLContext
import pyspark
from pyspark.sql import Row
import csv


sql_c = SQLContext(sc)

rdd = sc.textFile('data.csv').map(lambda line: line.split(","))

rdd.count()

Py4JJavaError                 Traceback (most recent call last)
in ()
----> 1 rdd.count()
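(As an aside, not part of the original post: even when the job runs, splitting lines on commas by hand breaks on quoted fields, which is one reason to prefer a real CSV reader over `line.split(",")`. A minimal illustration with Python's standard `csv` module:)

```python
import csv
from io import StringIO

line = 'a,"b,c",d'

# Naive splitting cuts through the quoted field
naive = line.split(",")                      # ['a', '"b', 'c"', 'd']

# csv.reader (like Spark's csv data source) respects the quoting
proper = next(csv.reader(StringIO(line)))    # ['a', 'b,c', 'd']
```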

Ajaxcbcb
2 Answers


If you use Spark 2, the preferred way is:

df = sql_c.read.csv('data.csv')
Vitalii Kotliarenko

To read a CSV file independent of the Spark version:

if sc.version.startswith("2"):
    # Spark 2+ ships a built-in csv data source
    csv_plugin = "csv"
else:
    # Spark 1.x needs the external Databricks spark-csv package
    csv_plugin = "com.databricks.spark.csv"

dataframe = sql_c.read.format(csv_plugin).options(header='true', inferSchema='true').load('data.csv')

Remove header='true' if you don't have a header.
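(The version check above can also be packaged as a small helper; the function name `csv_plugin_for` is mine, for illustration only:)

```python
def csv_plugin_for(version):
    """Pick the CSV reader format string for a given Spark version.

    Spark 2+ has a built-in csv data source; Spark 1.x needs the
    external com.databricks.spark.csv package instead.
    """
    return "csv" if version.startswith("2") else "com.databricks.spark.csv"
```

You would then call it as `sql_c.read.format(csv_plugin_for(sc.version))`.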

vds619