Here is how to add column names when creating a DataFrame:
Assume your CSV uses ',' as the delimiter. Prepare the data as follows before converting it to a DataFrame:
f = sc.textFile("s3://test/abc.csv")
data_rdd = f.map(lambda line: line.split(','))
Suppose the data has 3 columns:
data_rdd.take(1)
[[u'1.2', u'red', u'55.6']]
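Note that a plain split(',') breaks if any field contains a quoted comma. As a minimal sketch, assuming the file uses standard double-quote quoting, Python's csv module can parse each line instead. The helper name parse_csv_line is hypothetical and not part of Spark:

```python
import csv

def parse_csv_line(line, delimiter=","):
    """Split one CSV line into fields, honoring double-quoted fields."""
    # csv.reader accepts any iterable of lines, so wrap the single line in a list
    return next(csv.reader([line], delimiter=delimiter))

# A comma inside quotes stays within one field
print(parse_csv_line('1.2,"red, dark",55.6'))  # ['1.2', 'red, dark', '55.6']
```

It could then replace the lambda, as in f.map(parse_csv_line). Since textFile reads line by line, this still would not handle quoted fields that span multiple lines.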
Now you can specify the column names when converting this RDD to a DataFrame with toDF():
df_withcol = data_rdd.toDF(['height','color','width'])
df_withcol.printSchema()
root
|-- height: string (nullable = true)
|-- color: string (nullable = true)
|-- width: string (nullable = true)
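Every column is a string here because split(',') only produces strings. If numeric columns are wanted, one option is to convert each row before calling toDF(). The sketch below assumes the three-column layout shown above; to_typed_row is a hypothetical helper name:

```python
def to_typed_row(fields):
    """Convert [height, color, width] strings to (float, str, float)."""
    # Assumes exactly three fields, with numeric text in the first and last
    height, color, width = fields
    return (float(height), color, float(width))

print(to_typed_row(['1.2', 'red', '55.6']))  # (1.2, 'red', 55.6)
```

With this in place, data_rdd.map(to_typed_row).toDF(['height','color','width']) should infer height and width as double rather than string, since Spark derives column types from the Python values in each row.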
If you don't specify column names, you get a DataFrame with default column names '_1', '_2', ...:
df_default = data_rdd.toDF()
df_default.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
|-- _3: string (nullable = true)