
I am reading a dataset as below.

 f = sc.textFile("s3://test/abc.csv")

My file contains 50+ fields and I want to assign column headers to each of the fields so I can reference them later in my script.

How do I do that in PySpark? Is a DataFrame the way to go here?

PS - Newbie to Spark.

Ajay Ohri
GoldenPlatinum
    _Is DataFrame way to go here ?_ - yes it is. Or `namedtuple` if you plan to use a lot of plain Python code. – zero323 Apr 13 '16 at 20:19
  • A DataFrame is a combination of a schema and an RDD. Right now the variable `f` in your code is an RDD. If you add a schema to it, you get a DataFrame, which has column names. After that, you can also register this data as a table and run SQL on top of it. – Abhishek Anand Apr 14 '16 at 07:44

3 Answers


The solution to this question really depends on the version of Spark you are running. Assuming you are on Spark 2.0+, you can read the CSV in as a DataFrame and assign column names with `toDF`, which works both for converting an RDD to a DataFrame and for renaming the columns of an existing DataFrame.

filename = "/path/to/file.csv"
df = spark.read.csv(filename).toDF("col1","col2","col3")
BushMinusZero
  • My upvote. Thank you for sharing your knowledge. It worked like a charm in my `Spark` 3.0. – nam Jul 08 '22 at 03:39

Here is how to add column names using DataFrame:

Assume your CSV uses ',' as the delimiter. Prepare the data as follows before converting it to a DataFrame:

f = sc.textFile("s3://test/abc.csv")
data_rdd = f.map(lambda line: line.split(','))

Suppose the data has 3 columns:

data_rdd.take(1)
[[u'1.2', u'red', u'55.6']]

Now you can specify the column names when converting this RDD to a DataFrame using toDF():

df_withcol = data_rdd.toDF(['height','color','width'])

df_withcol.printSchema()

    root
     |-- height: string (nullable = true)
     |-- color: string (nullable = true)
     |-- width: string (nullable = true)

If you don't specify column names, you get a DataFrame with default column names '_1', '_2', ...:

df_default = data_rdd.toDF()

df_default.printSchema()

    root
     |-- _1: string (nullable = true)
     |-- _2: string (nullable = true)
     |-- _3: string (nullable = true)
Ida
f = sc.textFile("s3://test/abc.csv")
header = f.first()

header will give you the following:

u'col1,col2,col3'  --> for example, taking 3 column names

head = str(header).split(",")

head will give you a list:

['col1', 'col2', 'col3']

fDF = f.filter(lambda row: row != header).map(lambda x: str(x).split(",")).toDF(head)
fDF.show()

This will give you the header as well as the data in a DataFrame, as required.

MBT