
I am reading a dataset as below.

 f = sc.textFile("s3://test/abc.csv")

My file contains 50+ fields and I want to assign column headers to each of the fields so I can reference them later in my script.

How do I do that in PySpark? Is a DataFrame the way to go here?

PS - Newbie to Spark.

Ajay Ohri
GoldenPlatinum
    _Is DataFrame way to go here ?_ - yes it is. Or `namedtuple` if you plan to use a lot of plain Python code. – zero323 Apr 13 '16 at 20:19
  • A DataFrame is a combination of a schema and an RDD. Right now the variable `f` in your code is an RDD. If you add a schema to it, you get a DataFrame, which has column names. After that, you can also register this data as a table and run SQL on top of it. – Abhishek Anand Apr 14 '16 at 07:44

3 Answers


The solution to this question really depends on the version of Spark you are running. Assuming you are on Spark 2.0+, you can read the CSV in as a DataFrame and assign column names with `toDF`, which works both for converting an RDD to a DataFrame and for renaming the columns of an existing DataFrame.

filename = "/path/to/file.csv"
df = spark.read.csv(filename).toDF("col1","col2","col3")
BushMinusZero
  • My upvote. Thank you for sharing your knowledge. It worked like a charm in my `Spark` 3.0. – nam Jul 08 '22 at 03:39

Here is how to add column names using DataFrame:

Assume your CSV uses ',' as the delimiter. Prepare the data as follows before converting it to a DataFrame:

f = sc.textFile("s3://test/abc.csv")
data_rdd = f.map(lambda line: line.split(','))

Suppose the data has 3 columns:

data_rdd.take(1)
[[u'1.2', u'red', u'55.6']]

Now you can specify the column names when converting this RDD to a DataFrame using toDF():

df_withcol = data_rdd.toDF(['height','color','width'])

df_withcol.printSchema()

    root
     |-- height: string (nullable = true)
     |-- color: string (nullable = true)
     |-- width: string (nullable = true)

If you don't specify column names, you get a DataFrame with default column names '_1', '_2', ...:

df_default = data_rdd.toDF()

df_default.printSchema()

    root
     |-- _1: string (nullable = true)
     |-- _2: string (nullable = true)
     |-- _3: string (nullable = true)
Ida
f = sc.textFile("s3://test/abc.csv")
header = f.first()

header will give you the following:

u'col1,col2,col3'  --> for example, taking 3 column names

head = str(header).split(",")

head will give you a list:

['col1', 'col2', 'col3']

fDF = f.filter(lambda row: row != header).map(lambda x: str(x).split(",")).toDF(head)
fDF.show()

This will give you the header as well as the data in a DataFrame, as required.

MBT