
I'm new to PySpark, and I see there are two ways to select columns: ".select()" or ".withColumn()".

From what I've heard, ".withColumn()" is worse for performance, but other than that I'm confused as to why there are two ways to do the same thing.

So when am I supposed to use ".select()" instead of ".withColumn()"?

I've googled this question but I haven't found a clear explanation.

JTD2021

3 Answers


Using:

df.withColumn('new', func('old'))

where func is your Spark processing code applied to the column 'old', is equivalent to:

df.select('*', func('old').alias('new'))  # '*' selects all existing columns

As you can see, withColumn() is very convenient to use (probably why it is available); however, as you noted, there are performance implications. See this post for details: Spark DAG differs with 'withColumn' vs 'select'
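
To make the equivalence concrete, here is a minimal runnable sketch; the column name "old", the doubling expression, and the local SparkSession are just assumptions for the example:

# Minimal sketch of the equivalence above; column names and the
# doubling expression are illustrative, and a local SparkSession is assumed.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ["old"])

# Both produce the same columns: every existing column plus 'new'.
via_with_column = df.withColumn("new", F.col("old") * 2)
via_select = df.select("*", (F.col("old") * 2).alias("new"))

via_with_column.show()
via_select.show()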

bzu

@Robert Kossendey You can use a single select() in place of chaining multiple withColumn() statements, without suffering withColumn's performance implications. Likewise, there are cases where you may want or need to parameterize the columns being created: you can set variables for windows, conditions, values, etcetera, and use them to build your select statement.
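
A hypothetical sketch of what this comment describes, reusing the df from the sketch in the answer above; the column names and expressions are made up for the example:

import pyspark.sql.functions as F

# Chained withColumn() calls -- each one adds another projection to the plan.
chained = (
    df.withColumn("plus_one", F.col("old") + 1)
      .withColumn("doubled", F.col("old") * 2)
      .withColumn("minus_one", F.col("old") - 1)
)

# The same result from a single select(); the expressions can be built up
# in a list (i.e. parameterized) before the one projection is applied.
new_cols = [
    (F.col("old") + 1).alias("plus_one"),
    (F.col("old") * 2).alias("doubled"),
    (F.col("old") - 1).alias("minus_one"),
]
combined = df.select("*", *new_cols)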


.withColumn() is not for selecting columns; instead, it returns a new DataFrame with a new or replaced column (docs).
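
A short sketch of that behaviour, with made-up column names: if the name passed to withColumn() already exists, that column is replaced; otherwise a new column is added.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

added = df.withColumn("id_plus_one", F.col("id") + 1)  # new column appended
replaced = df.withColumn("id", F.col("id") * 10)       # existing 'id' replaced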

Robert Kossendey