
I have a data frame df with columns "col1" and "col2". I want to create a third column which uses one of the columns as an exponent.

df = df.withColumn("col3", 100**(df("col1")))*df("col2")

However, this always results in:

TypeError: unsupported operand type(s) for ** or pow(): 'float' and 'Column'

I understand that this is due to the function taking df("col1") as a "Column" instead of the item at that row.

If I perform

results = df.map(lambda x : 100**(df("col2"))*df("col2"))

this works, but I can't append to my original data frame.

Any thoughts?

This is my first time posting, so I apologize for any formatting problems.

zdcheng

2 Answers


Since Spark 1.4 you can use the pow function as follows:

from pyspark.sql import Row
from pyspark.sql.functions import pow, col

row = Row("col1", "col2")
df = sc.parallelize([row(1, 2), row(2, 3), row(3, 3)]).toDF()

df.select("*", pow(col("col1"), col("col2")).alias("pow")).show()

## +----+----+----+
## |col1|col2| pow|
## +----+----+----+
## |   1|   2| 1.0|
## |   2|   3| 8.0|
## |   3|   3|27.0|
## +----+----+----+
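
If the goal is the OP's original expression (100 raised to col1, multiplied by col2), the same pow can be combined with lit. A minimal sketch, assuming the df defined above:

from pyspark.sql.functions import lit

# 100 ** col1 * col2, built as a Column expression and evaluated per row
df.withColumn("col3", pow(lit(100), col("col1")) * col("col2")).show()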

If you use an older version, a Python UDF should do the trick:

import math
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

my_pow = udf(lambda x, y: math.pow(x, y), DoubleType())
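
The UDF can then be applied the same way as the built-in function; a minimal usage sketch, reusing the df and the col import from the first snippet:

# my_pow behaves like pow, but is evaluated in Python row by row
df.select("*", my_pow(col("col1"), col("col2")).alias("pow")).show()
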
zero323

Just to complement the accepted answer: one can now do something very similar to what the OP tried to do, i.e., use the ** operator, or even Python's built-in pow:

from pyspark.sql import SparkSession
from pyspark.sql.functions import pow as pow_

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, ), (2, ), (3, ), (4, ), (5, ), (6, )], 'n: int')

df = df.withColumn('pyspark_pow', pow_(df['n'], df['n'])) \
       .withColumn('python_pow', pow(df['n'], df['n'])) \
       .withColumn('double_star_operator', df['n'] ** df['n'])

df.show()

    +---+-----------+----------+--------------------+
    |  n|pyspark_pow|python_pow|double_star_operator|
    +---+-----------+----------+--------------------+
    |  1|        1.0|       1.0|                 1.0|
    |  2|        4.0|       4.0|                 4.0|
    |  3|       27.0|      27.0|                27.0|
    |  4|      256.0|     256.0|               256.0|
    |  5|     3125.0|    3125.0|              3125.0|
    |  6|    46656.0|   46656.0|             46656.0|
    +---+-----------+----------+--------------------+

As one can see, PySpark's pow, Python's pow, and the ** operator all return the same result. It also works when one of the arguments is a scalar:

df = df.withColumn('pyspark_pow', pow_(2, df['n'])) \
       .withColumn('python_pow', pow(2, df['n'])) \
       .withColumn('double_star_operator', 2 ** df['n'])

df.show()
   
    +---+-----------+----------+--------------------+
    |  n|pyspark_pow|python_pow|double_star_operator|
    +---+-----------+----------+--------------------+
    |  1|        2.0|       2.0|                 2.0|
    |  2|        4.0|       4.0|                 4.0|
    |  3|        8.0|       8.0|                 8.0|
    |  4|       16.0|      16.0|                16.0|
    |  5|       32.0|      32.0|                32.0|
    |  6|       64.0|      64.0|                64.0|
    +---+-----------+----------+--------------------+
    

I believe the reason Python's pow now works on PySpark columns is that pow is equivalent to the ** operator when used with only two arguments (see the Python docs), and that the ** operator uses the object's own implementation of the power operation, if it is defined for the object being operated on (see this SO answer).

Apparently, PySpark's Column has the proper definition of the __pow__ operator (see the source code for Column).
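
A quick way to see this (a sketch I am adding, assuming the df from above; not part of the original answer): the operators do not compute anything eagerly, they just build Column expressions.

expr = df['n'] ** df['n']     # dispatches to Column.__pow__
scalar_expr = 2 ** df['n']    # a scalar base is handled by Column.__rpow__
print(type(expr))             # <class 'pyspark.sql.column.Column'>
print(type(scalar_expr))      # <class 'pyspark.sql.column.Column'>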

I am not sure why the ** operator did not work originally, but I am assuming it is related to the fact that, at the time, Column was defined differently.


The stack used for testing was Python 3.8.5 and PySpark 3.1.1, but I have seen this behavior for PySpark >= 2.4 as well.

PMHM