I want to use the Spark SQL substring function to extract a substring from the string in one column, using the length of the string in a second column as the len parameter. How do I do this?
The following set-up is reproducible:
from pyspark.sql import functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('prefix body suffix', 'suffix')], ['a', 'b'])
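To make the goal concrete: for the sample row I want everything in a up to, but not including, the value of b. In plain Python that would be:

# Plain-Python sketch of the intended result for the sample row
a, b = 'prefix body suffix', 'suffix'
expected = a[:len(a) - len(b)]  # 'prefix body ' (12 characters)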
The problem: the following raises TypeError: 'Column' object is not callable:
df = df.withColumn('noSuffix',
    F.substring(
        str=F.col('a'),
        pos=1,
        len=F.length('a') - F.length('b')))
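For what it's worth, the same logic does seem to be accepted as a raw SQL expression, so the limitation appears to be in the Python wrapper rather than in Spark SQL itself:

# The SQL form of the same logic runs fine (tbl is just a temp view of df)
df.createOrReplaceTempView('tbl')
spark.sql(
    "SELECT a, b, substring(a, 1, length(a) - length(b)) AS noSuffix FROM tbl").show()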
The following works, but I still can't pass the resulting length column into the substring function:
df = df.withColumn('length', F.length('a') - F.length('b'))
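Feeding that precomputed column back in fails the same way, which suggests the Python substring wrapper only takes plain integers for pos and len:

# Also fails: len here is a Column, not an int
df = df.withColumn('noSuffix', F.substring(F.col('a'), 1, F.col('length')))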
I hit the same problem with the substring_index function:
df = df.withColumn('outCol',
    F.substring_index(
        F.col('a'),
        F.col('b'),
        1))
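For contrast, substring_index seems happy with a literal delimiter in place of the Column, so passing a Column appears to be the sticking point:

# Works as far as I can tell: literal delimiter rather than F.col('b')
df = df.withColumn('outCol', F.substring_index(F.col('a'), 'suffix', 1))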
Is there a way to do this without writing an RDD-level function and then using df.rdd.map(rddFunction).toDF()?