
I want to use the Spark SQL substring function to take a substring from a string in one column, using the length of a string in a second column of the same row as a parameter.

How do I do this?

The following set-up is reproducible.

import pyspark
from pyspark.sql import functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()    

df = spark.createDataFrame([('prefix body suffix','suffix',)], ['a', 'b',])

The following attempt raises TypeError: 'Column' object is not callable:

df = df.withColumn('noSuffix',
        F.substring(
            str = F.col('a'),
            pos = 1,
            len = F.length('a') - F.length('b')))

The following works, but I still can't pass the resulting integer column to the substring function.

df = df.withColumn('length', F.length('a') - F.length('b'))

The same problem occurs with the substring_index function:

df = df.withColumn('outCol',
        F.substring_index( 
            F.col('a'),
            F.col('b'),
            1))

Is there a way to do this without writing an RDD function and then using df.rdd.map(rddFunction).toDF()?

  • See the dupe I linked for how to do this using `pyspark.sql.functions.expr`. However, if you're just trying to remove the string in `b` from `a`, you could also use `regexp_replace`. – pault Oct 19 '18 at 18:41
  • OK, this works: `df = df.select('a', 'b', F.expr("substring(a, 1, (length(a) - length(b)))").alias('noSuffix'))`. Although it's not very intuitive or pretty, considering all the `F.` prefixes and quotes around column names have to be removed. This is an oversimplified example of my real problem, and `regexp_replace` won't do. Thanks. – Clay Oct 19 '18 at 19:29
