Having a column with multiple types is not currently supported. However, if the column contains an array of strings, you can explode the array (https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode#pyspark.sql.functions.explode), which creates a row for each element in the array, and then apply the regular expression to the new column. Example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Build a single-column DataFrame of strings.
df = spark.createDataFrame([("hello world",),
                            ("hello madam",),
                            ("hello sir",),
                            ("hello everybody",),
                            ("goodbye world",)], schema=['test'])
# Wrap each string in a one-element array to simulate an array column.
df = df.withColumn('test', F.array(F.col('test')))
df.show()  # show() prints the table itself; wrapping it in print() would also print "None"
# explode() emits one row per array element.
df = df.withColumn('test-exploded', F.explode(F.col('test')))
df = df.withColumn('test-exploded-regex', F.regexp_replace(F.col('test-exploded'), "hello", "goodbye"))
df.show()
Output:
+-----------------+
|             test|
+-----------------+
|    [hello world]|
|    [hello madam]|
|      [hello sir]|
|[hello everybody]|
|  [goodbye world]|
+-----------------+
+-----------------+---------------+-------------------+
|             test|  test-exploded|test-exploded-regex|
+-----------------+---------------+-------------------+
|    [hello world]|    hello world|      goodbye world|
|    [hello madam]|    hello madam|      goodbye madam|
|      [hello sir]|      hello sir|        goodbye sir|
|[hello everybody]|hello everybody|  goodbye everybody|
|  [goodbye world]|  goodbye world|      goodbye world|
+-----------------+---------------+-------------------+
And if you wanted to put the results back in an array:
df = df.withColumn('test-exploded-regex-array', F.array(F.col('test-exploded-regex')))
df.show()
Output:
+-----------------+---------------+-------------------+-------------------------+
|             test|  test-exploded|test-exploded-regex|test-exploded-regex-array|
+-----------------+---------------+-------------------+-------------------------+
|    [hello world]|    hello world|      goodbye world|          [goodbye world]|
|    [hello madam]|    hello madam|      goodbye madam|          [goodbye madam]|
|      [hello sir]|      hello sir|        goodbye sir|            [goodbye sir]|
|[hello everybody]|hello everybody|  goodbye everybody|      [goodbye everybody]|
|  [goodbye world]|  goodbye world|      goodbye world|          [goodbye world]|
+-----------------+---------------+-------------------+-------------------------+
Hope this helps!
Update
Updated to include the case where the array column holds several strings:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("hello world", "foo"),
                            ("hello madam", "bar"),
                            ("hello sir", "baz"),
                            ("hello everybody", "boo"),
                            ("goodbye world", "bah")], schema=['test', 'test2'])
# Combine both columns into a two-element array column.
df = df.withColumn('test', F.array(F.col('test'), F.col('test2'))).drop('test2')
# Tag each row with a unique id so the exploded rows can be regrouped later.
df = df.withColumn('id', F.monotonically_increasing_id())
df.show()
df = df.withColumn('test-exploded', F.explode(F.col('test')))
df = df.withColumn('test-exploded-regex', F.regexp_replace(F.col('test-exploded'), "hello", "goodbye"))
# Collect the elements back into one array per id.
# Note: collect_list does not guarantee element order after a shuffle.
df = df.groupBy('id').agg(F.collect_list(F.col('test-exploded-regex')).alias('test-exploded-regex-array'))
df.show()
Output:
+--------------------+-----------+
|                test|         id|
+--------------------+-----------+
|  [hello world, foo]|          0|
|  [hello madam, bar]| 8589934592|
|    [hello sir, baz]|17179869184|
|[hello everybody,...|25769803776|
|[goodbye world, bah]|25769803777|
+--------------------+-----------+
+-----------+-------------------------+
|         id|test-exploded-regex-array|
+-----------+-------------------------+
| 8589934592|     [goodbye madam, bar]|
|          0|     [goodbye world, foo]|
|25769803776|     [goodbye everybod...|
|25769803777|     [goodbye world, bah]|
|17179869184|       [goodbye sir, baz]|
+-----------+-------------------------+
Just drop the id column when you're finished processing!
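If it helps to see what the explode / regexp_replace / groupBy + collect_list steps do to the data, here is a rough plain-Python model of the same flow, with no Spark involved. It is only a sketch of the semantics: the helper name replace_in_arrays and the dict-of-lists input are made up for illustration, and Python's re module stands in for Spark's Java-based regex engine, so patterns beyond simple literals may behave differently.

```python
import re


def replace_in_arrays(rows, pattern, replacement):
    """Mimic explode -> regexp_replace -> groupBy(id) + collect_list.

    `rows` maps a row id to its list of strings; it is a hypothetical
    stand-in for the DataFrame with an 'id' and an array column.
    """
    # "explode": one (id, element) pair per array element
    exploded = [(row_id, item) for row_id, items in rows.items() for item in items]
    # "regexp_replace": apply the substitution to every exploded element
    replaced = [(row_id, re.sub(pattern, replacement, item)) for row_id, item in exploded]
    # "groupBy + collect_list": gather the elements back under their id
    grouped = {}
    for row_id, item in replaced:
        grouped.setdefault(row_id, []).append(item)
    return grouped


rows = {
    0: ["hello world", "foo"],
    1: ["hello madam", "bar"],
    2: ["goodbye world", "bah"],
}
result = replace_in_arrays(rows, "hello", "goodbye")
print(result)
```

One thing the model makes visible: a row whose array is empty produces no exploded elements, so its id never reaches the grouping step and the row disappears from the result.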