Goal
If `withColumn` doesn't already run in parallel, I need to parallelize the `withColumn` functionality in order to transform the values in a (large) DataFrame column much faster.
Background
I am using PySpark.
I have a format.json file that contains certain regex searches and transform operation instructions.
Here's a sample:
```json
{
  "currency": {
    "dtype": "float64",
    "searches": [
      {
        "regex_search": "^[\\s]*\\$[+-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:\\.[0-9]{2})?[\\s]*$",
        "op": {
          "replace": {
            "regex": "[$,]",
            "sub": ""
          }
        }
      },
      {
        "regex_search": "^[\\s]*See Award Doc[\\s]*$",
        "op": {
          "replace": {
            "regex": "^.*$",
            "sub": ""
          }
        }
      }
    ]
  }
}
```
I have a UDF that essentially looks through a column and tries to match and transform the data.
```python
from pyspark.sql import functions as f
from pyspark.sql.types import StringType

# transform data so that the column can be cast properly
udf_match_and_transform = f.udf(lambda data: match_and_transform(data, format_dict), StringType())
df = df.withColumn(column_name, udf_match_and_transform(df[column_name]))
```
The `match_and_transform` function looks at the format dictionary, tries to match the given value against each format, and returns the transformed value.
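For context, here is a simplified sketch of the function (the real one handles more op types; the structure mirrors the JSON above):

```python
import re

def match_and_transform(data, format_dict):
    # Simplified sketch: try each regex search; on the first match,
    # apply its replace op. The real function handles more op types.
    if data is None:
        return data
    for fmt in format_dict.values():
        for search in fmt["searches"]:
            if re.match(search["regex_search"], data):
                replace = search["op"]["replace"]
                return re.sub(replace["regex"], replace["sub"], data)
    return data
```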
The last line uses `withColumn` to map the UDF over the column, replacing the old column with the newly transformed one. The problem is that it seems to take a very long time compared to other operations I run on the columns, like frequency, percentile, mean, min, max, stddev, etc.
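To see how much parallelism is even theoretically available, I can check the partition count, since a UDF runs once per partition (assuming `spark` is the active SparkSession):

```python
# a UDF applied via withColumn runs per-partition,
# so the partition count caps how many cores it can use
print(df.rdd.getNumPartitions())
print(spark.sparkContext.defaultParallelism)
```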
TLDR
It seems like the `withColumn` function of a DataFrame does not utilize multiple cores; how can I get similar functionality but with parallelization?
Notes
I'm testing this locally on a machine with 12 cores and 16 GB RAM.
I know I can convert the DataFrame to an RDD and parallelize there (roughly as sketched after these notes), but I'd like to avoid this if there's another way.
I also know that it would be faster to use a Scala/Java UDF, but would that still run on a single core when applied with `withColumn`?
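For reference, this is roughly the RDD detour I'd like to avoid (a sketch only; `spark` is the active SparkSession):

```python
# the RDD route: rewrite one field per row, then rebuild the DataFrame
cols = df.columns
rdd = df.rdd.map(tuple).map(
    lambda vals: tuple(
        match_and_transform(v, format_dict) if c == column_name else v
        for c, v in zip(cols, vals)
    )
)
df2 = spark.createDataFrame(rdd, schema=df.schema)
```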