
Goal

If withColumn doesn't already run in parallel, I need to parallelize the withColumn functionality in order to transform the values in a (large) DataFrame column much faster.

Background

I am using PySpark.

I have a format.json file that contains regex searches and the transform operations to apply when they match.

Here's a sample:

{
  "currency": {
    "dtype": "float64",
    "searches": [
      {
        "regex_search": "^[\\s]*\\$[+-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:\\.[0-9]{2})?[\\s]*$",
        "op": {
          "replace": {
            "regex": "[$,]",
            "sub": ""
          }
        }
      },
      {
        "regex_search": "^[\\s]*See Award Doc[\\s]*$",
        "op": {
          "replace": {
            "regex": "^.*$",
            "sub": ""
          }
        }
      }
    ]
  }
}

I have a UDF that essentially looks through a column and tries to match and transform the data.

import pyspark.sql.functions as f
from pyspark.sql.types import StringType

# transform data so that the column can be cast properly
udf_match_and_transform = f.udf(lambda data: match_and_transform(data, format_dict), StringType())
df = df.withColumn(column_name, udf_match_and_transform(df[column_name]))

The match_and_transform function looks at the format dictionary, tries to match the given value to a format, and then returns the transformed value.
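Roughly, it behaves like this sketch (inferred from the format.json sample above, not the exact implementation):

import re

# Sketch: try each format's searches in order and apply the first matching
# replace op; values that match nothing are returned unchanged.
def match_and_transform(data, format_dict):
    if data is None:
        return None
    for fmt in format_dict.values():
        for search in fmt["searches"]:
            if re.match(search["regex_search"], data):
                replace = search["op"]["replace"]
                return re.sub(replace["regex"], replace["sub"], data)
    return data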

The last line uses withColumn to map the UDF over the column, replacing the old column with the newly transformed one. The problem is that this seems to take a very long time compared to other operations I run on the columns, like frequency, percentile, mean, min, max, stddev, etc.

TLDR

It seems like the withColumn function of a DataFrame does not utilize multiple cores; how can I get similar functionality but with parallelization?
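My understanding (an assumption on my part) is that Spark runs one task per partition, so a UDF applied through withColumn can only use as many cores as there are partitions. This sketch checks the partition count and raises it to match my 12 cores:

# Sketch: a UDF applied via withColumn runs on at most getNumPartitions()
# cores at once; 12 matches my machine's core count and is just a guess.
print(df.rdd.getNumPartitions())
df = df.repartition(12)
df = df.withColumn(column_name, udf_match_and_transform(df[column_name]))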

Notes

I'm testing this locally on a machine with 12 cores and 16 GB RAM.
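For reference, the local session setup looks roughly like this; "local[12]" is an assumption matching the core count (plain "local" would run everything on a single thread):

from pyspark.sql import SparkSession

# Sketch: "local[12]" runs 12 worker threads, one per core; the app name
# is hypothetical.
spark = (SparkSession.builder
         .master("local[12]")
         .appName("column-format-transform")
         .getOrCreate())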

I know I can convert the DataFrame to an RDD and parallelize, but I'd like to avoid this if there's another way.

I also know that it would be faster to use a Scala/Java UDF, but wouldn't that still be performed on one core when applied with withColumn?
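One alternative that might help either way: a vectorized pandas UDF (Spark 2.3+) processes a whole pandas Series per batch instead of making one Python call per row, while still running one task per partition. A rough sketch, assuming match_and_transform and format_dict are importable on the workers:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

# Sketch: a scalar pandas UDF receives a pandas Series per batch, cutting
# per-row serialization overhead compared to a plain Python UDF.
@pandas_udf(StringType())
def pandas_match_and_transform(col: pd.Series) -> pd.Series:
    return col.apply(lambda v: match_and_transform(v, format_dict))

df = df.withColumn(column_name, pandas_match_and_transform(df[column_name]))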

pehr.ans
    Your biggest slowdown probably comes from [using the `udf`](https://stackoverflow.com/a/38297050/5858851) - there is a `regexp_replace` function in `pyspark.sql.functions` – pault Sep 19 '18 at 15:06
  • Would withColumn use regexp_replace in parallel? I mean, I guess I can write a SQL expression that more or less does the same thing as the UDF, and calling that on the DataFrame would be faster, right? The crux of the problem is that the column has many, many rows, and I would save a lot of time if I could apply whatever function in parallel, if possible. – pehr.ans Sep 19 '18 at 15:14
  • @pehr.ans - Did the issue get solved? I am running into the same kind of issue, where I have a list of regexes to find and replace over a single column, and I am trying to find an optimized approach. – kavin Apr 14 '20 at 01:20
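A sketch of what pault's suggestion could look like using only built-in column expressions (patterns copied from the format.json sample above). This stays in the JVM and runs in parallel across partitions, with no Python UDF overhead:

import pyspark.sql.functions as f

# Sketch: express each search/op pair from format.json as a when/otherwise
# chain so that only matching values are transformed, as in the UDF.
col = f.col(column_name)
df = df.withColumn(
    column_name,
    f.when(col.rlike(r"^[\s]*\$[+-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?[\s]*$"),
           f.regexp_replace(col, r"[$,]", ""))
     .when(col.rlike(r"^[\s]*See Award Doc[\s]*$"), f.lit(""))
     .otherwise(col),
)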

0 Answers