
I've been struggling with this for a long time, and I'd be happy if someone could help me resolve the following issue.

I have this table:

+-----+-------------+
|index|       Salary|
+-----+-------------+
|    1|200 - 300 PA.|
|    2|      400 PA.|
|    3|100 - 200 PA.|
|    4|700 - 800 PA.|
+-----+-------------+

The Salary column is of String type. I want to replace each String in Salary with the average of the range it contains (or just the number if there is no range), so the data will be numeric rather than String. I want to create this table:

+-----+------+
|index|Salary|
+-----+------+
|    1|   250|
|    2|   400|
|    3|   150|
|    4|   750|
+-----+------+

I tried doing it by first turning the Salary string into an array, so it looks like this:

 ["100", "-", "300", "PA."] -

so I could extract the numbers from the whole string. I tried the following, but it looks bad and it doesn't work:

curr = outDF.rdd.map(lambda rec: rec[:]).map(lambda rec : rec[0])
curr  = curr.map(lambda t : (t[1], t[3])).toDF()
new_df = curr.withColumn("_1", custProdSpending["_1"].cast(IntegerType()))
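
To show what I mean by that array step, here's a rough sketch of just the tokenising part with split (assuming my DataFrame is called outDF; this isn't the fix, only the split into tokens):

from pyspark.sql import functions as F

# Split each Salary string on whitespace so a row like '200 - 300 PA.'
# becomes the array ['200', '-', '300', 'PA.']
tokens_df = outDF.withColumn("tokens", F.split(F.col("Salary"), r"\s+"))
tokens_df.show(truncate=False)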
    Can you please provide a nicely formatted table that shows the input DataFrame and the desired output? It's hard to understand exactly what you're trying to solve without a sample input dataset with representative data. – Powers Jul 24 '20 at 14:46
  • Thanks, [heres how you can include the input data](https://stackoverflow.com/a/48427186/1125159) in your question. I'm not going to take the time to look through your image and figure out how to create your representative DataFrame, but if you provide me with the code snippet, I'll be happy to help you out ;) – Powers Jul 25 '20 at 12:34
  • Hope it's better now, did my best to represent the issue. – A.v. Jul 25 '20 at 16:27

2 Answers


Here's how we can create the DataFrame.

df = spark.createDataFrame([(1, '200 - 300 PA.'), (2, '400 PA.'), (3, None)], ['index', 'Salary'])
+-----+-------------+
|index|       Salary|
+-----+-------------+
|    1|200 - 300 PA.|
|    2|      400 PA.|
|    3|         null|
+-----+-------------+

Let's make a UDF that extracts all the integers from a string, and use the aggregate higher-order function introduced in Spark 2.4 to average the integers in each array.

import re

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

@F.udf(returnType=ArrayType(IntegerType()))
def regexp_extract_all_integer(s, regexp):
    # Pull every match of regexp out of s and cast it to int (a null input stays null)
    return None if s is None else [int(match) for match in re.findall(regexp, s)]

query = """aggregate(
    `{col}`,
    CAST(0.0 AS double),
    (acc, x) -> acc + x,
    acc -> acc / size(`{col}`)
) AS  `avg_{col}`""".format(col="all_numbers")

actual_df = df.withColumn(
    "all_numbers",
    regexp_extract_all_integer(F.col("Salary"), F.lit(r'(\d+)'))
)
actual_df.selectExpr("*", query).show()
+-----+-------------+-----------+---------------+
|index|       Salary|all_numbers|avg_all_numbers|
+-----+-------------+-----------+---------------+
|    1|200 - 300 PA.| [200, 300]|          250.0|
|    2|      400 PA.|      [400]|          400.0|
|    3|         null|       null|           null|
+-----+-------------+-----------+---------------+
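
If you want the result in exactly the shape the question asks for (just index and a numeric Salary), you can project and rename afterwards; a quick sketch reusing the names above:

# Keep only index plus the averaged value, renamed to Salary to match the desired output
final_df = (actual_df
            .selectExpr("index", query)
            .withColumnRenamed("avg_all_numbers", "Salary"))
final_df.show()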

I'll abstract the regexp_extract_all code into quinn, so you don't need to maintain it yourself. Sorry for the messy code - the PySpark API makes your particular question difficult to answer.
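
As an aside, if you're on Spark 3.1 or later, the built-in regexp_extract_all SQL function should let you skip the Python UDF entirely; a rough, untested sketch with the same column names:

# Spark 3.1+: extract the digits with regexp_extract_all, cast them to int with transform,
# then reuse the aggregate expression from above.
no_udf_df = df.withColumn(
    "all_numbers",
    F.expr(r"transform(regexp_extract_all(Salary, '(\\d+)', 1), x -> cast(x as int))")
)
no_udf_df.selectExpr("*", query).show()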

Powers

Use regex to clean, turn the cleaned column into an array, and take the mean using reduce (reduce does the same thing as aggregate). Code below:

from pyspark.sql import functions as F

actual_df = (df.withColumn("all_numbers", F.regexp_replace(F.col("Salary"), r'\D[^\-]', ''))  # remove the dash and other non-digit text, leaving the numbers separated by spaces
               .withColumn("all_numbers", F.split(F.col("all_numbers"), r'\s'))  # split on whitespace into an array of number strings
               .withColumn('avg_all_numbers', F.expr("reduce(all_numbers, cast(0 as double), (x, i) -> x + i, x -> x / size(all_numbers))"))  # average with reduce: sum the elements, then divide by the array size
            )
actual_df.show()

+-----+-------------+-----------+---------------+
|index|       Salary|all_numbers|avg_all_numbers|
+-----+-------------+-----------+---------------+
|    1|200 - 300 PA.| [200, 300]|          250.0|
|    2|      400 PA.|      [400]|          400.0|
+-----+-------------+-----------+---------------+
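
Note that split produces an array of strings, so the reduce above relies on Spark's implicit string-to-double cast. If you'd rather make the array numeric explicitly before averaging, a small (untested) tweak like this works:

# Cast each element of the string array to int first (transform is built in since Spark 2.4)
actual_df = actual_df.withColumn("all_numbers", F.expr("transform(all_numbers, x -> cast(x as int))"))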
wwnde