I want to normalize my DataFrame by group in PySpark. The solution proposed here does not help, as I want to transform every column in my DataFrame. The code I used in Python on a pandas DataFrame is the following:
df_norm = (X_df
           .groupby('group')
           .transform(lambda x: (x - x.min()) / (x.max() - x.min()))
           .fillna(0))
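For reference, a self-contained version of that snippet, built from the example data listed further down (note that transform drops the group column, which I still need in the result):

import pandas as pd

columns = ['group', 'sensor1', 'sensor2', 'sensor3']
vals = [
    ('a', 0.8, 0.02, 100),
    ('a', 0.5, 0.1, 200),
    ('a', 1, 0.5, 50),
    ('a', 0, 0.8, 30),
    ('b', 10, 1, 0),
    ('b', 20, 2, 3),
    ('b', 5, 4, 1),
]
X_df = pd.DataFrame(vals, columns=columns)

# min-max normalize every sensor column within each group;
# fillna(0) covers columns that are constant within a group (0/0 -> NaN)
df_norm = (X_df
           .groupby('group')
           .transform(lambda x: (x - x.min()) / (x.max() - x.min()))
           .fillna(0))
print(df_norm)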
How can I do this in PySpark, either with a DataFrame or with an RDD?
Example input:
columns = ['group', 'sensor1', 'sensor2', 'sensor3']
vals = [
    ('a', 0.8, 0.02, 100),
    ('a', 0.5, 0.1, 200),
    ('a', 1, 0.5, 50),
    ('a', 0, 0.8, 30),
    ('b', 10, 1, 0),
    ('b', 20, 2, 3),
    ('b', 5, 4, 1),
]
Desired output:
columns = ['group','sensor1', 'sensor2', 'sensor3']
vals = [
    ('a', 0.8, 0, 0.4118),
    ('a', 0.5, 0.1026, 1),
    ('a', 1, 0.615, 0.1176),
    ('a', 0, 1, 0),
    ('b', 0.333, 0, 0),
    ('b', 1, 0.333, 1),
    ('b', 0, 1, 0.333),
]
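For what it's worth, here is a sketch of the rough direction I have been experimenting with on the DataFrame side, assuming per-group min/max can be taken via Window functions; I am not sure this is the idiomatic or efficient way to do it (the when/otherwise guard is my own stand-in for fillna(0)):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

columns = ['group', 'sensor1', 'sensor2', 'sensor3']
vals = [
    ('a', 0.8, 0.02, 100.0),
    ('a', 0.5, 0.1, 200.0),
    ('a', 1.0, 0.5, 50.0),
    ('a', 0.0, 0.8, 30.0),
    ('b', 10.0, 1.0, 0.0),
    ('b', 20.0, 2.0, 3.0),
    ('b', 5.0, 4.0, 1.0),
]
# floats everywhere so schema inference does not mix LongType and DoubleType
df = spark.createDataFrame(vals, columns)

# per-group window: min/max are computed over all rows of the same group
w = Window.partitionBy('group')

df_norm = df
for c in df.columns:
    if c == 'group':
        continue
    rng = F.max(c).over(w) - F.min(c).over(w)
    df_norm = df_norm.withColumn(
        c,
        # same role as fillna(0) in pandas: a column constant within a group maps to 0
        F.when(rng == 0, F.lit(0.0))
         .otherwise((F.col(c) - F.min(c).over(w)) / rng)
    )

df_norm.show()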