
I have the following data frame:


import pandas as pd
import numpy as np
df = pd.DataFrame([[1,2,3],[1,2,1],[1,2,2],[2,2,2],[2,3,2],[2,4,2]],columns=["a","b","c"])
df = df.set_index("a")
df.groupby("a").mean()  # per-group means
df.groupby("a").std()   # per-group standard deviations

I want to standardize the DataFrame within each key, NOT standardize each whole column.

So for the example above, the output would be:

a = 1: 
  Column: b
  (2 - 2) / 0.0
  (2 - 2) / 0.0
  (2 - 2) / 0.0
  Column: c
  (3 - 2) / 1.0
  (1 - 2) / 1.0
  (2 - 2) / 1.0

And then I would get each value standardized within its group, as in the rough pandas sketch below.
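
For reference, in pandas I would get what I mean with something like this (a rough sketch, using groupby().transform and the sample standard deviation):

# standardize b and c within each group of the index "a", not over the whole column
standardized = df.groupby("a").transform(lambda x: (x - x.mean()) / x.std())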

How can I do that in Spark?

Thanks

  • 2
    if only that screenshot would have been code to copy and paste to try it out - wont type it from theimage – Patrick Artner Nov 12 '17 at 15:48
  • 1
    Edited - I added the code but it doesn't really matter, it's only easier to understand the problem.. – Lee Nov 12 '17 at 15:54
  • 1
    now somebody can use the code to quickly paste&copy to pyfiddle.io or smth to get you a solution :) +1 You helped the SO community to help you. – Patrick Artner Nov 12 '17 at 15:57

1 Answer


First, convert the pandas DataFrame to a Spark DataFrame:

sdf = spark.createDataFrame(df)

Imports, plus a helper that computes the z-score of a column over a window:

from pyspark.sql.functions import *
from pyspark.sql.window import Window

def z_score(c, w):
    # z-score of column c using the mean and standard deviation computed over window w
    return (col(c) - mean(c).over(w)) / stddev(c).over(w)

Window partitioned by the grouping key a:

w = Window.partitionBy("a")

Solution:

sdf.select("a", z_score("b", w).alias("b"), z_score("c", w).alias("c")).show()
+---+----+----+
|  a|   b|   c|
+---+----+----+
|  1|null| 1.0|
|  1|null|-1.0|
|  1|null| 0.0|
|  2|-1.0|null|
|  2| 0.0|null|
|  2| 1.0|null|
+---+----+----+
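
The nulls appear wherever a group's standard deviation is zero (all values in the group are equal), so the division is undefined. If you would rather get 0.0 in those spots, one possible variant (the helper name z_score_or_zero is just for illustration) is to wrap the expression in coalesce:

from pyspark.sql.functions import coalesce, lit

def z_score_or_zero(c, w):
    # fall back to 0.0 when the z-score is null (zero or undefined standard deviation)
    return coalesce((col(c) - mean(c).over(w)) / stddev(c).over(w), lit(0.0))

sdf.select("a", z_score_or_zero("b", w).alias("b"), z_score_or_zero("c", w).alias("c")).show()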