It seems that pyspark ML doesn't have a built-in permutation feature importance method, so I want to code it up myself. To do so I need to individually shuffle each column in the dataframe. I found this resource as a way to do this, but it looks very computationally heavy for a large dataframe. Is there a better way?
For example, below is how I could shuffle just the column `a` in a simple PySpark dataframe `df`. I would then calculate model performance on `df` with `a` shuffled. Next I would do the same thing to shuffle `b` and calculate model performance again, and so on. Is there a better way to do this?
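For context, the overall loop I'm describing is easy to express in plain pandas/NumPy. Here is a minimal sketch, where `score` stands in for any model-performance function (a hypothetical placeholder, not part of my actual pipeline):

```python
import numpy as np
import pandas as pd

def permutation_importance(df, score, columns, seed=83):
    """Permutation importance: drop in performance when one column is shuffled.

    `score` is any callable mapping a dataframe to a performance metric
    (hypothetical here); higher return values mean the column mattered more.
    """
    rng = np.random.default_rng(seed)
    baseline = score(df)
    importances = {}
    for col in columns:
        shuffled = df.copy()
        # Shuffle only this column; every other column keeps its row order
        shuffled[col] = rng.permutation(shuffled[col].to_numpy())
        importances[col] = baseline - score(shuffled)
    return importances
```

The question is how to do the per-column shuffle step efficiently in PySpark rather than pandas.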
import pandas as pd
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number, lit, rand

spark = SparkSession.builder.getOrCreate()

# Create Pandas DF
df = pd.DataFrame({
    'a': [1, 5, 4, 3, 5, 7],
    'b': ['a', 'b', 'a', 'c', 'd', 'b'],
    'c': [400, 200, 150, 300, 174, 225]
})

# Convert to PySpark
df = spark.createDataFrame(df)

# Create 'index' column to join on (arbitrary but consistent ordering)
window = Window().orderBy(lit('A'))
df = df.withColumn('index', row_number().over(window))

# Shuffle just column 'a' in a new dataframe and add 'index'
df_a = df.select('a').withColumn('rand', rand(seed=83)).orderBy('rand')\
    .drop('rand')\
    .withColumnRenamed('a', 'a2')\
    .withColumn('index', row_number().over(window))

# Replace 'a' in df with the shuffled 'a' from df_a
# (note: .show() returns None, so it must not be part of the assignment)
df = df.join(df_a, on=['index']).drop('a').withColumnRenamed('a2', 'a')
df.show()
+-----+---+---+---+
|index| b| c| a|
+-----+---+---+---+
| 1| a|400| 5|
| 2| b|200| 1|
| 3| d|174| 5|
| 4| c|300| 3|
| 5| b|225| 4|
| 6| a|150| 7|
+-----+---+---+---+
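For comparison, the same single-column shuffle is a one-liner in pandas, which is part of why the two-window join approach above feels heavy. This is just a local sanity check on the same toy data, not a solution for a large distributed dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 5, 4, 3, 5, 7],
    'b': ['a', 'b', 'a', 'c', 'd', 'b'],
    'c': [400, 200, 150, 300, 174, 225]
})

# Shuffle only column 'a'; 'b' and 'c' keep their original row order
rng = np.random.default_rng(83)
df['a'] = rng.permutation(df['a'].to_numpy())
```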