It seems that pyspark ML doesn't have a built-in permutation feature importance method, so I want to code it up myself. To do so I need to individually shuffle each column in the dataframe. I found this resource as a way to do this, but it looks very computationally heavy for a large dataframe. Is there a better way?
For example, below is how I could shuffle just the column `a` in a simple PySpark dataframe `df`. I would then calculate model performance on `df` with `a` shuffled. Next I would do the same thing to shuffle `b` and calculate model performance again, and so on. Is there a better way to do this?
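For context, the overall loop I'm describing is easy to express in plain pandas/NumPy. Here is a minimal sketch, where `score` stands in for any model-performance function (a hypothetical placeholder, not part of my actual pipeline):

```python
import numpy as np
import pandas as pd

def permutation_importance(df, score, columns, seed=83):
    """Permutation importance: drop in performance when one column is shuffled.

    `score` is any callable mapping a dataframe to a performance metric
    (hypothetical here); higher return values mean the column mattered more.
    """
    rng = np.random.default_rng(seed)
    baseline = score(df)
    importances = {}
    for col in columns:
        shuffled = df.copy()
        # Shuffle only this column; every other column keeps its row order
        shuffled[col] = rng.permutation(shuffled[col].to_numpy())
        importances[col] = baseline - score(shuffled)
    return importances
```

The question is how to do the per-column shuffle step efficiently in PySpark rather than pandas.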
import pandas as pd
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number, lit, rand

spark = SparkSession.builder.getOrCreate()

# Create Pandas DF
df = pd.DataFrame({
    'a': [1, 5, 4, 3, 5, 7],
    'b': ['a', 'b', 'a', 'c', 'd', 'b'],
    'c': [400, 200, 150, 300, 174, 225]
})

# Convert to PySpark
df = spark.createDataFrame(df)

# Create 'index' column to join on (arbitrary but consistent ordering)
window = Window().orderBy(lit('A'))
df = df.withColumn('index', row_number().over(window))

# Shuffle just column 'a' in a new dataframe and add 'index'
df_a = df.select('a').withColumn('rand', rand(seed=83)).orderBy('rand')\
    .drop('rand')\
    .withColumnRenamed('a', 'a2')\
    .withColumn('index', row_number().over(window))

# Replace 'a' in df with the shuffled 'a' from df_a
# (note: .show() returns None, so it must not be part of the assignment)
df = df.join(df_a, on=['index']).drop('a').withColumnRenamed('a2', 'a')
df.show()
+-----+---+---+---+
|index| b| c| a|
+-----+---+---+---+
| 1| a|400| 5|
| 2| b|200| 1|
| 3| d|174| 5|
| 4| c|300| 3|
| 5| b|225| 4|
| 6| a|150| 7|
+-----+---+---+---+
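For comparison, the same single-column shuffle is a one-liner in pandas, which is part of why the two-window join approach above feels heavy. This is just a local sanity check on the same toy data, not a solution for a large distributed dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 5, 4, 3, 5, 7],
    'b': ['a', 'b', 'a', 'c', 'd', 'b'],
    'c': [400, 200, 150, 300, 174, 225]
})

# Shuffle only column 'a'; 'b' and 'c' keep their original row order
rng = np.random.default_rng(83)
df['a'] = rng.permutation(df['a'].to_numpy())
```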