I have a function that takes a DataFrame and finds the column with the largest sum of normalized mutual information scores against all other columns. However, I'm running into performance issues, and I feel my solution could be improved by reducing the time it takes to iterate over the DataFrame, ideally using something specific to looping over pandas DataFrames. Any ideas what I can do? I've read about using iteritems(), but it has since been deprecated (pandas recommends items() instead).
from pandas import DataFrame
from sklearn.metrics.cluster import normalized_mutual_info_score as nmis

def calculate_new_mode(x: DataFrame):
    max_sum = 0
    # iterating over a DataFrame yields its column labels
    for location, i in enumerate(x):
        sum_rii = 0
        for j in x:
            if i != j:
                sum_rii += nmis(x[i], x[j], average_method='arithmetic')
        if sum_rii > max_sum:
            max_sum, cluster_mode, ix = sum_rii, x[i], location
    return x  # later modified by pushing cluster_mode to first column
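One idea I've been considering, in case it helps frame answers: since NMI is symmetric (nmis(a, b) == nmis(b, a)), each pair only needs to be scored once rather than twice, which halves the number of calls. A rough sketch of that approach (the function name best_column and the score parameter are just illustrative, not from my actual code):

```python
from itertools import combinations

import pandas as pd


def best_column(df: pd.DataFrame, score):
    """Return (column_label, total) for the column whose summed pairwise
    score against every other column is largest.

    Assumes score(a, b) == score(b, a), so each unordered pair is
    computed once and credited to both columns.
    """
    totals = dict.fromkeys(df.columns, 0.0)
    for a, b in combinations(df.columns, 2):
        s = score(df[a], df[b])
        totals[a] += s
        totals[b] += s
    best = max(totals, key=totals.get)
    return best, totals[best]
```

With sklearn this would be called as best_column(df, lambda a, b: nmis(a, b, average_method='arithmetic')). I'm not sure this is the most pandas-idiomatic way to do it, though, so I'd still appreciate other suggestions.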