
I have a function that takes a DataFrame and finds the column with the largest sum of normalized mutual information scores against all other columns. However, I'm running into performance issues, and I feel my solution could be improved by reducing the time spent iterating over the DataFrame, ideally with something designed for looping over pandas DataFrames. Any ideas what I can do? I've read about using iteritems(), but it has since been deprecated for Python 3.x.

from pandas import DataFrame
from sklearn.metrics.cluster import normalized_mutual_info_score as nmis

def calculate_new_mode(x: DataFrame):

    max_sum = 0
    cluster_mode, ix = None, None  # best column and its position so far

    for location, i in enumerate(x):  # iterating a DataFrame yields column labels

        # Sum of NMI scores between column i and every other column
        sum_rii = 0
        for j in x:
            if i != j:
                sum_rii += nmis(x[i], x[j], average_method='arithmetic')

        if sum_rii > max_sum:
            max_sum, cluster_mode, ix = sum_rii, x[i], location

    return x # later modified by pushing cluster_mode to first column


mandosoft
  • It seems `nmis` is symmetric so that `nmis(a, b...) == nmis(b, a...)`. So one optimization would be to cache the results rather than doing it twice for each pair of columns. (Also not sure why you return `x` which is not modified but don't return `cluster_mode, ix` - are they globals?) – Stuart Mar 03 '20 at 02:44
  • You may be able to use `itertuples` as described in [this](https://stackoverflow.com/a/41022840/10682164) answer. – totalhack Mar 03 '20 at 02:51
  • The variables at the bottom get used later in the function to push the highest cluster_mode to the first column in the dataframe. They are not globals. – mandosoft Mar 03 '20 at 02:54
  • Are you sure `iteritems` is deprecated? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iteritems.html. You also use the similar (possibly identical?) `items`. However neither will bring a large performance gain. `itertuples` doesn't seem relevant either as you aren't iterating over rows. – Stuart Mar 03 '20 at 03:10
  • Looks like iteritems works. Must have gotten it confused with dict.iteritems as a separate library. Also found [this](https://stackoverflow.com/a/55557758/10365747) answer with good info on pandas optimizations. – mandosoft Mar 03 '20 at 03:17
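
The symmetry point from Stuart's comment could be sketched roughly as follows. This is only an illustration, not the asker's final code: `best_column` and its `score` parameter are hypothetical names, and the sketch assumes the score is symmetric and that column labels are unique. It is written against anything that yields labels when iterated and supports `columns[label]`, which is how a pandas DataFrame behaves, so a plain dict of lists works too.

```python
from itertools import combinations


def best_column(columns, score):
    """Return (label, position, total) for the column whose summed
    pairwise score against all other columns is largest.

    Assumptions (not from the original post): `score` is symmetric,
    i.e. score(a, b) == score(b, a), and column labels are unique.
    """
    labels = list(columns)  # iterating a DataFrame or dict yields labels
    if not labels:
        return None, None, 0.0

    totals = {label: 0.0 for label in labels}

    # Visit each unordered pair once and credit the score to both columns,
    # instead of computing it separately for (a, b) and (b, a).
    for a, b in combinations(labels, 2):
        s = score(columns[a], columns[b])
        totals[a] += s
        totals[b] += s

    # max() returns the first label with the largest total, which matches
    # the strict ">" comparison in the original loop.
    best = max(labels, key=totals.get)
    return best, labels.index(best), totals[best]
```

With the DataFrame from the question, the call would look like `best_column(x, lambda a, b: nmis(a, b, average_method='arithmetic'))`. For n columns this makes n(n-1)/2 calls to `nmis` rather than n(n-1), so the scoring work is roughly halved; the column iteration itself is unlikely to be the bottleneck.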
