
I have a dataframe with LDA topic distribution outputs along with other demographic information as below:

import pandas as pd

single_df = pd.DataFrame([{"department": 'marketing', 'LDA_1': 0.252, 'LDA_2': 0.002, 'LDA_3': 0.50},
                          {"department": 'engineering', 'LDA_1': 0.478, 'LDA_2': 0.152, 'LDA_3': 0.492},
                          {"department": 'cooperate', 'LDA_1': 0.52, 'LDA_2': 0.780, 'LDA_3': 0.50},
                          {"department": "marketing", 'LDA_1': 0.352, 'LDA_2': 0.052, 'LDA_3': 0.20}])


I would like to get to the final dataframe below. How do I write a function that calculates the Jensen-Shannon distance between every pair of rows (using the columns whose names contain "LDA_") and returns the following data frame?

i j same_department distance_LDA
0 1          0        0.23
0 2          0        0.43
0 3          1        0.26
1 2          0        0.24
1 3          0        0.11
2 3          0        0.29

I've written code to calculate the JS distance between an individual pair of rows, as below. How do I turn it into a function?

from scipy.spatial import distance

array = single_df.filter(regex='LDA').to_numpy()
distance.jensenshannon(array[0], array[1])
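As a starting point, one way to wrap that snippet into a reusable function might look like the sketch below (the name `lda_distance` is my own choice, not from the question; the demo frame reuses two rows of `single_df`):

```python
import pandas as pd
from scipy.spatial import distance

single_df = pd.DataFrame([
    {"department": 'marketing',   'LDA_1': 0.252, 'LDA_2': 0.002, 'LDA_3': 0.50},
    {"department": 'engineering', 'LDA_1': 0.478, 'LDA_2': 0.152, 'LDA_3': 0.492},
])

def lda_distance(df, i, j):
    """Jensen-Shannon distance between the LDA_* columns of rows i and j."""
    arr = df.filter(regex='LDA').to_numpy()
    return distance.jensenshannon(arr[i], arr[j])

print(lda_distance(single_df, 0, 1))
```

`scipy.spatial.distance.jensenshannon` normalizes each vector to sum to 1 before computing the distance, so the rows don't need to be proper probability distributions beforehand.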

Then, to check whether two people share the same department, I have the code below:

def same_department(i,j):
    if i['department'] == j['department']:
        return 1
    else:
        return 0   
Alicia_2024

1 Answer


Let's generate all possible row combinations, merge them so that each pair of rows to compare sits in a single row, and then apply the jensenshannon function row-wise using the _x/_y column suffixes from the merge:

from itertools import combinations
from scipy.spatial.distance import jensenshannon
import pandas as pd

single_df = pd.DataFrame([{"department": 'marketing', 'LDA_1': 0.252,
                           'LDA_2': 0.002, 'LDA_3': 0.50},
                          {"department": 'engineering', 'LDA_1': 0.478,
                           'LDA_2': 0.152, 'LDA_3': 0.492},
                          {"department": 'cooperate', 'LDA_1': 0.52,
                           'LDA_2': 0.780, 'LDA_3': 0.50},
                          {"department": "marketing", 'LDA_1': 0.352,
                           'LDA_2': 0.052, 'LDA_3': 0.20}])

# Merge the 3 LDA Columns Into A Single Column Containing a List
single_df['LDA'] = single_df.filter(regex='^LDA_.*').agg(list, axis=1)
# Get Rid Of The Original LDA_X columns
single_df = single_df.filter(regex='^(?!LDA_.*)')

# Get All Row Combinations
a, b = map(list, zip(*combinations(single_df.index, 2)))

# Merge Together
df = single_df.loc[a].reset_index().merge(
    single_df.loc[b].reset_index(),
    left_index=True,
    right_index=True,
)

# Apply jensenshannon to the LDA_x and LDA_y Lists
df['distance_LDA'] = df.apply(
    lambda x: jensenshannon(x['LDA_x'], x['LDA_y']), axis=1)

# Get If In Same Department
df['same_department'] = df['department_x'].eq(df['department_y']).astype(int)

# Rename and Filter Columns
df = df.rename(columns={'index_x': 'i', 'index_y': 'j'})[
    ['i', 'j', 'same_department', 'distance_LDA']]

# For Display
print(df.to_string(index=False))

Output:

i  j  same_department  distance_LDA
0  1                0      0.235849
0  2                0      0.429508
0  3                1      0.264777
1  2                0      0.238155
1  3                0      0.112456
2  3                0      0.299704
Henry Ecker
  • Thanks! Was wondering if there is a faster way to calculate J-S distance other than using the "apply()" function? My actual data frame has over 1M rows. Any suggestion would be appreciated! – Alicia_2024 May 04 '21 at 01:44
  • There may be some optimizations that can be made, but to significantly increase performance some major refactoring would need to be made. There may be some good ideas in [Performance of Pandas apply vs np.vectorize to create new column from existing columns](https://stackoverflow.com/q/52673285/15497888) or you might consider multiprocessing. – Henry Ecker May 04 '21 at 16:12
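Following up on the performance concern in the comments, here is a sketch of one fully vectorized alternative to `apply` (the function name `js_distance_pairwise` is my own, not from the answer). It computes the same natural-log Jensen-Shannon distance as `scipy.spatial.distance.jensenshannon`, but for all row pairs at once using NumPy indexing and `scipy.special.rel_entr`:

```python
import numpy as np
from scipy.special import rel_entr

def js_distance_pairwise(arr):
    """Jensen-Shannon distance for every row pair of `arr`, vectorized.

    Returns (i, j, dist) where i, j index the upper-triangular pairs (i < j).
    """
    # Normalize each row to a probability vector, as jensenshannon does
    p = arr / arr.sum(axis=1, keepdims=True)
    # Indices of all pairs i < j
    i, j = np.triu_indices(len(p), k=1)
    a, b = p[i], p[j]
    m = (a + b) / 2
    # JS divergence = mean of the two relative entropies against the midpoint
    div = (rel_entr(a, m).sum(axis=1) + rel_entr(b, m).sum(axis=1)) / 2
    return i, j, np.sqrt(div)
```

For 1M+ rows the number of pairs is quadratic, so this would still need to be run in chunks, but it avoids the per-row Python overhead of `apply`.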