I have a dataframe (main_df) with a column of text and a column of keywords.
>>> main_df.head(3)
+-------+-----------------------------------------+---------------------------------------+
| Index | Text | Keywords |
+-------+-----------------------------------------+---------------------------------------+
| 1 | "Here is some text" | ["here","text"] |
| 2 | "Some red birds and blue elephants" | ["red", "bird", "blue", "elephant"] |
| 3 | "Please help me with my pandas problem" | ["help", "pandas", "problem"] |
+-------+-----------------------------------------+---------------------------------------+
I use itertools.combinations to make a dataframe with every possible pair of keywords.
>>> edge_df.head(3)
+-------+--------+--------+
| Index | Src | Dst |
+-------+--------+--------+
| 1 | "here" | "text" |
| 2 | "here" | "red" |
| 3 | "here" | "bird" |
+-------+--------+--------+
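To illustrate the pair-generation step, here is a minimal sketch using a small made-up subset of the unique keywords:

```python
from itertools import combinations

# Illustrative subset of unique keywords
unique_words = ["here", "text", "red"]

# combinations(..., 2) yields every unordered pair exactly once
pairs = list(combinations(unique_words, 2))
print(pairs)  # [('here', 'text'), ('here', 'red'), ('text', 'red')]
```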
I then apply a function that goes through each keyword pair and assigns a value to edge_df['weight']: the number of times that pair appears in the same piece of text (i.e. the same keyword list).
>>> edge_df.head(3)
+-------+--------+--------+--------+
| Index | Src | Dst | Weight |
+-------+--------+--------+--------+
| 1 | "here" | "text" | 1 |
| 2 | "here" | "red" | 3 |
| 3 | "here" | "bird" | 8 |
+-------+--------+--------+--------+
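To make the intended weight definition concrete, here is a small self-contained sketch (the data is illustrative) that counts pair co-occurrences in a single pass over the keyword lists:

```python
from itertools import combinations
from collections import Counter

# Illustrative keyword lists, one per row of text
keyword_lists = [
    ["here", "text"],
    ["red", "bird", "blue", "elephant"],
    ["help", "pandas", "problem"],
]

# Count how often each (sorted) keyword pair occurs in the same list
pair_counts = Counter()
for kws in keyword_lists:
    for a, b in combinations(sorted(set(kws)), 2):
        pair_counts[(a, b)] += 1

print(pair_counts[("here", "text")])  # 1
```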
My problem is that this code is very slow (about an hour for 300 rows of short text). Below is the code I use to build edge_df and apply the function. Is there anything I can do to speed this up?
from itertools import combinations

import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # registers DataFrame.progress_apply

def indexes_by_word(word1, word2):
    """Count the texts whose keyword lists contain both words."""
    indx1 = set(main_df[main_df['Keywords'].apply(lambda lst: word1 in lst)].index)
    indx2 = set(main_df[main_df['Keywords'].apply(lambda lst: word2 in lst)].index)
    return len(indx1.intersection(indx2))

# Make a list of all unique words
unique_words = main_df['Keywords'].apply(pd.Series).stack().reset_index(drop=True).unique()

# Make an empty edgelist dataframe of our words
edges = pd.DataFrame(data=list(combinations(unique_words, 2)),
                     columns=['src', 'dst'])

# Weight each edge by the number of shared texts (this is the slow part)
edges['weight'] = edges.progress_apply(lambda x: indexes_by_word(x['src'], x['dst']), axis=1)
edges.head()