Looking for some help with Python networkx.
I have a dataset of about 20k shared mailboxes and 60k email ids. One email id can be in multiple mailboxes. I built a network graph that links email ids through the mailboxes they share, and took its connected components as clusters. For the most part I got clusters with fewer than 100 email ids. However, I end up with one big cluster of 20k+ mailboxes. I now need to break up this big cluster into smaller pieces by deleting the least number of edges. What would be a good way to identify those edges using networkx?
Below is the code I am currently using to create the network graph:
import networkx as nx
import pandas as pd

# Read from Excel; the sheet has two columns: 'Shared_MailBox_Name', 'email_id'
xls = pd.ExcelFile(input_file_shared_mailbox)
df = pd.read_excel(xls, sheet_name=sheet_name_shared_mailbox)

# Create the network graph: one edge per (mailbox, email id) row
g = nx.Graph()
g.add_edges_from(df.itertuples(index=False))
connected_components = nx.connected_components(g)

# Map every node to the id of its connected component
node2id = {}
for cid, component in enumerate(connected_components):
    for node in component:
        node2id[node] = cid

# Assign the cluster number to each row
df['Ring#'] = df['Shared_MailBox_Name'].map(node2id)
To give an example: if the data looks like the below, I would like to identify A, B, and C (and not so much D, E, and F), so that I can remove A, B, and C from the data set and break the big cluster into the maximum number of pieces.
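For illustration, here is a minimal sketch of one possible way to look for such nodes and edges, assuming the big cluster is held together by a few "connector" nodes. It uses `nx.articulation_points` (nodes whose removal disconnects the graph) and `nx.bridges` (single edges whose removal disconnects it). The toy edge list and the names `mb1`, `e3`, `pieces_after_removal`, and so on are hypothetical; they are not the example data mentioned above and not the real dataset.

```python
import networkx as nx

# Hypothetical toy data: (mailbox, email id) pairs.
# Three tight groups of mailboxes are joined only through email id "e3".
edges = [
    ("mb1", "e1"), ("mb1", "e2"), ("mb2", "e1"), ("mb2", "e2"), ("mb2", "e3"),
    ("mb3", "e3"), ("mb3", "e4"), ("mb3", "e5"), ("mb4", "e4"), ("mb4", "e5"),
    ("mb5", "e3"), ("mb5", "e6"), ("mb5", "e7"), ("mb6", "e6"), ("mb6", "e7"),
]
g = nx.Graph()
g.add_edges_from(edges)

# Work only on the largest connected component (the "big cluster").
largest = max(nx.connected_components(g), key=len)
big = g.subgraph(largest).copy()

# Nodes whose removal splits the big cluster into two or more pieces.
cut_nodes = set(nx.articulation_points(big))


def pieces_after_removal(graph, node):
    """Number of connected components left after removing one node."""
    h = graph.copy()
    h.remove_node(node)
    return nx.number_connected_components(h)


# Score each candidate by how many pieces its removal produces.
pieces = {node: pieces_after_removal(big, node) for node in cut_nodes}
best = max(pieces, key=pieces.get)

# Single edges whose removal disconnects the big cluster.
# frozenset makes the comparison independent of edge orientation.
bridge_set = {frozenset(edge) for edge in nx.bridges(big)}

print("articulation points:", sorted(cut_nodes))
print("pieces after removal:", pieces)
print("best single node to remove:", best)
print("bridges:", sorted(tuple(sorted(b)) for b in bridge_set))
```

On real data the big cluster may have no articulation points or bridges at all, in which case this sketch returns nothing useful. Other options that might be worth trying are `nx.minimum_edge_cut`, or a community detection method such as `nx.community.girvan_newman`, though these can be slow on a graph of this size.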