1

Looking for some help on python networkx

i have a dataset of about 20k shared mailboxes and 60K email ids. 1 email id can be in multiple mailboxes. i ran network graph which basically linked all connected email ids (by mailboxes) to form clusters. for the most part i got clusters with <100 email ids. however, i end up with one big cluster of 20k+ mailboxes. i now need to break up this big cluster into smaller pieces by deleting the least number of edges. What would be a good way of identifying what those edges should be using networkx.

Below is the code i am currently using to create the network graph

    #read from excel with 2 columns 'Shared_MailBox_Name', 'email_id'
    xls = pd.ExcelFile(input_file_shared_mailbox)
    df = pd.read_excel(xls, sheet_name = sheet_name_shared_mailbox)

    #create network graph
    g = nx.Graph()
    g.add_edges_from(df.itertuples(index=False))
    connected_components = nx.connected_components(g)

    # Find the component id of the nodes
    node2id = {}
    for cid, component in enumerate(connected_components):
        for node in component:
            node2id[node] = cid

    df['Ring#'] = df['Shared_MailBox_Name'].map(node2id)  #Assign Cluster Number

To give some example; If the data looks like the below then i would like to know A, B, and C (and not so much D, E, and F) so i can remove A, B, C from the data set and break the big cluster into maximum number of pieces

enter image description here

A_K
  • 81
  • 2
  • 10
  • What is `df['Shared_MailBox_Name']`? Is it the same as `Mailbox_Name`? And it'd be better if you don't post data as an image, please see [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – m13op22 Apr 02 '20 at 18:14
  • 1
    Do you want to delete the edges or nodes? For the first, you can take a look at edge-betweenness-centrality (see [`edge_betweenness_centrality_subset`](https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.edge_betweenness_centrality_subset.html#networkx.algorithms.centrality.edge_betweenness_centrality_subset)). For the later, you should take a look at network dismantling. – Sparky05 Apr 02 '20 at 18:17
  • @HS-nebula updated the question to show the right column name. also keeping the image for now so i can color code what i am looking for. – A_K Apr 02 '20 at 18:25
  • @Sparky05 will take a look edge_betweeness_centrality_subset..... i did look at nx.betweenness_centrality(g) but did not produce anything helpful. I want to delete the emails which ( i think) are the edges in this case – A_K Apr 02 '20 at 18:25
  • 1
    Edges that when removed increase the number of components are called [bridges](https://en.wikipedia.org/wiki/Bridge_(graph_theory)). [Here](https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.bridges.bridges.html#networkx.algorithms.bridges.bridges) is the implementation in networkx. However, you might not have any bridges in your graph. In that case, you will want to look at [graph cuts](https://en.wikipedia.org/wiki/Cut_(graph_theory)), specifically the minimum cut or the sparsest cut. – Paul Brodersen Apr 06 '20 at 10:02
  • @PaulBrodersen this is helpful. any examples on the sparsest cut you can point me to in python? – A_K Apr 07 '20 at 12:17
  • Not an expert on cuts and Github pages are down for maintenance so I can't check anything. What's wrong with the minimum cuts? I do know for a fact that there are implementation for minimum cuts in networkx and igraph. – Paul Brodersen Apr 07 '20 at 13:37

0 Answers0