I am using networkx to build an email network from a txt file in which each row represents an "edge." I first loaded the txt file (3 columns: {'#Sender', 'Recipient', 'time'}) into Python and then converted it to a networkx object using the following code:

import networkx as nx
import pandas as pd
email_df = pd.read_csv('email_network.txt', delimiter = '->')
email = nx.from_pandas_dataframe(email_df, '#Sender', 'Recipient', edge_attr = 'time')

The email.txt data can be accessed here.

However, email_df (a pandas DataFrame object) has a length of 82927, while email (a Networkx object) has a length of 3251.

In [1]: len(email_df)
Out[1]: 82927
In [2]: len(email.edges())
Out[2]: 3251

I am confused because, even for rows that contain the same two nodes in the first two columns of email_df in the same direction (say, '1' to '2'), the third column ('time', a timestamp) should distinguish them from each other, so no duplicate edges should appear. Why, then, did the number of edges decrease dramatically from 82927 to 3251 after I used nx.from_pandas_dataframe to read from `email_df`?

Would anyone help explain this to me?

Thank you.

Chris T.
  • Code windows are meant for JS code only. For all other languages, please paste your code, highlight it and hit ctrl+k to format. – cs95 Sep 19 '17 at 20:37

1 Answer

The line below takes the #Sender column as the source node and the Recipient column as the target node, and adds time as an edge attribute. The default graph type holds at most one edge per pair of nodes (and it is undirected unless you pass a directed graph via create_using), so repeated rows for the same pair collapse into a single edge, and only the time of the last such row is kept as that edge's attribute.

email = nx.from_pandas_dataframe(email_df, '#Sender', 'Recipient', edge_attr = 'time')
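The collapse can be reproduced on a small scale. The sketch below uses a made-up five-row DataFrame (`toy_df` is hypothetical, not the real email data) and `from_pandas_edgelist`, which is the name newer networkx releases use for `from_pandas_dataframe`. It assumes the default graph type, so rows for the same pair of nodes overwrite each other.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy data: three emails 1 -> 2, one email 2 -> 1, one email 1 -> 3.
toy_df = pd.DataFrame({
    "#Sender":   [1, 1, 1, 2, 1],
    "Recipient": [2, 2, 2, 1, 3],
    "time":      [10, 20, 30, 40, 50],
})

# Default graph type: an undirected nx.Graph with one edge per node pair.
g = nx.from_pandas_edgelist(toy_df, "#Sender", "Recipient", edge_attr="time")

n_rows = len(toy_df)              # five rows in the DataFrame
n_edges = g.number_of_edges()     # only the pairs {1, 2} and {1, 3} survive
time_1_2 = int(g[1][2]["time"])   # attribute of the last row touching {1, 2}

print(n_rows, n_edges, time_1_2)
```

Five rows become two edges, and the edge between 1 and 2 carries only the timestamp of the last row that mentions that pair, in either direction.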

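A plain Graph or DiGraph can hold only one edge per pair of nodes. You could group the DataFrame before constructing your network and use the row count as the edge weight. This still yields one edge per pair, so the edge count stays the same, but the number of emails is preserved in the weight: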

edge_groups = email_df.groupby(["#Sender", "Recipient"], as_index=False).count().rename(columns={"time":"weight"})
email = nx.from_pandas_dataframe(edge_groups, '#Sender', 'Recipient', edge_attr = 'weight')
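If every row needs to stay a separate edge, one possible alternative (not what the code above does) is a multigraph, which allows parallel edges and assigns integer keys automatically, so no key has to be supplied per row. A minimal sketch, again on hypothetical toy data and assuming a networkx release that provides `from_pandas_edgelist` with `create_using`:

```python
import networkx as nx
import pandas as pd

# Hypothetical toy data: three emails 1 -> 2, one email 2 -> 1, one email 1 -> 3.
toy_df = pd.DataFrame({
    "#Sender":   [1, 1, 1, 2, 1],
    "Recipient": [2, 2, 2, 1, 3],
    "time":      [10, 20, 30, 40, 50],
})

# MultiDiGraph keeps direction and allows parallel edges between the same nodes.
g_multi = nx.from_pandas_edgelist(
    toy_df, "#Sender", "Recipient",
    edge_attr="time",
    create_using=nx.MultiDiGraph(),
)

# Each row is now its own edge; keys (0, 1, 2, ...) are assigned automatically.
times_1_to_2 = [int(d["time"]) for d in g_multi[1][2].values()]

print(g_multi.number_of_edges(), times_1_to_2)
```

The trade-off is that some algorithms in networkx are not defined for multigraphs, so the grouped, weighted graph may still be the more convenient form for analysis.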
Ken Syme
  • Thanks for your reply. Your explanation of my code is correct, but I did not intend to include only the time of the last row as 'time' attribute between a given pair of Sender and Recipient as attribute. How can I recover those missing "edges"? – Chris T. Sep 19 '17 at 20:45
  • You can only have a single edge defined between a pair of nodes - I have added an example to my answer of one way you could incorporate the missing data. – Ken Syme Sep 19 '17 at 20:51
  • Hi, I tried your code, but Python still returned a length of 3251. – Chris T. Sep 19 '17 at 20:57
  • From this [post](https://stackoverflow.com/questions/9469515/changing-edge-attributes-in-networkx-multigraph), it seems that one needs to add a `key` to uniquely identify an edge between the same pair of nodes, but is there an alternative way to do this, especially for a large dataset? (I won't be able to add a unique key for each of the 82927 rows) – Chris T. Sep 19 '17 at 21:16