
I have a pandas dataframe of the form df,

Batch_ID    Product_ID
   1            A
   1            B
   1            C
   2            B
   2            B
   2            C
   2            C
   3            B
   3            B
   3            C
   4            C
   4            D
   5            D

I would like to get an edge list from this: a new dataframe edge_list_df (which I can then convert to a networkx object) of the form,

Source       Target         Weight
  A             B             1.0
  A             C             1.0
  A             D             0.0
  B             C             3.0
  B             D             0.0
  C             D             1.0

Note that the example covers several cases to make the question clear. For instance, the B-C weight increases by only one for Batch_ID=2, even though B and C each occur twice in that batch.

What is the most efficient way to achieve this?

Melsauce

2 Answers


Here's my take on it:

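import pandas as pd
from itertools import combinations

def combine(batch):
    """Combine all distinct products within one batch into sorted pairs"""
    # set() counts each product once per batch;
    # sorted() keeps (B, C) and (C, B) from being counted as different edges
    return pd.Series(list(combinations(sorted(set(batch)), 2)))

edges = df.groupby('Batch_ID')['Product_ID'].apply(combine).value_counts()
edges
#(B, C)    3
#(A, B)    1
#(A, C)    1
#(C, D)    1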

I understand that 0-occurrence edges are not really needed.
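If the zero-weight rows from the question are wanted after all, one possible sketch is to count the pairs per batch and then look up every possible product pair, defaulting to 0. Note that this swaps `value_counts()` for a plain `collections.Counter`, and that the sample frame is retyped from the question; treat it as an illustration under those assumptions rather than the approach above.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Sample data retyped from the question.
df = pd.DataFrame({
    "Batch_ID":   [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5],
    "Product_ID": list("ABCBBCCBBCCDD"),
})

# Count each unordered product pair at most once per batch.
counts = Counter()
for _, products in df.groupby("Batch_ID")["Product_ID"]:
    counts.update(combinations(sorted(set(products)), 2))

# Enumerate every possible pair so that missing pairs get weight 0.
all_pairs = combinations(sorted(df["Product_ID"].unique()), 2)
edge_list_df = pd.DataFrame(
    [(s, t, float(counts[(s, t)])) for s, t in all_pairs],
    columns=["Source", "Target", "Weight"],
)
print(edge_list_df)
```

A `Counter` returns 0 for keys it has never seen, which is what fills in the A-D and B-D rows.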

You can further split the index into the source and the target, if you want:

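# Note: relies on reset_index() naming the pair column 'index',
# as pandas did before 2.0
edges = edges.reset_index()
# Expand each (source, target) tuple into two columns
edges = pd.concat([edges, edges['index'].apply(pd.Series)], axis=1)
edges.drop(['index'], axis=1, inplace=True)
edges.columns = 'Weight','Source','Target'
#       Weight Source Target
#0       3      B      C
#1       1      A      B
#2       1      A      C
#3       1      C      D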

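Or, starting again from the value_counts() result:

c = ['Source', 'Target']
# The index holds the (source, target) tuples; the values are the counts
L = edges.index.values.tolist()
edges = pd.DataFrame(L, columns=c).join(edges.rename('Weight').reset_index(drop=True))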
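Since the question mentions converting the result to a networkx object, here is a minimal sketch of that last step. It assumes NetworkX 2.1 or later (for `from_pandas_edgelist`) and uses a hand-typed `edge_list_df` with the values from the example, so the frame itself is illustrative rather than computed.

```python
import networkx as nx
import pandas as pd

# Edge table in the Source/Target/Weight layout (values retyped from the example).
edge_list_df = pd.DataFrame({
    "Source": ["B", "A", "A", "C"],
    "Target": ["C", "B", "C", "D"],
    "Weight": [3, 1, 1, 1],
})

# Build an undirected graph; each row becomes one edge carrying a "Weight" attribute.
G = nx.from_pandas_edgelist(
    edge_list_df, source="Source", target="Target", edge_attr="Weight"
)
print(G["B"]["C"]["Weight"])
```

The edge attribute keeps the column name, so it is stored as `"Weight"` here; NetworkX algorithms that look for `"weight"` by default would need `weight="Weight"` passed explicitly, or the column renamed first.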
jezrael
DYZ
  • @jezrael Looks good to me. I knew the last block of code was a bit lousy. – DYZ Feb 11 '18 at 21:02
  • Yes, the slowest part is `.apply(pd.Series)`; check the [timings](https://stackoverflow.com/a/35491399/2901002) at the end of that answer ;) – jezrael Feb 11 '18 at 21:04

Using the NetworkX API (assuming `import networkx as nx`):

In [225]: G = nx.from_pandas_edgelist(df, 'Batch_ID', 'Product_ID')

In [226]: from networkx.algorithms import bipartite

In [227]: W = bipartite.weighted_projected_graph(G, df['Product_ID'].unique())

In [228]: W.edges(data=True)
Out[228]: EdgeDataView([('A', 'C', {'weight': 1}), ('A', 'B', {'weight': 1}), ('B', 'C', {'weight': 3}), ('C', 'D', {'weight': 1})])

In [229]: nx.to_pandas_edgelist(W)
Out[229]:
  source target  weight
0      A      C       1
1      A      B       1
2      B      C       3
3      C      D       1

NOTE: for NetworkX version 1.x, use from_pandas_dataframe() and to_pandas_dataframe() instead of from_pandas_edgelist() and to_pandas_edgelist().
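For anyone who wants to run the session above as a script, here is a sketch assuming NetworkX 2.1+ and the question's sample data (retyped here). It also shows why the repeated rows do not inflate the weights: the bipartite graph stores each batch-product edge only once, and the projection weight is the number of batches two products share.

```python
import networkx as nx
import pandas as pd
from networkx.algorithms import bipartite

# Sample data retyped from the question.
df = pd.DataFrame({
    "Batch_ID":   [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5],
    "Product_ID": list("ABCBBCCBBCCDD"),
})

# Bipartite graph: batches on one side, products on the other.
# Duplicate rows collapse into a single edge in an undirected Graph.
G = nx.from_pandas_edgelist(df, "Batch_ID", "Product_ID")

# Project onto the product nodes; weight = number of shared batches.
W = bipartite.weighted_projected_graph(G, df["Product_ID"].unique())

# Key by frozenset because edge orientation in an undirected graph is arbitrary.
weights = {frozenset((u, v)): d["weight"] for u, v, d in W.edges(data=True)}
print(weights)
```

Pairs that never share a batch (A-D and B-D) simply have no edge in the projection.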

MaxU - stand with Ukraine