
My dataframe looks like this:

                date    id     pct_change
12355258    2010-07-28  60059   0.210210
12355265    2010-07-28  60060   0.592000
12355282    2010-07-29  60059   0.300273
12355307    2010-07-29  60060   0.481982
12355330    2010-07-28  60076   0.400729

I would like to write it with the columns 'target', 'source', 'weights', where 'target' and 'source' are both 'id's, and 'weights' counts how many days both the 'target' and 'source' changed price simultaneously. So it would look like:

target  source  weights
60059   60060   2
60059   60076   1   
60060   60076   1

My goal is to use this dataframe to make a networkx graph.
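
For reference, once a table in that target/source/weights shape exists, networkx can read it directly; a minimal sketch using the example values above (nx.from_pandas_edgelist is the standard entry point):

import networkx as nx
import pandas as pd

# the desired edge list from the example above
edges = pd.DataFrame({'target': [60059, 60059, 60060],
                      'source': [60060, 60076, 60076],
                      'weights': [2, 1, 1]})

# undirected graph with 'weights' stored as an edge attribute
G = nx.from_pandas_edgelist(edges, source='source', target='target', edge_attr='weights')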

I have tried using groupby

df.groupby(['date','id'])['id'].unique().value_counts()
df.groupby(['date','id'])['id'].count()

and for loops (which were terrible).

I feel like I am missing a small step in the groupby, but I can't tell what it is.

Thank you for your help.

python_enthusiast
  • 12,177,535 lines, 24,800 individual 'id's – python_enthusiast Aug 14 '20 at 14:04
  • The problem with so many individual id's is that if you want all the combinations of 2 ids, you get something like 300 million rows (see the quick calculation after these comments), and I have the feeling that pandas will have trouble performing an operation like this; even storing the result in "regular" computer memory is probably too much – Ben.T Aug 14 '20 at 14:49
  • I have an alternative variable in the dataset that has 26 categories instead of 24800. Would this be doable? – python_enthusiast Aug 14 '20 at 14:50
  • sorry to bother again... would you know a way to do it for the 'id's? With the 'category' variable the network was fully connected and I can't do any further analysis. – python_enthusiast Aug 14 '20 at 16:47
  • That makes sense: with only 26 categories instead of 25K ids, especially with so many dates, you get an all-connected network. Honestly, I'm not sure it is possible with so much data on a computer. Sorry – Ben.T Aug 14 '20 at 17:28
  • Thanks anyway! I was thinking it might be possible to do the same thing you did on pyspark, but not sure. – python_enthusiast Aug 14 '20 at 17:29
  • Maybe. I have done some before and I know pivot exists, but I'm not sure about the rest. – Ben.T Aug 14 '20 at 17:55
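
For scale, the pair count mentioned in the comments is easy to check; a quick back-of-the-envelope with Python's standard library (the 24,800 figure comes from the comments above):

from math import comb

# all unordered pairs of 24,800 ids
print(comb(24_800, 2))   # 307507600 -> roughly 300 million candidate rows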

4 Answers


The idea is to use a pivot_table first to get True if the id has a pct_change for that date:

#first pivot to get True if any value of id for a date
df_ = df.pivot_table(index='id', columns='date', values='pct_change', 
                     aggfunc=any, fill_value=False)
print(df_)
date  2010-07-28 2010-07-29
id                         
60059       True       True
60060       True       True
60076       True      False

Then you can use combinations from itertools to create all the possible pairs, use them to select the rows, and use the & operator to see where both have True on the same date, then sum along the columns to get the weights column. Assign this column to the DataFrame created from both combination lists.

# get all combinations of ids
from itertools import combinations
import pandas as pd

a, b = map(list, zip(*combinations(df_.index, 2)))

res = (pd.DataFrame({'target': a, 'source': b})
         .assign(weights=(df_.loc[a].to_numpy()
                          & df_.loc[b].to_numpy()
                         ).sum(axis=1))
      )
print(res)
   target  source  weights
0   60059   60060        2
1   60059   60076        1
2   60060   60076        1

Note: don't forget to replace index='id' in the pivot_table with the name of your categorical column, otherwise there is a good chance that your computer won't be able to handle the following operations and will crash.
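
For illustration, assuming that categorical column is named 'category' (substitute the real column name), the only change is in the pivot step:

# pivot on the low-cardinality column instead of 'id'
df_ = df.pivot_table(index='category', columns='date', values='pct_change',
                     aggfunc=any, fill_value=False)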

Ben.T

Try this:

import pandas as pd
import itertools as itt

# square id x id matrix that will hold, for each pair, the number of shared dates
ids = df.id.unique()
WeightDf = pd.DataFrame(index=ids, columns=ids)
WeightDf.loc[:, :] = 0

def weigh(ID):
    # dates on which this id had a price change
    IdDates = set(df.loc[df.id == ID].date.to_list())
    for i in ids:
        # count the dates this id shares with every other id
        WeightDf.at[ID, i] = len(set.intersection(set(df.loc[df.id == i].date.to_list()), IdDates))

pd.Series(ids).apply(weigh)
print(WeightDf)

# flatten the matrix into an edge list with one row per unique pair
result = pd.DataFrame(columns=['Id1', 'Id2', 'Weight'])
for i1, i2 in itt.combinations(ids, 2):
    result = pd.concat([result, pd.DataFrame(data=[{'Id1': i1, 'Id2': i2, 'Weight': WeightDf.loc[i1, i2]}])])

print(result)

Kuldip Chaudhari

I have seen a lot of variations of this use case - generating combinations.

import io
import itertools
import pandas as pd

df = pd.read_csv(io.StringIO("""                date    id     pct_change
12355258    2010-07-28  60059   0.210210
12355265    2010-07-28  60060   0.592000
12355282    2010-07-29  60059   0.300273
12355307    2010-07-29  60060   0.481982
12355330    2010-07-28  60076   0.400729"""), sep=r"\s+")

# generate combinations of two... edge case when a group has only one member
# tuple of itself to itself
dfx = (df.groupby('date').agg({"id": lambda s: list(itertools.combinations(list(s), 2))
                               if len(list(s))>1 else [tuple(list(s)*2)]})
    .explode("id")
     .groupby("id").agg({"id":"count"})
     .rename(columns={"id":"weights"})
     .reset_index()
     .assign(target=lambda dfa: dfa["id"].apply(lambda s: s[0]),
           source=lambda dfa: dfa["id"].apply(lambda s: s[1]))
     .drop(columns="id")
)

print(dfx.to_string(index=False))

output

 weights  target  source
       2   60059   60060
       1   60059   60076
       1   60060   60076
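
If the column order from the question matters, a quick reorder of dfx lines the output up with the requested target/source/weights layout (optional):

print(dfx[['target', 'source', 'weights']].to_string(index=False))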
Rob Raymond

This SO link ended up having a faster answer to my question, which works with a large number of id's. It was closer to the groupby + value_counts that I tried using before.

Here's the code, to make it easier for people in the future:

import pandas as pd
from itertools import combinations

def combine(batch):
    """Combine all products within one batch into pairs"""
    return pd.Series(list(combinations(set(batch), 2)))

# one row per (id, id) pair per date, then count how many dates each pair shares
edges = df.groupby('date')['id'].apply(combine).value_counts()

c = ['source', 'target']
L = edges.index.values.tolist()
edges = pd.DataFrame(L, columns=c).join(edges.reset_index(drop=True))
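
One caveat: depending on the pandas version, the count column joined in the last line keeps the name produced by value_counts ('id' on older releases, 'count' on pandas 2.x) rather than 'weights'. A small, version-agnostic rename on top of the code above:

# give the joined count column the 'weights' name used in the question,
# whatever pandas happened to call it
edges = edges.rename(columns={edges.columns[-1]: 'weights'})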

python_enthusiast