
My dataframe looks like this:

                date    id     pct_change
12355258    2010-07-28  60059   0.210210
12355265    2010-07-28  60060   0.592000
12355282    2010-07-29  60059   0.300273
12355307    2010-07-29  60060   0.481982
12355330    2010-07-28  60076   0.400729

I would like to write it with the columns 'target', 'source', 'weights', where 'target' and 'source' are both 'id's, and 'weights' counts how many days both the 'target' and 'source' changed price simultaneously. So it would look like:

target  source  weights
60059   60060   2
60059   60076   1   
60060   60076   1

My goal is to use this dataframe to make a networkx graph.
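
For reference, once a table in that target/source/weights shape exists, networkx can read it directly; a minimal sketch using the example values above (nx.from_pandas_edgelist is the standard entry point):

import networkx as nx
import pandas as pd

# the desired edge list from the example above
edges = pd.DataFrame({'target': [60059, 60059, 60060],
                      'source': [60060, 60076, 60076],
                      'weights': [2, 1, 1]})

# undirected graph with 'weights' stored as an edge attribute
G = nx.from_pandas_edgelist(edges, source='source', target='target', edge_attr='weights')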

I have tried using groupby

df.groupby(['date','id'])['id'].unique().value_counts()
df.groupby(['date','id'])['id'].count()

and for loops (which were terrible).

I feel like I am missing a small step in the groupby, but I can't tell what it is.

Thank you for your help.

python_enthusiast
  • 12,177,535 lines, 24,800 individual 'id's – python_enthusiast Aug 14 '20 at 14:04
  • The problem with so many individual id's is that if you want all the combinations of 2 ids, you get something like 300 million rows (see the quick calculation after these comments), and I have the feeling that pandas will have trouble performing an operation like this; even storing the result in "regular" computer memory is probably too much – Ben.T Aug 14 '20 at 14:49
  • I have an alternative variable in the dataset that has 26 categories instead of 24800. Would this be doable? – python_enthusiast Aug 14 '20 at 14:50
  • sorry to bother again... would you know a way to do it for the 'id's? With the 'category' variable the network was fully connected and I can't do any further analysis. – python_enthusiast Aug 14 '20 at 16:47
  • That makes sense: with only 26 categories instead of 25K ids, especially with so many dates, you get an all-connected network. Honestly, I'm not sure it is possible with so much data on a computer. Sorry – Ben.T Aug 14 '20 at 17:28
  • Thanks anyway! I was thinking it might be possible to do the same thing you did on pyspark, but not sure. – python_enthusiast Aug 14 '20 at 17:29
  • Maybe. I have done some before and I know pivot exists, but I'm not sure about the rest. – Ben.T Aug 14 '20 at 17:55
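
For scale, the pair count mentioned in the comments is easy to check; a quick back-of-the-envelope with Python's standard library (the 24,800 figure comes from the comments above):

from math import comb

# all unordered pairs of 24,800 ids
print(comb(24_800, 2))   # 307507600 -> roughly 300 million candidate rows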

4 Answers


The idea is to use a pivot_table first to get True if the id has a pct_change for that date:

#first pivot to get True if any value of id for a date
df_ = df.pivot_table(index='id', columns='date', values='pct_change', 
                     aggfunc=any, fill_value=False)
print(df_)
date  2010-07-28 2010-07-29
id                         
60059       True       True
60060       True       True
60076       True      False

Then you can use combinations from itertools to create all the possible pairs, use them to select the rows, and use the & operator to see where both have True on the same date, then sum along the columns to get the weights column. Assign this column to the DataFrame created from both combination lists.

# get all combinations of ids
from itertools import combinations
import pandas as pd

a, b = map(list, zip(*combinations(df_.index, 2)))

res = (pd.DataFrame({'target': a, 'source': b})
         .assign(weights=(df_.loc[a].to_numpy()
                          & df_.loc[b].to_numpy()
                         ).sum(axis=1))
      )
print(res)
   target  source  weights
0   60059   60060        2
1   60059   60076        1
2   60060   60076        1

Note: don't forget to replace index='id' in the pivot_table with the name of your categorical column, otherwise there is a good chance that your computer won't be able to handle the following operations and will crash.
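
For illustration, assuming that categorical column is named 'category' (substitute the real column name), the only change is in the pivot step:

# pivot on the low-cardinality column instead of 'id'
df_ = df.pivot_table(index='category', columns='date', values='pct_change',
                     aggfunc=any, fill_value=False)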

Ben.T

Try this:

import pandas as pd
import itertools as itt

# square id x id matrix that will hold, for each pair, the number of shared dates
ids = df.id.unique()
WeightDf = pd.DataFrame(index=ids, columns=ids)
WeightDf.loc[:, :] = 0

def weigh(ID):
    # dates on which this id had a price change
    IdDates = set(df.loc[df.id == ID].date.to_list())
    for i in ids:
        # count the dates this id shares with every other id
        WeightDf.at[ID, i] = len(set.intersection(set(df.loc[df.id == i].date.to_list()), IdDates))

pd.Series(ids).apply(weigh)
print(WeightDf)

# flatten the matrix into an edge list with one row per unique pair
result = pd.DataFrame(columns=['Id1', 'Id2', 'Weight'])
for i1, i2 in itt.combinations(ids, 2):
    result = pd.concat([result, pd.DataFrame(data=[{'Id1': i1, 'Id2': i2, 'Weight': WeightDf.loc[i1, i2]}])])

print(result)

Kuldip Chaudhari

I have seen a lot of variations of this use case - generating combinations.

import io
import itertools
import pandas as pd

df = pd.read_csv(io.StringIO("""                date    id     pct_change
12355258    2010-07-28  60059   0.210210
12355265    2010-07-28  60060   0.592000
12355282    2010-07-29  60059   0.300273
12355307    2010-07-29  60060   0.481982
12355330    2010-07-28  60076   0.400729"""), sep=r"\s+")

# generate combinations of two... edge case when a group has only one member
# tuple of itself to itself
dfx = (df.groupby('date').agg({"id": lambda s: list(itertools.combinations(list(s), 2))
                               if len(list(s))>1 else [tuple(list(s)*2)]})
    .explode("id")
     .groupby("id").agg({"id":"count"})
     .rename(columns={"id":"weights"})
     .reset_index()
     .assign(target=lambda dfa: dfa["id"].apply(lambda s: s[0]),
           source=lambda dfa: dfa["id"].apply(lambda s: s[1]))
     .drop(columns="id")
)

print(dfx.to_string(index=False))

output

 weights  target  source
       2   60059   60060
       1   60059   60076
       1   60060   60076
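
If the column order from the question matters, a quick reorder of dfx lines the output up with the requested target/source/weights layout (optional):

print(dfx[['target', 'source', 'weights']].to_string(index=False))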
Rob Raymond

This SO link ended up having a faster answer to my question, which works with a large number of id's. It was closer to the groupby + value_counts that I tried using before.

Here's the code, to make it easier for people in the future:

import pandas as pd
from itertools import combinations

def combine(batch):
    """Combine all products within one batch into pairs"""
    return pd.Series(list(combinations(set(batch), 2)))

# one row per (id, id) pair per date, then count how many dates each pair shares
edges = df.groupby('date')['id'].apply(combine).value_counts()

c = ['source', 'target']
L = edges.index.values.tolist()
edges = pd.DataFrame(L, columns=c).join(edges.reset_index(drop=True))
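
One caveat: depending on the pandas version, the count column joined in the last line keeps the name produced by value_counts ('id' on older releases, 'count' on pandas 2.x) rather than 'weights'. A small, version-agnostic rename on top of the code above:

# give the joined count column the 'weights' name used in the question,
# whatever pandas happened to call it
edges = edges.rename(columns={edges.columns[-1]: 'weights'})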

python_enthusiast