-1

I have a python-pandas df that looks like this (image shows head of df):

enter image description here

As you can see for index 265 and index 743 the columns 'symbol', 'date' and 'hour' are equal. The only difference is the group. I want to know how often this happens, to be exact how often group 'BPG' has matching events with other groups (in this case matching event with 'BPF'). How can I count those events?

I tried grouping the df with test_df.groupby(["symbol", "date", "hour", "group"]).count() receiving this:

enter image description here

But how can I access the matches now? For example turning the last column now into a list of lists, where every list included holds all 'group'-values where the columns 'symbol', 'date' and 'hour' are equal would be helpful.

I.e. turning it into: [["BPG"]["BPG"]["MPG"]["BPF", "BPG"]]

Thanks for the help!

denisRei
  • 9
  • 2
  • (Can you please include a snippet of whatever `test_df.to_dict('split')` returns so that I can reproduce `test_df`? Use [`.sample`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) or `.iloc` to reduce the rows if its a very large DataFrame....) Anyway, you can probably use some combination of `groupby` (but maybe list `group` at the **start** instead of end) and [`.pivot`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html) and/or maybe [`.duplicated`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html) – Driftr95 Apr 01 '23 at 03:49
  • Images of data are difficult to use to demonstrate a solution You should make it easier for responders by providing a minimal example as above but in usable (code) form.. – user19077881 Apr 01 '23 at 09:30
  • Thanks for the tip @Driftr95 to reproduce the dataset you can find it here: https://github.com/SystemsLab-Sapienza/pump-and-dump-dataset (I work with the pump_telegram.csv). As said I am trying to find out which groups often publish the same symbol together. I did find an ugly and unclean way in python but I'm wondering what the solution in pandas would look like. – denisRei Apr 01 '23 at 12:10
  • Thank you for this tip @user19077881 ! I am new to posing questions so I will try and fix this. – denisRei Apr 01 '23 at 12:11

1 Answers1

0

I think this question actually has a very similar solution to what you need; to generalize it a bit for your issue:

# import pandas as pd
# pnd_repo_url = 'https://github.com/SystemsLab-Sapienza/pump-and-dump-dataset'
# csv_url = f'{pnd_repo_url}/blob/master/pump_telegram.csv'
# test_df = pd.read_html(csv_url)[0][['symbol', 'group', 'date', 'hour']]

mCol = 'group'
commons_df = test_df.groupby([c for c in test_df.columns if c!=mCol]).agg({mCol:list}) 
## OR # .agg(**{mCol:(mCol,list)})

and then commons_df should look like

opdf_implode

and list(commons_df['group']) should return that list of lists you mentioned:

[['BPF'], ['LUX'], ['BPF'], ['ATW'], ['ATW'], ['BPF'], ['BPF'], ['BPF'], ['ATW'], ['ATW'], ['MPG'], ['CPI'], ['TCG', 'BPF', 'CCS'], ['ATW'], ['ATW', 'LUX'], ['CPS'], ['C4P'], ['ATW'], ['ATW', 'LUX'], ['ATW'], ['ATW'], ['BPF'], ['BPF'], ['WCG', 'BPF'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['CPS', 'MPG', 'TWP', 'BPF', 'CPI'], ['TWP', 'BPF'], ['CP'], ['CP'], ['CP'], ['MPG', 'BPF', 'CPI'], ['CPI'], ['BPF', 'CPI'], ['CPI'], ['CPI'], ['CPI'], ['CPI'], ['FCS'], ['CPS', 'BPF'], ['TCC'], ['FCS'], ['BPG'], ['CCB', 'MPG', 'WCG', 'BPF', 'CPI'], ['CW', 'CPI'], ['LUX', 'BPF'], ['C4P'], ['TCG'], ['FCS'], ['FCS'], ['ATW'], ['ATW'], ['CPI'], ['CPS'], ['TCC'], ['BPF'], ['CPI'], ['BPF'], ['TWP', 'BPF'], ['CW'], ['CPS'], ['FCS'], ['CPI'], ['TCC'], ['CPI'], ['CPI'], ['CPI'], ['ATW'], ['LUX'], ['CPI'], ['CPI'], ['BPF'], ['BPF'], ['CPS'], ['TCC'], ['TCC'], ['TCG'], ['ATW'], ['ATW'], ['BPF'], ['ATW'], ['MPG'], ['BPF', 'CPI'], ['ATW'], ['CPI'], ['ATW'], ['CCB'], ['ATW'], ['MPG', 'BPF', 'CPI'], ['CPI'], ['CPI'], ['ATW'], ['ATW'], ['BPF'], ['CPS'], ['ATW'], ['ATW'], ['BPS'], ['BPS'], ['ATW'], ['ATW'], ['ATW'], ['BPS', 'BPF'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['CPS', 'TCG', 'CPI'], ['ATW'], ['ATW'], ['ATW'], ['BPF'], ['BPF', 'CW'], ['BPF'], ['BPF'], ['BPF'], ['FCS'], ['BPF'], ['ATW'], ['FCS'], ['WCG', 'BPF', 'SE'], ['BPG'], ['BPG', 'BPF'], ['ATW'], ['BPF'], ['BPF'], ['CPI'], ['ATW'], ['BPG', 'BPF'], ['BPG'], ['TCG', 'BPF', 'CCS'], ['BPF', 'CCS'], ['BPF', 'CPI'], ['TWP', 'BPF'], ['FCS'], ['C4P'], ['PKG'], ['C4P'], ['CPS'], ['BPF'], ['C4P'], ['PKG'], ['C4P'], ['CPI'], ['BPF'], ['TCC'], ['CPS'], ['ATW'], ['CPI'], ['BPF'], ['CPI'], ['PKG'], ['C4P'], ['C4P'], ['CPS'], ['CPS'], ['BPF'], ['BPF'], ['BPF'], ['BPG', 'CPI'], ['CCB'], ['CPI'], ['C4P'], ['BPF'], ['FCS'], ['BPS'], ['CPI'], ['TCC'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['BPF'], ['ATW'], ['PKG'], ['BPS'], ['CCB', 'BPF'], ['ATW'], ['CCB'], ['WCG'], ['CPI'], ['FCS'], ['LUX'], ['BPF'], ['BPF'], ['ATW'], ['CPI'], ['CPI'], ['CPI'], ['CPI'], ['ATW'], ['ATW'], ['CPI'], ['ATW'], ['MPG', 'TCG', 'BPF', 'CW', 'CPI'], ['BPF'], ['TCC', 'BPF'], ['C4P'], ['CPS', 'BPF'], ['TCG'], ['CPS', 'BPF'], ['CPS', 'TCG', 'BPF', 'SE'], ['CPI'], ['CPI'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['BPF'], ['C4P'], ['BPF'], ['BPF'], ['BPF'], ['BPS', 'BPF'], ['MPG', 'BPF'], ['MPG', 'WCG', 'CPI', 'CCS'], ['ATW'], ['FCS'], ['BPS'], ['BPS'], ['TCG'], ['WCG'], ['FCS'], ['BPG'], ['ATW'], ['ATW'], ['CPI'], ['ATW'], ['ATW'], ['CW'], ['CW'], ['CW'], ['BPS', 'BPF'], ['BPS'], ['CPI'], ['CPI'], ['BPF', 'CW'], ['BPF', 'CW'], ['ATW'], ['CPI'], ['CPI'], ['PKG'], ['C4P'], ['CPI'], ['BPF'], ['CPI'], ['CPS', 'BPF'], ['BPF'], ['CPI'], ['CPI'], ['CPI'], ['CPI'], ['WCG'], ['BPF', 'CPI'], ['CPI'], ['CPI'], ['CPI'], ['BPS', 'BPF'], ['TCG'], ['CPI'], ['TCG', 'BPF', 'CW', 'CPI'], ['ATW'], ['FCS'], ['FCS'], ['C4P'], ['C4P'], ['C4P'], ['C4P'], ['FCS'], ['ATW'], ['BPF'], ['BPF', 'CPI'], ['C4P'], ['CPI'], ['BPF'], ['TCC'], ['LUX'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['TCG'], ['ATW'], ['C4P', 'PKG'], ['CPI'], ['ATW'], ['CPI'], ['FCS'], ['CPI'], ['FCS'], ['TCC'], ['BPS', 'BPF'], ['BPF'], ['BPG'], ['BPG', 'BPF'], ['BPG', 'BPF'], ['CW'], ['CW', 'CPI'], ['TWP'], ['TWP', 'BPF'], ['BPS', 'BPF'], ['C4P'], ['C4P'], ['BPF'], ['ATW'], ['CPI'], ['CPI'], ['CPI'], ['ATW'], ['BPF'], ['CPS'], ['CPI'], ['CPI'], ['C4P'], ['ATW'], ['BPF'], ['CCB'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['CPI'], ['CPI'], ['C4P'], ['PKG'], ['PKG'], ['C4P'], ['C4P'], ['CPS'], ['CPS', 'BPF'], ['ATW'], ['ATW'], ['C4P'], ['C4P'], ['CPS', 'BPF'], ['ATW'], ['ATW'], ['PKG'], ['BPF'], ['CPI'], ['CW', 'CPI'], ['BPF'], ['CPS'], ['ATW'], ['BPF'], ['CPI'], ['CPI'], ['WCG', 'BPF'], ['BPF', 'CPI'], ['MPG', 'BPF', 'CPI'], ['TCG', 'CW'], ['BPF', 'CPI'], ['CPI'], ['TCC'], ['CW', 'CPI'], ['CPI'], ['CPI'], ['BPS'], ['ATW'], ['ATW'], ['BPF'], ['TCG', 'BPF', 'CW', 'CPI'], ['BPF'], ['TCG'], ['CPS', 'BPF', 'CPI'], ['CPI'], ['TWP'], ['MPG', 'BPF'], ['MPG'], ['BPF', 'CPI'], ['MPG', 'BPF', 'CW', 'CPI'], ['ATW'], ['ATW'], ['CPI'], ['CPI'], ['BPF'], ['BPG'], ['BPG', 'BPF'], ['MPG'], ['BPF', 'CPI'], ['BPF'], ['CPI'], ['ATW'], ['WCG'], ['ATW'], ['ATW'], ['ATW'], ['LUX'], ['ATW'], ['ATW'], ['ATW'], ['BPS'], ['FCS'], ['BPS'], ['TCG'], ['BPG'], ['BPG'], ['BPF'], ['MPG', 'TWP', 'BPF', 'CCS'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['C4P'], ['ATW'], ['ATW'], ['ATW'], ['BPF'], ['BPF'], ['CPS'], ['ATW'], ['ATW'], ['BPF'], ['BPF'], ['CPS'], ['ATW'], ['TCC'], ['CPI'], ['BPS'], ['BPF'], ['ATW'], ['CPI'], ['CPI'], ['CPI'], ['CPI'], ['ATW'], ['BPF'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['BPF'], ['ATW'], ['ATW'], ['CPS', 'BPF'], ['CPI'], ['FCS'], ['ATW'], ['FCS'], ['CPS', 'BPF'], ['TCC'], ['BPF'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['BPF'], ['CPI'], ['ATW'], ['CPI'], ['CPI'], ['TCC'], ['CPI'], ['TCG'], ['FCS'], ['TCG'], ['FCS'], ['BPF'], ['ATW'], ['ATW'], ['ATW'], ['BPF'], ['TWP', 'BPF'], ['TWP', 'BPF'], ['ATW'], ['ATW'], ['ATW'], ['CW', 'CPI'], ['C4P'], ['C4P'], ['C4P'], ['BPF'], ['ATW'], ['BPS'], ['FCS'], ['BPF'], ['TCG'], ['CPS'], ['CPS', 'BPF'], ['CPS'], ['CPS'], ['FCS'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['C4P'], ['TCG', 'BPF'], ['TWP', 'BPF'], ['BPS'], ['FCS'], ['BPS', 'BPF'], ['MPG', 'BPF', 'CPI'], ['CPI'], ['MPG', 'BPF'], ['CPS', 'BPF', 'SE', 'CPI'], ['CPI'], ['MPG', 'WCG', 'BPF'], ['TCG', 'BPF', 'CCS'], ['CP'], ['PKG'], ['MPG'], ['CPI'], ['BPS', 'CPI'], ['CPI'], ['BPF'], ['TCG'], ['TWP', 'BPF'], ['TWP', 'BPF'], ['BPS', 'BPF'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['BPF'], ['CPS'], ['BPF'], ['ATW'], ['ATW'], ['BPF', 'CPI'], ['ATW'], ['C4P'], ['BPS'], ['MPG', 'WCG', 'SE', 'CPI'], ['BPF'], ['TCG', 'CCS'], ['BPF'], ['BPF', 'CCS'], ['TCG', 'BPF'], ['TCG', 'BPF'], ['CPS', 'TWP', 'BPF'], ['CP'], ['CP'], ['TWP', 'BPF'], ['CPI'], ['TCG'], ['BPS'], ['BPG'], ['BPG'], ['WCG', 'BPF', 'SE'], ['BPG', 'BPF'], ['CW'], ['CW', 'CPI'], ['TCG'], ['CPS', 'TWP', 'BPF', 'CPI'], ['CPS', 'TCG', 'BPF'], ['ATW'], ['BPS'], ['ATW'], ['TCG'], ['CPI'], ['CPI'], ['CPI'], ['BPF', 'CW'], ['TCC', 'BPF'], ['BPF'], ['BPF'], ['BPF'], ['BPF'], ['BPF'], ['CPI'], ['MPG', 'WCG', 'CPI'], ['CPS'], ['CPS', 'BPF'], ['ATW'], ['ATW'], ['ATW'], ['LUX'], ['C4P'], ['BPF'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['BPF'], ['CPI'], ['TCC'], ['TCG'], ['C4P'], ['CPS', 'BPF'], ['WCG'], ['TCG'], ['CPI'], ['TWP'], ['BPF'], ['CPI'], ['ATW'], ['BPS', 'BPF'], ['CCB'], ['BPS', 'BPF'], ['CPI'], ['BPF', 'CW'], ['BPS', 'BPF'], ['BPF'], ['BPS'], ['BPS'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['BPG'], ['ATW'], ['ATW'], ['ATW'], ['TCG', 'BPF', 'CW'], ['CPI'], ['CPS', 'BPF', 'CPI'], ['CPS', 'TWP', 'CPI'], ['ATW'], ['C4P'], ['BPS'], ['CPI'], ['CPI'], ['BPF'], ['CPI'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['BPS'], ['CCB', 'BPF'], ['CPI'], ['CPI'], ['TCG', 'BPF'], ['CPI'], ['TCG'], ['MPG', 'BPF', 'CPI', 'CCS'], ['TCG', 'BPF', 'CPI'], ['CPS', 'TCG', 'BPF'], ['BPF'], ['ATW'], ['LUX'], ['ATW'], ['CPI'], ['CPI'], ['TWP', 'BPF'], ['BPG'], ['BPG'], ['MPG'], ['BPG', 'BPF'], ['BPF'], ['MPG', 'BPF'], ['BPG'], ['BPF', 'CPI'], ['BPG', 'BPF'], ['BPF', 'CW'], ['TCG', 'CCS'], ['CPS', 'BPF'], ['CPI'], ['CPS', 'TCG', 'BPF', 'CPI'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['CPI'], ['ATW'], ['BPS'], ['CCB'], ['ATW'], ['ATW'], ['BPF', 'CPI'], ['ATW'], ['ATW'], ['BPF'], ['CPI'], ['FCS', 'LUX'], ['BPF', 'CW'], ['CCS'], ['CPI'], ['BPF'], ['CPI'], ['ATW'], ['PKG'], ['ATW'], ['LUX'], ['ATW'], ['FCS'], ['ATW'], ['ATW'], ['ATW'], ['TCG'], ['TCG'], ['ATW'], ['BPF'], ['CPS'], ['CPS'], ['ATW'], ['BPF'], ['CPI'], ['ATW'], ['TCC'], ['TCG'], ['C4P'], ['ATW'], ['FCS'], ['BPS'], ['BPF'], ['CPI'], ['MPG', 'CCS'], ['CW'], ['CP'], ['WCG', 'BPF'], ['ATW'], ['CPI'], ['BPS', 'BPF', 'CPI'], ['CPI'], ['CPI'], ['MPG'], ['BPF'], ['ATW'], ['CPI'], ['CP'], ['ATW'], ['C4P'], ['C4P'], ['CPI'], ['TCG'], ['CPS', 'BPF'], ['ATW'], ['ATW'], ['ATW'], ['C4P'], ['BPF'], ['BPF'], ['ATW'], ['CPI'], ['C4P'], ['TCG'], ['CPI'], ['CPS'], ['CPI'], ['ATW'], ['ATW'], ['ATW'], ['TCC'], ['MPG'], ['CPI'], ['ATW'], ['ATW'], ['CCB'], ['ATW'], ['ATW'], ['BPF'], ['CPS'], ['ATW'], ['FCS', 'CCS'], ['ATW'], ['FCS'], ['BPF'], ['ATW'], ['CPI'], ['CPI'], ['CPI'], ['CPI'], ['CPS', 'TWP', 'BPF'], ['CPI'], ['TCC'], ['CPI'], ['CW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['TCC'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['CPI'], ['C4P'], ['C4P'], ['CPS'], ['WCG'], ['BPF', 'CW'], ['ATW'], ['FCS'], ['BPS'], ['TCG'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['BPF'], ['CPI'], ['CPI'], ['ATW', 'LUX'], ['C4P'], ['ATW'], ['PKG'], ['CPI'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['ATW'], ['BPF'], ['BPF'], ['BPF'], ['BPF'], ['FCS'], ['ATW', 'LUX'], ['CPI'], ['TCC'], ['TCC'], ['CCB'], ['ATW'], ['WCG', 'BPF'], ['CPI'], ['MPG', 'BPF', 'CW'], ['CPI'], ['MPG', 'CW'], ['CPI'], ['CPI'], ['MPG', 'CCS'], ['BPF', 'CPI'], ['BPF', 'CW'], ['CPS', 'BPF'], ['TWP', 'BPF'], ['TWP', 'BPF'], ['TCG', 'BPF'], ['BPS'], ['CW'], ['CPI'], ['CPS', 'BPF'], ['FCS'], ['TCG'], ['MPG'], ['CPI'], ['BPF'], ['LUX'], ['BPF'], ['MPG'], ['BPF', 'CPI'], ['CCB'], ['ATW'], ['BPF', 'CW'], ['ATW'], ['ATW'], ['CPI'], ['ATW'], ['ATW'], ['ATW'], ['CPS'], ['C4P'], ['CPS'], ['CPS'], ['FCS'], ['FCS'], ['CPS'], ['PKG'], ['ATW'], ['ATW'], ['LUX'], ['ATW'], ['ATW'], ['ATW'], ['CPS'], ['BPF'], ['CPI'], ['ATW'], ['ATW'], ['BPS'], ['BPF'], ['BPS'], ['BPS', 'BPF'], ['BPF'], ['LUX'], ['FCS'], ['BPF'], ['BPS'], ['C4P'], ['CPI'], ['CPI'], ['CPS', 'TWP', 'BPF'], ['PKG'], ['BPF'], ['BPF'], ['BPF']]

or you could have it a dictionary with commons_df['group'].to_dict()

{('611', '2018-08-31', '18:30'): ['BPF'], ('611', '2018-08-31', '18:31'): ['LUX'], ('611', '2018-09-07', '19:00'): ['BPF'], ('8BIT', '2018-01-23', '17:30'): ['ATW'], ('8BIT', '2018-01-29', '17:30'): ['ATW'], ('ABC', '2018-06-10', '17:00'): ['BPF'], ('ABC', '2018-07-05', '19:00'): ['BPF'], ('ACOIN', '2018-05-13', '17:00'): ['BPF']} ## [CLIPPED TO FIT]

Btw, I totally misunderstood your question at first and thought you wanted this symmetric df of how many times any group intersects with (i.e., has a row with the same symbol, date and hour as) another group, which can be computed with

mSet = set(test_df[mCol])
test_df_mi = test_df.set_index([test_df[mCol],test_df.index]).drop(mCol,axis=1)
intxn = lambda m1,m2: test_df_mi.loc[m1].merge(test_df_mi.loc[m2],how='inner') 
intxn_df = pd.DataFrame({mc: {mr: len(intxn(mc,mr)) for mr in mSet} for mc in mSet})
Driftr95
  • 4,572
  • 2
  • 9
  • 21