So I've been scouring stackoverflow for solutions to similar problems and keep hitting walls. I am new to python and using pandas/python for ETL so forgive me if I am not describing my situation adequately.
I have two dataframes df1 looks like:
Subscriber Key OtherID AnotherID
1 'abc' '12' '23'
2 'bcd' '45' '56'
3 'abc' '12' '23'
4 'abc' '12' '23'
5 'cde' '78' '90'
6 'bcd' '45' '56'
df2 looks like:
Subscriber Key OtherID AnotherID
1 'abc' '12' '23'
2 'bcd' '45' '56'
3 'cde' '78' '90'
I am trying to return a count the number of times SubscriberKey: 'abc' occurs in the dataframe. After finding the values, I would like to append the count to another dataframe (df2) which is my first dataframe deduplicated.
It would look like this:
Subscriber Key OtherID AnotherID Total Instances
1 'abc' '12' '23' '3'
2 'bcd' '45' '56' '1'
3 'cde' '78' '90' '1'
So what I did was try use this line:
df1.groupby(['SubscriberKey']).size()
The reason I only used 'SubscriberKey' was because some rows only had that column filled out with 'OtherID' and 'AnotherID' blank.
I have also tried Series.value_count(). When I try using groupby and size() and set the value of df2['Total Instances'] to the count of occurrences, it appears that the values do not line up correctly.
For example new table looks like this:
Subscriber Key OtherID AnotherID Total Instances
1 'abc' '12' '23' '1'
2 'bcd' '45' '56' '3'
3 'cde' '78' '90' '2'
So my original thought was maybe when doing groupby, the function sorts my output automatically. I tried to check by saving the groupby'd table as a csv and realized it only prints out the count column and not the associated subscriberkey column with it.
Anyhow, does anybody have any input as to how I can achieve this? To reiterate, I wanted to essentially just add a column to df2 that returns total # of occurrences or instances within df1.
Thanks!