So I've been scouring Stack Overflow for solutions to similar problems and keep hitting walls. I am new to Python and to using pandas for ETL, so forgive me if I don't describe my situation adequately.

I have two dataframes. df1 looks like:

    Subscriber Key  OtherID  AnotherID
1     'abc'           '12'    '23'
2     'bcd'           '45'    '56'
3     'abc'           '12'    '23'
4     'abc'           '12'    '23'
5     'cde'           '78'    '90'
6     'bcd'           '45'    '56'

df2 looks like:

    Subscriber Key  OtherID  AnotherID
1     'abc'           '12'    '23'
2     'bcd'           '45'    '56'
3     'cde'           '78'    '90'

I am trying to count the number of times a Subscriber Key such as 'abc' occurs in the dataframe. After finding the counts, I would like to append them to another dataframe (df2), which is my first dataframe deduplicated.

It would look like this:

    Subscriber Key  OtherID  AnotherID Total Instances
1     'abc'           '12'    '23'           '3'
2     'bcd'           '45'    '56'           '2'
3     'cde'           '78'    '90'           '1'

So what I did was try this line:

    df1.groupby(['Subscriber Key']).size()

The reason I only grouped by 'Subscriber Key' was that some rows had only that column filled in, with 'OtherID' and 'AnotherID' blank.

I have also tried Series.value_counts(). When I use groupby and size() and set df2['Total Instances'] to the resulting counts, the values do not line up correctly.

For example, the new table looks like this:

    Subscriber Key  OtherID  AnotherID Total Instances
1     'abc'           '12'    '23'           '1'
2     'bcd'           '45'    '56'           '3'
3     'cde'           '78'    '90'           '2'

So my original thought was that maybe groupby sorts its output automatically. I tried to check by saving the grouped table as a CSV and realized it only writes out the count column, not the associated Subscriber Key column.
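One way this mismatch can arise (a minimal sketch rebuilding the sample frames above; the real data is an assumption): `groupby(...).size()` returns a Series indexed by the key values, and assigning a Series to a dataframe column aligns on index labels rather than position:

```python
import pandas as pd

# Sample frames rebuilt from the question (an assumption about the real data)
df1 = pd.DataFrame({
    'Subscriber Key': ['abc', 'bcd', 'abc', 'abc', 'cde', 'bcd'],
    'OtherID':        ['12', '45', '12', '12', '78', '45'],
    'AnotherID':      ['23', '56', '23', '23', '90', '56'],
})
df2 = df1.drop_duplicates().reset_index(drop=True)

counts = df1.groupby('Subscriber Key').size()
# counts is a Series indexed by the key values ('abc', 'bcd', 'cde'),
# while df2 has a plain 0..2 integer index. Assignment aligns on
# index labels, not position, so none of the counts match up:
df2['Total Instances'] = counts
print(df2['Total Instances'])  # all NaN
```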

Anyhow, does anybody have any input on how I can achieve this? To reiterate, I essentially just want to add a column to df2 with the total number of occurrences of each key within df1.

Thanks!

1 Answer


You can try:

    df2['Total Instances'] = df2['Subscriber Key'].map(df1['Subscriber Key'].value_counts())
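Reproducing the sample data from the question (rebuilt here, so the exact frames are an assumption), the line fills in the counts by label rather than by position:

```python
import pandas as pd

# Sample frames rebuilt from the question (an assumption about the real data)
df1 = pd.DataFrame({
    'Subscriber Key': ['abc', 'bcd', 'abc', 'abc', 'cde', 'bcd'],
    'OtherID':        ['12', '45', '12', '12', '78', '45'],
    'AnotherID':      ['23', '56', '23', '23', '90', '56'],
})
df2 = df1.drop_duplicates().reset_index(drop=True)

# value_counts() is indexed by key, so map() matches by label, not position
df2['Total Instances'] = df2['Subscriber Key'].map(df1['Subscriber Key'].value_counts())
print(df2['Total Instances'].tolist())  # [3, 2, 1] — note 'bcd' appears twice in df1
```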
niraj
  • Hi, thank you so much! This achieved exactly what I wanted. Could you clarify whether my logic is correct? Essentially you're mapping my deduped dataframe (df2) back to df1. So my first key, say 'abc' in df2, would be "tied" to every key called 'abc' in df1? And then a count is applied over the mapping? Sorry, just trying to understand how your line works exactly – user6749426 Sep 24 '17 at 21:27
  • Great! glad it helped. `value_counts()` returns `Series` and you want to map them to the `df2` on `Key`. Please refer to example https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html – niraj Sep 24 '17 at 21:31
  • So am I: 1) creating a dictionary by mapping values in df1 to df2 on the key, and 2) getting the series of matching values by using .value_counts()? Per the article, it seems like I'm essentially making a dictionary – user6749426 Sep 24 '17 at 21:36
  • `map`: Map values of a Series using an input correspondence (which can be a dict, Series, or function). So, in this case `df2['Subscriber Key']` and `df1['Subscriber Key'].value_counts()` are two `Series`, and the mapping is between the two series. It matches on the keys, though; you can swap rows in the test example and see that it still gives the correct result. – niraj Sep 24 '17 at 21:41
  • Ok thanks, I see. And so another question: when running just df1['subscriber key'].value_counts(), I see that it shows the subscriber key alongside a count, so it looks as though the subscriber keys are grouped when I call .value_counts()? So does map take two dataframes and "join" them on the column I choose, to return a third column (the count)? Sorry if these are dumb questions – user6749426 Sep 24 '17 at 21:48
  • Yes, it counts each `subscriber key` with `value_counts()` https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html I would suggest trying what happens when you change the column name: https://stackoverflow.com/a/24216489/5916727 – niraj Sep 24 '17 at 22:01
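The mechanics the comments are circling can be shown in isolation (a minimal sketch with made-up keys): `value_counts()` returns a Series whose index is the unique keys, and `map()` with a Series argument behaves like a dictionary lookup against that index — no join of two dataframes is involved:

```python
import pandas as pd

counts = pd.Series(['abc', 'bcd', 'abc', 'abc', 'cde', 'bcd']).value_counts()
# counts is a Series with the unique keys as its index:
#   abc    3
#   bcd    2
#   cde    1
keys = pd.Series(['abc', 'bcd', 'cde'])
# map() looks each element of keys up in counts' index, like a dict lookup
print(keys.map(counts).tolist())  # [3, 2, 1]
```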