1

Want to find the number of unique dates corresponds to a set of values. If the values of Col1, Col2, Col3 are same then how many instances are there. I could do with only year or month or day, but want to combine all so that I can find for every unique date(yyyy/mm/dd).

BldgID  BldgHt  Device  Date
108     28      760     2018/9/18
108     28      760     2018/10/5
108     28      760     2018/10/18
108     28      760     2018/10/19
104     36      758     2018/10/2
104     36      758     2018/11/2
104     41      758     2018/9/13
104     41      758     2018/10/3
104     41      758     2018/10/16
104     41      758     2018/10/25
104     41      758     2018/11/3
104     41      758     2018/11/7
104     45      758     2018/10/3
104     45      758     2018/11/3

By changing the dt.month to dt.year or dt.day I can get year, month or day wise, but want the entire date wise.

df_test = df.Date.apply(lambda dt: dt.date()).groupby\
([df.BldgID, df.BldgHt, df.Device, df.Date.apply(lambda dt:\    
  dt.month)]).nunique()

df_test.head(10)

Here is the month-wise values.

 BldgID  BldgHt  Device   Date
 108     28.0    760      9       1
                          10      3
 104     36.0    758      10      1
                          11      1
         41.0    758      9       1
                 758      10      3
                 758      11      2
         45.0    758      10      1
         45.0    758      11      1

 Name: Date, dtype: int64

Desired output:

BldgID  BldgHt  Device   Total_count
108     28.0    760           4
104     36.0    758           2
        41.0    758           6
104     45.0    758           2

Total_count is the sum up the counts for every corresponding set of unique 'BldgID, BldgHt, Device'.

Thanks in advance.

Fakir
  • 139
  • 11

3 Answers3

4

The follwing soloution works for me. Let's generate your data first:

df = pd.DataFrame({'BldgID': [108, 108, 108, 108, 104, 104, 104, 104, 104, 104, 104, 104, 104, 104], 'BldgHt': [28, 28, 28, 28, 36, 36, 41, 41, 41, 41, 41, 41, 45, 45], 'Device': [760, 760, 760, 760, 758, 758,  758, 758, 758, 758, 758, 758, 758, 758], 'Date': ['2018/9/18', '2018/10/5', '2018/10/18', '2018/10/19', '2018/10/2', '2018/11/2', '2018/9/13', '2018/10/3', '2018/10/16', '2018/10/25', '2018/11/3', '2018/11/7', '2018/10/3', '2018/11/3',]})
df['Date'] = df['Date'].apply(lambda x: pd.to_datetime(x))

Now you can do the following:

df2 = df.groupby(['BldgID', 'BldgHt', 'Device'])['Date'].nunique()

which gives:

print(df2)

BldgID  BldgHt  Device
104     36      758       2
        41      758       6
        45      758       2
108     28      760       4

EDIT:

If you'd like to rename the column as specified in your question, you can achieve it as follows:

df2 = df2.reset_index(drop=False).rename(columns={'Date': 'Total_count'})

or more elegantly:

df2 = df2.reset_index(name='Total_count')
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
bexi
  • 1,186
  • 5
  • 9
3

to add to the already great answers

df.groupby(['BldgID','BldgHt','Device'])['Date'].count()
  • Thanks. That is what I wanted. – Fakir Aug 29 '19 at 08:39
  • This is answer works well for your given sample. Just note that this will not give the desired result if there are duplicate dates in a group. – bexi Aug 29 '19 at 08:46
  • You mean, even if the ['BldgID','BldgHt','Device'] is different but the 'Date' is same, it won't work? Then it might be a problem. – Fakir Aug 29 '19 at 09:19
  • No, I meant that if within the same `['BldgID','BldgHt','Device']` group, you have the same date twice, the method `count()` will count it twice. But you seem to be interested in _unique_ dates, hence they should only be counted once. – bexi Aug 29 '19 at 11:52
  • Thanks. Got it. Have a quest. If I groupby with 'BldgID, BldgHt, Device' with a 5th col ('Result'), but want to keep the 4th col 'Date' as it is in the df. Could it be done? Want to groupby with the 1st 3 val and later use the corresponding 'Date' col. Confusing question? – Fakir Sep 10 '19 at 11:28
2
df.groupby(['BldgID','BldgHt','Device'], sort=False)['Date'].size().reset_index()

Output

BldgID  BldgHt  Device  Date
0   108     28  760     4
1   104     36  758     2
2   104     41  758     6
3   104     45  758     2
moys
  • 7,747
  • 2
  • 11
  • 42
  • Thank you. BTW, just adding .reset_index() to the 2nd ans, it gives the same ans, as yours. What is the benefits of sort and size(). Could you pl tell? – Fakir Aug 29 '19 at 08:38
  • by default sort is `True`, so if you don't make it false, the order of your data & the output will be different. `size()` includes NaN values, `count()` does not: – moys Aug 29 '19 at 08:42