3

I have a dataset with several date fields including hours. I want to use one of them as my df index, and count the number of entries which where created each day. In other words, if I have:

Date | Several features
2020-02-08 10h00 | ...
2020-02-08 11h00 | ...
2020-02-10 10h00 | ...
2020-02-10 11h00 | ...
2020-02-10 13h00 | ... 

I want to get:

2020-02-08 | 2
2020-02-10 | 3

For this, I am doing:

df["datetime"] = pd.to_datetime(df["datetime"])
df = df.set_index('datetime')
df.resample('D')["id"].count()

where id is an unique identifier each entry has.

However, I am getting the following output:

2020-02-08 | 2
2020-02-09 | 0
2020-02-10 | 3

How can I get rid of the "2020-02-09" row? I only want to count the occurences of the days I have on my dataset, not the ones I do not have.

Luiscri
  • 913
  • 1
  • 13
  • 40

1 Answers1

2

Use Series.dt.date with DataFrame.groupby:

df["datetime"] = pd.to_datetime(df["datetime"])
df = df.groupby(df["datetime"].dt.date)["id"].count()

If need DatetimeIndex working also well in oldier pandas versions use Series.dt.normalize:

df["datetime"] = pd.to_datetime(df["datetime"])
df = df.groupby(df["datetime"].dt.normalize())["id"].count()

In your solution if are generated missing values remove them:

df["datetime"] = pd.to_datetime(df["datetime"])
df = df.set_index('datetime').resample('D')["id"].count().dropna()

Or remove 0 rows:

df["datetime"] = pd.to_datetime(df["datetime"])
df = df.set_index('datetime').resample('D')["id"].count().loc[lambda x: x!= 0]
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I'd stay away from `dt.date` as Pandas doesn't play very well with `datetime.date`. – Quang Hoang Feb 10 '21 at 14:37
  • 1
    @QuangHoang - Ya, in some oldier versions, in new one should working better, but agree, I can add normalize alternative. Thank you. – jezrael Feb 10 '21 at 14:38
  • Thanks for your answer. If I use your solution and then I want to group by month, is it possible or in that case I should use my approach and go for a `.resample('M')` ? @jezrael – Luiscri Feb 10 '21 at 15:04
  • 1
    @Luiscri Tnes is possible use resample solutions from my answers or `df = df.groupby(df["datetime"].dt.to_period('m')) ["id"].count()` – jezrael Feb 10 '21 at 15:31