
I have a dataset of some criminal attacks that is structured in this way:

DATE        EVENT
01/02/2019  "x has been attacked"
01/02/2019  "y has been attacked"
01/02/2019  ""
03/02/2019  "..."

I would like to know if there is an efficient way in Python to transform this dataset into a count time series.

That is, a time series that:

  • has no gaps between the dates
  • has, in the second column, the number of events for each date (zeros included)

Something like this:

DATE        NUMBER OF EVENTS
01/02/2019  3
02/02/2019  0
03/02/2019  1
...

Thanks a lot.

pietrosan
  • Please include a small subset of your data as a __copyable__ piece of code that can be used for testing as well as your expected output. See [MRE - Minimal, Reproducible, Example](https://stackoverflow.com/help/minimal-reproducible-example), and [How to make good reproducible pandas examples](https://stackoverflow.com/q/20109391/15497888). – Henry Ecker May 08 '21 at 17:44
  • df.groupby('DATE').count() – adhg May 08 '21 at 17:56

2 Answers


Assuming your dataset is a CSV file, you can use pandas.read_csv to read it.

The result would be like this:

import pandas as pd
df = pd.read_csv('data.csv') 
print(df)

    date        event
0   2021-01-01  x has been attacked
1   2021-01-01  y has been attacked
2   2021-01-01  z has been attacked
3   2021-01-02  a has been attacked
4   2021-01-03  e has been attacked
5   2021-01-03  b has been attacked
6   2021-01-04  c has been attacked
7   2021-01-05  d has been attacked

Then you can use pandas groupby to get your desired result:

# count events per date
data = df.groupby("date").size().reset_index(name='number of events')
data["date"] = pd.to_datetime(data["date"])

# build the complete date range and merge, so missing dates get a 0
final_df = pd.merge(
    pd.DataFrame({"date": pd.date_range(min(df.date), max(df.date))}),
    data, on="date", how="outer"
).fillna(0)
print(final_df)


    date        number of events
0   2021-01-01  3.0
1   2021-01-02  0.0
2   2021-01-03  2.0
3   2021-01-04  1.0
4   2021-01-05  1.0
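
Note that the outer merge introduces NaN for the dates with no events, so after fillna(0) the count column is float (hence the 3.0 values above). A minimal follow-up, reusing the names from the snippet above, to cast it back to integers:

# fillna leaves the counts as floats; cast back to int for whole-number counts
final_df["number of events"] = final_df["number of events"].astype(int)
print(final_df)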
Rishin Rahim

I'd use groupby to group by date and get the count of events per date.

Code

# parse the dd/mm/yyyy dates, then count events per date
df['DATE'] = pd.to_datetime(df.DATE, dayfirst=True)
df.groupby('DATE').count().reset_index().rename(columns={'EVENT': 'Number of Events'})

Sample Input

d="""DATE,EVENT
01/02/2019,x has been attacked
01/02/2019,y has been attacked
01/02/2019,x attacked
03/02/2019,z attacked
03/02/2019,f attacked
04/02/2019,a attacked
05/02/2019,l attacked
05/02/2019,q attacked"""
df=pd.read_csv(StringIO(d))

Output

    DATE        Number of Events
0   2019-02-01  3
1   2019-02-03  2
2   2019-02-04  1
3   2019-02-05  2

Scenario 2:

In case we need all dates, including those on which no criminal attack occurred (like 2019-02-02 in the case above).

resample can be used for this case.

Code

df['DATE'] = pd.to_datetime(df.DATE, dayfirst=True)
# resampling at daily frequency fills in the missing dates with a count of 0
df.set_index('DATE').resample('D').count().reset_index().rename(columns={'EVENT': 'Number of Events'})

Output

    DATE        Number of Events
0   2019-02-01  3
1   2019-02-02  0
2   2019-02-03  2
3   2019-02-04  1
4   2019-02-05  2
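
For what it's worth, the same zero-filled series can also be built without resample, by reindexing the per-date counts over a complete date range. A minimal sketch using the same sample data (reindex and date_range are standard pandas; the variable names are just illustrative):

# count events per date, then reindex over the full daily range,
# filling the missing days (e.g. 2019-02-02) with 0
counts = df.groupby('DATE').size()
full_range = pd.date_range(counts.index.min(), counts.index.max(), freq='D')
counts.reindex(full_range, fill_value=0).rename_axis('DATE').reset_index(name='Number of Events')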
Utsav