
My data contains trips, with datetime info, a user id, and a trip type (single, round, or pseudo) for each trip.

Here's a data sample (a pandas DataFrame named All_Data):

HoraDTRetirada        idpass        type                                             
2016-02-17 15:36:00  39579449489   'single'  
2016-02-18 19:13:00  39579449489   'single' 
2016-02-26 09:20:00  72986744521   'pseudo' 
2016-02-27 12:11:00  72986744521   'round'  
2016-02-27 14:55:00  11533148958   'pseudo'
2016-02-28 12:27:00  72986744521   'round'
2016-02-28 16:32:00  72986744521   'round'

I would like to count, per user, how many times each trip type occurs within a given "week of year".

For example, if one event happens on a Monday and the next event for the same user happens on a Thursday, that makes two events in the same week; however, if one event happens on a Saturday and the next on the following Monday, they happened in different weeks.
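For illustration, one week-numbering convention that matches this behaviour is strftime's `%U`, which starts weeks on Sunday, so a Saturday and the following Monday get different numbers:

```python
import pandas as pd

# 2016-02-27 is a Saturday, 2016-02-29 the following Monday.
dates = pd.Series(pd.to_datetime(["2016-02-27", "2016-02-29"]))

# '%U' numbers weeks with Sunday as the first day, so the Saturday
# trip and the Monday trip land in different weeks.
weeks = dates.dt.strftime("%U")
print(weeks.tolist())  # → ['08', '09']
```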

The output I am looking for would be in a form like this:

idpass        weekofyear   type      frequency
39579449489    1           'single'   2
72986744521    2           'round'    3
72986744521    2           'pseudo'   1
11533148958    2           'pseudo'   1

Edit: this older question approaches a similar problem, but I don't know how to do it with pandas.

Helk

3 Answers

import pandas as pd

data = {"HoraDTRetirada": ["2016-02-17 15:36:00", "2016-02-18 19:13:00", "2016-12-31 09:20:00", "2016-02-28 12:11:00",
                           "2016-02-28 14:55:00", "2016-02-29 12:27:00", "2016-02-29 16:32:00"],
        "idpass": ["39579449489", "39579449489", "72986744521", "72986744521", "11533148958", "72986744521",
                   "72986744521"],
        "type": ["single", "single", "pseudo", "round", "pseudo", "round", "round"]}
df = pd.DataFrame.from_dict(data)
print(df)

# Parse the timestamps and derive the week of year ('%U': weeks start on Sunday).
df["HoraDTRetirada"] = pd.to_datetime(df["HoraDTRetirada"])
df["week"] = df["HoraDTRetirada"].dt.strftime("%U")

# Count rows per (user, week, type); count() fills in every remaining column.
k = df.groupby(["idpass", "week", "type"], as_index=False).count()
print(k)

Output:

        HoraDTRetirada       idpass    type
0  2016-02-17 15:36:00  39579449489  single
1  2016-02-18 19:13:00  39579449489  single
2  2016-12-31 09:20:00  72986744521  pseudo
3  2016-02-28 12:11:00  72986744521   round
4  2016-02-28 14:55:00  11533148958  pseudo
5  2016-02-29 12:27:00  72986744521   round
6  2016-02-29 16:32:00  72986744521   round
        idpass week    type  HoraDTRetirada
0  11533148958   09  pseudo               1
1  39579449489   07  single               2
2  72986744521   09   round               3
3  72986744521   52  pseudo               1
Himaprasoon
  • my original dataset has many more columns than the ones listed in the sample, so this method ends up printing out all the other columns that I don't want. The week numbering is just what I wanted though. Thanks for that. – Helk Jul 06 '17 at 13:52
  • I got the desired output by just doing `All_Data.groupby(['idpass','type','week']).size()` instead of your last line. – Helk Jul 06 '17 at 14:09

This is how I got what I was looking for:

Step 1 from the suggested answers was skipped because the timestamps were already in pandas datetime form.

Step 2: create a column for the week of the year:

df['week'] = df['HoraDTRetirada'].dt.strftime('%U')

Step 3: group by user id, type, and week, and count values with size():

df.groupby(['idpass','type','week']).size()
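Put together on a small slice of the sample data (column names assumed from the question), the two steps can be sketched like this, also renaming the count to match the desired `frequency` column:

```python
import pandas as pd

# A slice of the question's sample data.
df = pd.DataFrame({
    "HoraDTRetirada": pd.to_datetime(["2016-02-17 15:36:00",
                                      "2016-02-18 19:13:00",
                                      "2016-02-27 14:55:00"]),
    "idpass": ["39579449489", "39579449489", "11533148958"],
    "type": ["single", "single", "pseudo"],
})

# Week of year ('%U': weeks start on Sunday).
df["week"] = df["HoraDTRetirada"].dt.strftime("%U")

# size() counts rows per group; reset_index names the count column.
out = (df.groupby(["idpass", "week", "type"])
         .size()
         .reset_index(name="frequency"))
print(out)
```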

Helk

My suggestion would be to do this:


  1. Make sure your timestamp column is a pandas datetime, and add a frequency column:

    df['HoraDTRetirada'] = pd.to_datetime(df['HoraDTRetirada'])

    df['freq'] = 1

  2. Group it and count:

    res = df.groupby(['idpass', 'type', pd.Grouper(key='HoraDTRetirada', freq='1W')]).count().reset_index()

  3. Convert the timestamps to a week of the year:

    res['HoraDTRetirada'] = res['HoraDTRetirada'].apply(lambda x: x.week)
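Composed end to end on the question's sample data (column names assumed from the post), the three steps can be sketched like this; `isocalendar()[1]` is used here as a version-safe way to get the week number:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "HoraDTRetirada": pd.to_datetime([
        "2016-02-17 15:36:00", "2016-02-18 19:13:00",
        "2016-02-26 09:20:00", "2016-02-27 12:11:00",
        "2016-02-27 14:55:00", "2016-02-28 12:27:00",
        "2016-02-28 16:32:00",
    ]),
    "idpass": ["39579449489", "39579449489", "72986744521", "72986744521",
               "11533148958", "72986744521", "72986744521"],
    "type": ["single", "single", "pseudo", "round", "pseudo", "round", "round"],
})
df["freq"] = 1  # dummy column so count() yields a frequency

# Bucket timestamps into calendar weeks; '1W' weeks end on Sunday by default.
res = (df.groupby(["idpass", "type",
                   pd.Grouper(key="HoraDTRetirada", freq="1W")])
         .count()
         .reset_index())

# Replace each week-ending timestamp with its ISO week number.
res["HoraDTRetirada"] = res["HoraDTRetirada"].apply(lambda x: x.isocalendar()[1])
print(res)
```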

The final result has one row per (idpass, type, week) combination, with the count in the freq column.

EDIT:

You are right, in your case we should do step 3 before step 2. If you do that, remember that the groupby changes, so step 2 becomes (note it now operates on df, before the groupby):

df['HoraDTRetirada'] = df['HoraDTRetirada'].apply(lambda x: x.week)

and step 3:

res = df.groupby(['idpass', 'type', 'HoraDTRetirada']).count().reset_index()

It's a bit different because the "Hora" variable is not a timestamp anymore, but just an int representing a week.
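As a minimal runnable sketch of the reordered steps (using a slice of the question's sample data; `isocalendar()[1]` is used as a version-safe way to get the week number):

```python
import pandas as pd

# A slice of the question's sample data (names assumed from the post).
df = pd.DataFrame({
    "HoraDTRetirada": pd.to_datetime(["2016-02-27 12:11:00",
                                      "2016-02-28 12:27:00",
                                      "2016-02-28 16:32:00"]),
    "idpass": ["72986744521"] * 3,
    "type": ["round"] * 3,
})
df["freq"] = 1  # dummy column so count() yields a frequency

# Convert to a plain week number first (ISO weeks run Monday-Sunday) ...
df["HoraDTRetirada"] = df["HoraDTRetirada"].apply(lambda x: x.isocalendar()[1])

# ... then group on the int column; no pd.Grouper is needed anymore.
res = df.groupby(["idpass", "type", "HoraDTRetirada"]).count().reset_index()
print(res)
```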

Kacper Wolkowski
  • i wanted to group by week, so I think in your model, the step 3 should come before the step 2, grouping by week instead of datetime. I tried doing that but I got the following error: `TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'` – Helk Jul 06 '17 at 13:50
  • Updated - let me know if it's fine now – Kacper Wolkowski Jul 06 '17 at 14:40