
I am reading from a CSV file which contains the following data:

Create Date,           Original Filesize,  Number of Encodings
2016-01-27 16:27:10,   855583191,          1
2016-01-27 16:28:11,   854336,             1
2016-01-27 16:28:25,   854336,             1
2016-01-27 16:30:12,   9691559,            1

I want to count the number of records generated every two minutes in the table.

E.g. if we start from 2016-01-27 16:27:00, then from 2016-01-27 16:27:01 to 2016-01-27 16:29:00 we have 3 records, from 2016-01-27 16:29:01 to 2016-01-27 16:31:00 we have 1 record, and so on.

I am trying to use pandas as I want to do some data analysis on it later. Is there an easy way to do this with/without pandas?

Also, I used MySQL to export this data in the first place. If it is easier using SQL, that is also ok.

mx0
kosta
    This is a duplicate of http://stackoverflow.com/questions/11073609/how-to-group-dataframe-by-a-period-of-time. – Peter Mar 29 '16 at 07:05

1 Answer

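Before resampling, Create Date has to be parsed as a DatetimeIndex. A minimal setup sketch, assuming the CSV layout from the question (the data is inlined here so it runs standalone; normally you would pass a filename):

```python
from io import StringIO
import pandas as pd

# inline copy of the question's CSV (stand-in for the real file)
data = """Create Date,Original Filesize,Number of Encodings
2016-01-27 16:27:10,855583191,1
2016-01-27 16:28:11,854336,1
2016-01-27 16:28:25,854336,1
2016-01-27 16:30:12,9691559,1"""

# parse the timestamp column and make it the index so resample works
df = pd.read_csv(StringIO(data), parse_dates=['Create Date'],
                 index_col='Create Date')
```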

I think you can use resample with sum:

print(df.resample('2min').sum())
                     Original Filesize  Number of Encodings
Create Date                                                
2016-01-27 16:26:00          855583191                    1
2016-01-27 16:28:00            1708672                    2
2016-01-27 16:30:00            9691559                    1

print(df.resample('2min', base=1).sum())
                     Original Filesize  Number of Encodings
Create Date                                                
2016-01-27 16:27:00          857291863                    3
2016-01-27 16:29:00            9691559                    1
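Note that in newer pandas versions (1.1+) the base argument is deprecated; offset shifts the bin edges the same way. A sketch, rebuilding the sample frame inline so it runs standalone:

```python
import pandas as pd

# rebuild the sample frame inline so the snippet is self-contained
idx = pd.DatetimeIndex(pd.to_datetime(
    ['2016-01-27 16:27:10', '2016-01-27 16:28:11',
     '2016-01-27 16:28:25', '2016-01-27 16:30:12']), name='Create Date')
df = pd.DataFrame({'Original Filesize': [855583191, 854336, 854336, 9691559],
                   'Number of Encodings': [1, 1, 1, 1]}, index=idx)

# offset='1min' starts the 2-minute bins at 16:27:00, like base=1 did
print(df.resample('2min', offset='1min').sum())
```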

Or if you need to aggregate only the Number of Encodings column:

print(df.resample('2min')['Number of Encodings'].sum().reset_index())
          Create Date  Number of Encodings
0 2016-01-27 16:26:00                    1
1 2016-01-27 16:28:00                    2
2 2016-01-27 16:30:00                    1

print(df.resample('2min', base=1)['Number of Encodings'].sum().reset_index())
          Create Date  Number of Encodings
0 2016-01-27 16:27:00                    3
1 2016-01-27 16:29:00                    1

A more universal approach is to aggregate by size, if you only need to count the rows:

print(df.resample('2min').size().reset_index(name='count'))
          Create Date  count
0 2016-01-27 16:26:00      1
1 2016-01-27 16:28:00      2
2 2016-01-27 16:30:00      1

print(df.resample('2min', base=1).size().reset_index(name='count'))
          Create Date  count
0 2016-01-27 16:27:00      3
1 2016-01-27 16:29:00      1
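And for the question's "without pandas" case, the same two-minute bucketing can be sketched in plain Python; the reference timestamp and the 60 s bucket offset below are my assumptions to match the question's example bins:

```python
from collections import Counter
from datetime import datetime, timedelta

# timestamps as they would come out of the CSV's Create Date column
rows = ['2016-01-27 16:27:10', '2016-01-27 16:28:11',
        '2016-01-27 16:28:25', '2016-01-27 16:30:12']

ref = datetime(2016, 1, 27)          # any fixed reference point works
counts = Counter()
for s in rows:
    ts = datetime.strptime(s, '%Y-%m-%d %H:%M:%S')
    secs = (ts - ref).total_seconds()
    # floor to a 120 s bucket, shifted by 60 s so bins start on odd
    # minutes (16:27:00, 16:29:00, ...) as in the question's example
    start = ref + timedelta(seconds=(secs - 60) // 120 * 120 + 60)
    counts[start] += 1

for start in sorted(counts):
    print(start, counts[start])
```

This prints a count of 3 for the 16:27:00 bucket and 1 for the 16:29:00 bucket, matching the base=1 resample above.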
jezrael