I need to count the number of consecutive occurences where hole_in_one
equals one in a very large dataset. As I cannot iterate I'm not sure how to approach this problem. Some more background info ...
I'm working with an very large dataset (> 1e9 rows), which basically contains the following:
| user_id | timestamp | hole_in_one |
+---------+-----------+-------------+
| 100 | 1000 | 0 |
| 100 | 1100 | 0 |
| 100 | 1204 | 1 |
| 100 | 3000 | 1 |
| 100 | 4000 | 1 |
| 200 | 1001 | 0 |
| 200 | 2000 | 0 |
| 200 | 3000 | 1 |
| 200 | 4000 | 0 |
...
This data is stored in a .csv file, I'm using the python datatable library to read in this data.
train_df = dt.fread("data.csv")
To the dataset I want to add a column which counts the number of consecutive hole_in_one occurences (count_streak) for each user. E.g.:
| user_id | timestamp | hole_in_one | count_streak |
+---------+-----------+-------------+--------------+
| 100 | 1000 | 0 | 0
| 100 | 1100 | 0 | 0
| 100 | 1204 | 1 | 0
| 100 | 3000 | 1 | 1
| 100 | 4000 | 1 | 2
| 200 | 1001 | 1 | 0
| 200 | 2000 | 0 | 0
| 200 | 3000 | 1 | 0
| 200 | 4000 | 0 | 0
...
I know how to do basic operations (without iterating), e.g. filtering data, summing values etc. adding a column and such (note: datatable is similar to R programming language). However as I'm dealing with a very large dataset, I cannot simply iterate over the rows to count the number of consecutive hole_in_one occurences, I'm fine with using any available python library however I believe datatable should be fit for this problem. Hopefully someboby can point me in the right direction on how to deal with this problem.