0

I need to count the number of consecutive occurences where hole_in_one equals one in a very large dataset. As I cannot iterate I'm not sure how to approach this problem. Some more background info ...

I'm working with an very large dataset (> 1e9 rows), which basically contains the following:

| user_id | timestamp | hole_in_one |
+---------+-----------+-------------+
| 100     | 1000      | 0           |
| 100     | 1100      | 0           |
| 100     | 1204      | 1           |
| 100     | 3000      | 1           |
| 100     | 4000      | 1           |
| 200     | 1001      | 0           |
| 200     | 2000      | 0           |
| 200     | 3000      | 1           |
| 200     | 4000      | 0           |
...

This data is stored in a .csv file, I'm using the python datatable library to read in this data.

train_df = dt.fread("data.csv")

To the dataset I want to add a column which counts the number of consecutive hole_in_one occurences (count_streak) for each user. E.g.:

| user_id | timestamp | hole_in_one | count_streak |
+---------+-----------+-------------+--------------+
| 100     | 1000      | 0           | 0
| 100     | 1100      | 0           | 0
| 100     | 1204      | 1           | 0
| 100     | 3000      | 1           | 1
| 100     | 4000      | 1           | 2
| 200     | 1001      | 1           | 0
| 200     | 2000      | 0           | 0
| 200     | 3000      | 1           | 0
| 200     | 4000      | 0           | 0
...

I know how to do basic operations (without iterating), e.g. filtering data, summing values etc. adding a column and such (note: datatable is similar to R programming language). However as I'm dealing with a very large dataset, I cannot simply iterate over the rows to count the number of consecutive hole_in_one occurences, I'm fine with using any available python library however I believe datatable should be fit for this problem. Hopefully someboby can point me in the right direction on how to deal with this problem.

Gio
  • 3,242
  • 1
  • 25
  • 53
  • Is `user_id` column sorted? So there cannot be `user_id=100` after `user_id=200`? – Andrej Kesely Oct 15 '20 at 21:48
  • Yes the data is sorted by `user_id` first, and secondly by `timestamp` for each `user_id`. I will update the example to make that a bit more clear. – Gio Oct 15 '20 at 21:49
  • If you aren't married to the idea of using `datatable` and will consider `pandas`, take a look at https://stackoverflow.com/a/27626699/843953 – Pranav Hosangadi Oct 15 '20 at 21:59
  • I'm not married to `datatable` indeed, however with `pandas` I cannot even load the dataset as it's way to big (I get an out of memory error on my 32GB ram machine). – Gio Oct 15 '20 at 22:05
  • Although I have to add that I could perhaps do some batch processing, e.g. split data per `user_id`, convert to pandas dataframe with `datatable.Frame.to_pandas`, however I wonder how much performance I would loose there. Nevertheless it's worth trying if no better / more straight forward option is available. – Gio Oct 15 '20 at 22:09

0 Answers0