
I currently have a pandas DataFrame (which I will refer to as `validations`) with the following columns:

| line | orientation | route | validationDate   | cardNumber | stop |
|------|-------------|-------|------------------|------------|------|
| 1    | 2           | 2     | 1994-01-18,18:00 | O219838111 | 2393 |
| 1    | 1           | 1     | 1994-01-18,18:03 | O211233111 | 2400 |
| ...  | ...         | ...   | ...              | ...        | ...  |
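
For reproducibility, the two sample rows above can be rebuilt roughly like this (a sketch; the dtypes are my assumption, and I assume the `validationDate` strings follow the `date,time` format shown in the table):

```python
import pandas as pd

# Reconstruction of the two sample rows shown above (dtypes assumed).
validations = pd.DataFrame(
    {
        "line": [1, 1],
        "orientation": [2, 1],
        "route": [2, 1],
        "validationDate": ["1994-01-18,18:00", "1994-01-18,18:03"],
        "cardNumber": ["O219838111", "O211233111"],
        "stop": [2393, 2400],
    }
)
# Parse the timestamps so that per-day grouping is possible later on.
validations["validationDate"] = pd.to_datetime(
    validations["validationDate"], format="%Y-%m-%d,%H:%M"
)
```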

My goal is to find all validations that are connected, i.e. pairs of entries with the same `cardNumber` that took place on the same day, regardless of whether they share the same line, orientation, bus stop, or route.

My "grouping" skills are a bit limited, so I haven't come up with anything better than one big loop over

```python
itertools.product(validations.iterrows(), validations.iterrows())
```

but, as expected, this is simply too slow.
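
A sketch of the kind of groupby-based approach I imagine should work instead (untested, and assuming `validationDate` has been parsed as a datetime, as in the snippet above):

```python
# Group by card and calendar day, then keep only the groups that contain
# more than one validation, i.e. the "connected" entries.
day = validations["validationDate"].dt.date
connected = validations.groupby(["cardNumber", day]).filter(lambda g: len(g) > 1)
```

This would give all rows that have at least one partner with the same card on the same day, rather than explicit pairs, which might already be enough for my purposes.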

Any ideas?

Thanks in advance!

GSF
    Hi! It would be really helpful if you can edit your question to include sample input and expected output. Please read [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – tomjn Jun 18 '21 at 10:57
  • Please see the following link: [Grouping by multiple columns to find duplicate rows pandas](https://stackoverflow.com/questions/46640945/grouping-by-multiple-columns-to-find-duplicate-rows-pandas) – le_camerone Jun 18 '21 at 11:21

0 Answers