Analysis of changes of categorical variables in the dataframe

Question

I have pivot dataframe with information on what territory many travelers are on their journey to around the world. My pivot df looks something like this:

     Name           Anna         Robert        James
Date               
2018-10-01        Bulgaria       Spain         Croatia
2018-10-02        Portugal       NaN           Portugal  
2018-10-03        Spain          USA           Spain
2018-10-04        USA            USA           Spain
2018-10-05        USA            Canada        USA

There are 100 columns (100 travelers) and 300 days.

Based on such data, how can I explore which routes are the most popular? It can be seen at first glance that all of them came to the USA from Spain. Robert also flew to the USA from Spain only that his flight lasted 2 days. Two of the three presented travelers came to Spain from Portugal so this is also a popular route.

Is there any way to show popular routes using ML algorithms? I will be extremely grateful for any tips.

EDIT: We can assume that the route has 2 nodes, so based on this df Spain-USA is a popular route

Does this answer your question? [Algorithm to find common substring across N strings](https://stackoverflow.com/questions/2418504/algorithm-to-find-common-substring-across-n-strings) — jrook, Dec 27 '19 at 23:00
I think this question belongs on the data science or stats stack exchange, it's extremely interesting though. V.curious to see what solution you could apply. — Umar.H, Dec 27 '19 at 23:05
If you replace each country name with a symbol, your problem will effectively become finding the most common substring (you have not specified the constraint of the length of the common route or their contiguousness). For example, your data becomes: `S1:BPSUU`, `S2:SUUC`, `S3:CPSSU`. For more, refer to the linked question. — jrook, Dec 27 '19 at 23:07
Considering the edit, this question does not really need any complex ML or non-ML algorithms. As the answer to the question suggests, just a brute force search along the routes and storing the results plus a final count will yield the answer. — jrook, Dec 28 '19 at 01:02

KRKirov · Accepted Answer · 2019-12-28T01:33:46.753

1

The simplest thing you can do is to create origin-destination tuples by zipping each user column with its shifted self and to then pass the tuples to a Counter object.

import pandas as pd
from collections import Counter

df.fillna(method='ffill', inplace=True)

# Create a counter object and pass it the origin-destination tuples
counter = Counter()
for col in df.columns:
    routes = list(zip(df[col].shift(1, fill_value=df[col][0]), df[col]))
    routes = [(k, v) for k, v in routes if k != v]
    counter.update(routes)
counter.most_common(3)

Output:

counter.most_common(3)
Out[76]: 
[(('Spain', 'USA'), 3),
 (('Portugal', 'Spain'), 2),
 (('Bulgaria', 'Portugal'), 1)]

edited Dec 28 '19 at 01:33

answered Dec 28 '19 at 00:32

KRKirov

3,854
2
16
20

Since the OP has not really specified what they mean by *route* and if there are any length or other constraints on the routes, I do not think we can assume a route only contains two nodes. What if the most popular route is something like : *Portugal, Spain, US*? – jrook Dec 28 '19 at 00:48
This is a fair comment and indeed the OP may modify their question. However, currently the question discusses Spain-USA, and Portugal-Spain as popular routes. Therefore my current answer is based on the assumption that the length of each route is two. – KRKirov Dec 28 '19 at 00:53
Thank you for your response. I've just updated the OP that we can assume that the route has 2 nodes. KRKirov, after applying your code on the original df as the most popular routes I get pairs of the same countries, e.g. USA-USA, Portugal-Portugal. Do you know how to get rid of it? In general, your solution is great! Thanks! – coding_fan Dec 28 '19 at 01:10
I edited the answer and have added a list comprehension to the code which leaves only tuples with different origin and destination. – KRKirov Dec 28 '19 at 01:35

Analysis of changes of categorical variables in the dataframe

1 Answers1