I'm trying to solve a problem in Python/Pandas which I think is closely related to the longest path algorithm. The DataFrame I'm working with has the following structure:
import numpy as np
import pandas as pd
data = {
"cusID": ["001", "001", "001", "001", "001", "001", "002", "002", "002"],
"start": ["A", "B", "C", "D", "A", "E", "B", "C", "D"],
"end": ["B", "C", "D", "A", "E", "A", "C", "D", "E"]
}
df = pd.DataFrame(data)
print(df)
cusID start end
0 001 A B
1 001 B C
2 001 C D
3 001 D A
4 001 A E
5 001 E A
6 002 B C
7 002 C D
8 002 D E
For each customer, I want to find the longest sequence which does not contain A. For instance, for customer 001, the sequence can be viewed as follows:
A -> B -> C -> D -> A -> E -> A
where B -> C -> D is the longest sequence of size 3.
The resulting DataFrame I'm looking for is the following:
cusID longestSeq
0 001 3
1 002 4
Although I wasn't able to write much code to solve this problem, here are some of my thoughts: First, it's obvious that I need to group the original DataFrame by cusID to analyse each of the two sequences separately. One idea I had was to apply some function to transform the DataFrame into this format:
cusID seq
0 001 [A, B, C, D, A, E, A]
1 002 [B, C, D, E]
and then working on each list separately, and use some sort of counter to take the maximum of the length of the paths that exclude A. My problem is transcribing that logic into code (if correct). Any help would be appreciated.