7

I'm trying to solve a problem in Python/Pandas which I think is closely related to the longest path algorithm. The DataFrame I'm working with has the following structure:

import numpy as np 
import pandas as pd

data = {
    "cusID": ["001", "001", "001", "001", "001", "001", "002", "002", "002"],
    "start": ["A", "B", "C", "D", "A", "E", "B", "C", "D"],
    "end": ["B", "C", "D", "A", "E", "A", "C", "D", "E"]
}
df = pd.DataFrame(data)
print(df)

  cusID start end
0   001     A   B
1   001     B   C
2   001     C   D
3   001     D   A
4   001     A   E
5   001     E   A
6   002     B   C
7   002     C   D
8   002     D   E

For each customer, I want to find the longest sequence which does not contain A. For instance, for customer 001, the sequence can be viewed as follows:

A -> B -> C -> D -> A -> E -> A

where B -> C -> D is the longest sequence of size 3.

The resulting DataFrame I'm looking for is the following:

  cusID longestSeq
0   001          3
1   002          4

Although I wasn't able to write much code to solve this problem, here are some of my thoughts: First, it's obvious that I need to group the original DataFrame by cusID to analyse each of the two sequences separately. One idea I had was to apply some function to transform the DataFrame into this format:

  cusID                       seq
0   001     [A, B, C, D, A, E, A]
1   002              [B, C, D, E]

and then working on each list separately, and use some sort of counter to take the maximum of the length of the paths that exclude A. My problem is transcribing that logic into code (if correct). Any help would be appreciated.

glpsx
  • 587
  • 1
  • 7
  • 21
  • What happens if you have a cycle in the middle? For example if you add (D, C) to the second group? – Dani Mesejo Nov 05 '20 at 12:46
  • @DaniMesejo If you change the data so that the last entry of the second group becomes (D, C), then the longest sequence should still be of size 4. – glpsx Nov 05 '20 at 13:06

2 Answers2

5

First step is to normalize the sequences.

seqs = pd.concat([
    df.drop(columns="end").rename(columns={"start":"node"}),
    df.groupby("cusID").tail(1).drop(columns="start").rename(columns={"end":"node"})
])
seqs = seqs.sort_values("cusID", kind="mergesort").reset_index(drop=True)

>>> seqs
   cusID node
0    001    A
1    001    B
2    001    C
3    001    D
4    001    A
5    001    E
6    001    A
7    002    B
8    002    C
9    002    D
10   002    E

Then, using zero_runs we define:

def longest_non_a(seq):
    eqa = seq == "A"
    runs = zero_runs(eqa)
    return (runs[:,1] - runs[:,0]).max()
    
result = seqs.groupby("cusID")["node"].apply(longest_non_a)

>>> result
cusID
001    3
002    4
Name: node, dtype: int64
orlp
  • 112,504
  • 36
  • 218
  • 315
  • It took me the whole afternoon to understand your solution in its entirety. But that was an excellent exercise for me. I've learned quite a lot through the process. Thanks! – glpsx Nov 05 '20 at 16:41
3

As this is a graph problem I suggest you use networkx:

import networkx as nx

data = {
    "cusID": ["001", "001", "001", "001", "001", "001", "002", "002", "002"],
    "start": ["A", "B", "C", "D", "A", "E", "B", "C", "D"],
    "end": ["B", "C", "D", "A", "E", "A", "C", "D", "E"]
}

df = pd.DataFrame(data)


def longest_path(d):
    # create graph from edge list
    dg = nx.convert_matrix.from_pandas_edgelist(d, source="start", target="end", create_using=nx.DiGraph)

    # remove "A" if exists 
    if "A" in dg.nodes:
        dg.remove_node("A")

    # compute the longest path in the graph
    return len(nx.dag.dag_longest_path(dg))


# group-by and compute the longest path
result = df.groupby("cusID").apply(longest_path).reset_index()

print(result)

Output

  cusID  0
0   001  3
1   002  4
Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76
  • This won't work because networkx will assume that if say B occurs twice they are the same node. – orlp Nov 05 '20 at 12:25
  • E.g. if you change the data so that the last entry becomes **D -> B** instead of **D -> E**, `networkx` will complain there's a cycle. – orlp Nov 05 '20 at 12:33