10

I need to get all descendants point of links represented with side_a - side_b (in one dataframe) until reach for each side_a their end_point (in other dataframe). So:

df1:
side_a   side_b
  a        b
  b        c
  c        d
  k        l
  l        m
  l        n
  p        q
  q        r
  r        s

df2:
side_a    end_point
  a          c
  b          c
  c          c
  k          m
  k          n
  l          m
  l          n
  p          s
  q          s
  r          s

The point is to get all points for each side_a value until reach end_point from df2 for that value. If it has two end_point values (like "k" does) that it should be two lists.

I have some code but it's not written with this approach, it drops all rows from df1 if df1['side_a'] == df2['end_points'] and that causes certain problems. But if someone wants me to post the code I will, of course.

The desired output would be something like this:

side_a    end_point
  a          [b, c]
  b          [c]
  c          [c]
  k          [l, m]
  k          [l, n]
  l          [m]
  l          [n]
  p          [q, r, s]
  q          [r, s]
  r          [s]

And one more thing, if there is the same both side, that point doesn't need to be listed at all, I can append it later, whatever it's easier.

import pandas as pd
import numpy as np
import itertools

def get_child_list(df, parent_id):
    list_of_children = []
    list_of_children.append(df[df['side_a'] == parent_id]['side_b'].values)
    for c_, r_ in df[df['side_a'] == parent_id].iterrows():
        if r_['side_b'] != parent_id:
            list_of_children.append(get_child_list(df, r_['side_b']))

    # to flatten the list 
    list_of_children =  [item for sublist in list_of_children for item in sublist]
    return list_of_children

new_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
for index, row in df1.iterrows():
    temp_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
    temp_df['list_of_children'] = pd.Series(get_child_list(df1, row['side_a']))
    temp_df['side_a'] = row['side_a']

    new_df = new_df.append(temp_df)

So, the problem with this code is that works if I drop rows where side_a is equal to end_point from df2. I don't know how to implement condition that if catch the df2 in side_b column, then stop, don't go further.

Any help or hint is welcomed here, truly. Thanks in advance.

jovicbg
  • 1,523
  • 8
  • 34
  • 74
  • 1
    You do of course remember it's not *"please write my code for me"* site? Can you please show us your work? What is the exact problem with your code? – rsm Apr 20 '18 at 22:32
  • @rsm As I said, I can post my code, but it would make a post huge and I don't think it will be used by any of helpers. You could just write a comment that I need to add my code to and I will, just not to be arrogant. – jovicbg Apr 20 '18 at 22:36
  • Can you please show us your work, add any relevant(!) code you have? And explain problem you have with it? If you expect us to come up with an algorithm and implementation to your problem - it's not how this site works. – rsm Apr 20 '18 at 22:44
  • @jovicbg There is a typo in `df2` : the `end_point` of `q` should be `r`, not `s`. I probably have a simple solution for your problem, but its performance is not good on large dataframes. What's the approximate size of your dataframes ? – Qusai Alothman Apr 21 '18 at 19:27
  • @QusaiAlothman Thanks, I have edit it. It was a good ent_point, but I skip one step (q - r). It's not really large, 4000 of rows. – jovicbg Apr 21 '18 at 23:28
  • Maybe I'm missing something, but isn't `d` the endpoint (or "descendant") for `a`, `b`, & `c`? – JohnE Apr 24 '18 at 15:07
  • No, df2['end_point'] show us where the need to end, if there are further links (Side A - Side B), it does not matter, just stop where end_point says. So, C - D would be just skipped. – jovicbg Apr 24 '18 at 18:32
  • @jovicbg How sure are you that there will only ever be one path from start to end, and if there are multiple paths, what should happen? – chthonicdaemon Apr 26 '18 at 14:35

3 Answers3

4

You can use networkx library and graphs:

import networkx as nx
G = nx.from_pandas_edgelist(df, source='side_a',target='side_b')
df2.apply(lambda x: [nx.shortest_path(G, x.side_a,x.end_point)[0],
                     nx.shortest_path(G, x.side_a,x.end_point)[1:]], axis=1)

Output:

  side_a  end_point
0      a     [b, c]
1      b        [c]
2      c         []
3      k     [l, m]
4      k     [l, n]
5      l        [m]
6      l        [n]
7      p  [q, r, s]
8      q     [r, s]
9      r        [s]
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • I have data in numeric type but I have cast to string and now got an erro: ('Either source 7163 or target 1019 is not in G', u'occurred at index 5'). Is there any way to skip if catch error like this, maybe really is not defined the target in df2 for some side_a. – jovicbg Apr 27 '18 at 13:54
  • Hrm... You might need to filter your df2 dataframe first for where side_a and end_point both appear in df. – Scott Boston Apr 27 '18 at 13:58
  • @ScottBoston I did, but doesn't help. Your code works fine, but have no idea where is the problem. – jovicbg May 04 '18 at 13:10
3

Your rules are inconsistent and your definitions are unclear so you may need to add some constraints here and there because it is unclear exactly what you are asking. By organizing the data-structure to fit the problem and building a more robust function for traversal (shown below) it will be easier to add/edit constraints as needed - and solve the problem completely.

Transform the df to a dict to better represent a tree structure

This problem is a lot simpler if you transform the data structure to be more intuitive to the problem, instead of trying to solve the problem in the context of the current structure.

## Example dataframe
df = pd.DataFrame({'side_a':['a','b','c','k','l','l','p','q','r'],'side_b':['b','c','d','l','m','n','q','r','s']})

## Instantiate blank tree with every item
all_items = set(list(df['side_a']) + list(df['side_b']))
tree = {ii : set() for ii in all_items}

## Populate the tree with each row
for idx, row in df.iterrows():
    tree[row['side_a']] =  set(list(tree[row['side_a']]) + list(row['side_b']))

Traverse the Tree

This is much more straightforward now that the data structure is intuitive. Any standard Depth-First-Search algorithm w/ path saving will do the trick. I modified the one in the link to work with this example.

Edit: Reading again it looks you have a condition for search termination in endpoint (you need to be more clear in your question what is input and what is output). You can adjust dfs_path(tree,**target**, root) and change the termination condition to return only the correct paths.

## Standard DFS pathfinder
def dfs_paths(tree, root):
    stack = [(root, [root])]
    while stack:
        (node, path) = stack.pop()
        for nextNode in tree[node] - set(path):
            # Termination condition. 
            ### I set it to terminate search at the end of each path.
            ### You can edit the termination condition to fit the 
            ### constraints of your goal
            if not tree[nextNode]:
                yield set(list(path) + list(nextNode)) - set(root)
            else:
                stack.append((nextNode, path + [nextNode]))
        

Build a dataframe from the generators we yielded

If you're not super comfortable with generators, you can structure the DFS traversal so that it outputs in a list. instead of a generator

set_a = []
end_points = []
gen_dict = [{ii:dfs_paths(tree,ii)} for ii in all_items]
for gen in gen_dict:
    for row in list(gen.values()).pop():
        set_a.append(list(gen.keys()).pop())
        end_points.append(row)
                      
## To dataframe
df_2 = pd.DataFrame({'set_a':set_a,'end_points':end_points}).sort_values('set_a')

Output

df_2[['set_a','end_points']]


set_a   end_points
a       {b, c, d}
b       {c, d}
c       {d}
k       {n, l}
k       {m, l}
l       {n}
l       {m}
p       {s, r, q}
q       {s, r}
r       {s}
Community
  • 1
  • 1
Brendan Frick
  • 1,047
  • 6
  • 19
  • I'm going to try with this later, as soon as possible. Column end_point in df2 represent where iteration for each side_a should stop. So end_point for A should not contain D. In df2 for A end_point is C. Maybe is confusing because I named output column end_point too. I'm sorry. – jovicbg Apr 25 '18 at 10:18
  • The comment i added should fix that. Just add a parameter for endpoint and make the termination condition nextNode == endPoint – Brendan Frick Apr 25 '18 at 13:13
  • Will this work with a numeric datatype, not strings? – jovicbg Apr 27 '18 at 13:30
  • I got an error : KeyErrorTraceback (most recent call last) in () 3 gen_dict = [{ii:dfs_paths(tree,ii)} for ii in all_items] 4 for gen in gen_dict: ----> 5 for row in list(gen.values()).pop(): 6 set_a.append(list(gen.keys()).pop()) 7 end_points.append(row) KeyError: '1' – jovicbg Apr 27 '18 at 13:51
  • I just ran the exact code posted and it output the correct values. Did you name a variable `list` at some point? – Brendan Frick Apr 27 '18 at 15:08
  • No, I did not, now I just got the error '0' for this line ----> for row in list(gen.values()).pop(): . – jovicbg May 03 '18 at 12:45
2

If you're OK with an extra import, this can be posed as a path problem on a graph and solved in a handful of lines using NetworkX:

import networkx

g = networkx.DiGraph(zip(df1.side_a, df1.side_b))

outdf = df2.apply(lambda row: [row.side_a, 
                               set().union(*networkx.all_simple_paths(g, row.side_a, row.end_point)) - {row.side_a}], 
                  axis=1)    

outdf looks like this. Note that this contains sets instead of lists as in your desired output - this allows all the paths to be combined in a simple way.

  side_a  end_point
0      a     {c, b}
1      b        {c}
2      c         {}
3      k     {l, m}
4      k     {l, n}
5      l        {m}
6      l        {n}
7      p  {r, q, s}
8      q     {r, s}
9      r        {s}
chthonicdaemon
  • 19,180
  • 2
  • 52
  • 66