
I have a use case where I am comparing the lists in the same column with each other; code below:

for i in range(len(counts95)):
    for j in range(i + 1, len(counts95)):
        for x in counts95['links'][i]:
            for y in counts95['links'][j]:
                if x == y and counts95.at[j, 'linksoflinks'] is None:
                    counts95.at[j, 'linksoflinks'] = counts95['index'][i]

The code works, but it is not Pythonic (four nested for loops) and takes a huge amount of time. The main idea is to link records whose list in counts95['links'] shares any element with a preceding row; if so, update the column linksoflinks with the index of that earlier row, but only while linksoflinks is still None (no overwriting).

find the reference table below:

counts95 = pd.DataFrame({'index': [616351, 616352, 616353,6457754], 
                   'level0': [25,30,35,100],
                   'links' : [[1,2,3,4,5],[23,45,2],[1,19,67],[14,15,16]],
                   'linksoflinks' : [None,None,None,None]})

EDIT: New Dataframe

counts95 = pd.DataFrame({'index': [616351, 616352, 616353,6457754,6566666,464664683], 
                   'level0': [25,30,35,100,200,556],
                   'links' : [[1,2,3,4,5],[23,45,2],[1,19,67],[14,15,16],[1,14],[14,1]],
                   'linksoflinks' : [None,None,None,None,None,None]})

Desired output:

       index  level0            links  linksoflinks
0     616351      25  [1, 2, 3, 4, 5]           NaN
1     616352      30      [23, 45, 2]      616351.0
2     616353      35      [1, 19, 67]      616351.0
3    6457754     100     [14, 15, 16]           NaN
4    6566666     200          [1, 14]      616351.0
5  464664683     556          [14, 1]      616351.0
Rishi

4 Answers


Your desired output uses different values and a different column name compared to your sample dataframe constructor. I used your desired output dataframe for testing.

Logic:
For each sublist of links, we need to find the row index (meaning the index of the dataframe, NOT the column index) of the first overlapping sublist. We will use these row indices with .loc on counts95 to get the corresponding values of column index. To achieve this goal we need several steps:

  • Compare each sublist to all sublists in links. A list comprehension is fast and efficient for this task. We use one to create a boolean 2D-mask array where each subarray contains True for overlapping rows and False for non-overlapping ones (look at the step-by-step on this 2D-mask and check it against column links and you will see it more clearly)
  • We only want to compare from the top down to the current sublist. I.e., standing at the current row, we only compare backward toward the top. Therefore, we need to set every forward comparison to False. This is the functionality of np.tril
  • Inside each subarray of this 2D-mask, the position/index of a True is the row index of a row that the current sublist overlaps. We need to find the position of the first True; that is the functionality of np.argmax, which returns the position/index of the first maximal element of the array. True counts as 1 and False as 0, so on any subarray containing a True it correctly returns the first overlapping row index. However, on an all-False subarray it returns 0. We handle all-False subarrays later with where
  • After np.argmax, the 2D-mask is reduced to a 1D-mask. Each element of this 1D-mask is the row index of the overlapping sublist. Passing it to .loc gives the corresponding values of column index. However, the result also wrongly includes rows where the subarray of the 2D-mask was all False. We want those rows turned to NaN; that is the functionality of .where

Method 1:
Use a list comprehension to construct the boolean 2D-mask m between each list of links and all the lists in links. We only need backward comparisons, so use np.tril to crush the upper-right triangle of the mask, which represents the forward comparisons, to all False. Finally, call np.argmax to get the position of the first True in each row of m, chaining where to turn the all-False rows of m into NaN

c95_list = counts95.links.tolist()
m = np.tril([[any(x in l2 for x in l1) for l2 in c95_list] for l1 in c95_list],-1)
counts95['linkoflist'] = (counts95.loc[np.argmax(m, axis=1), 'index']
                                  .where(m.any(1)).to_numpy())

 Out[351]:
       index  level0            links  linkoflist
0     616351      25  [1, 2, 3, 4, 5]         NaN
1     616352      30      [23, 45, 2]    616351.0
2     616353      35      [1, 19, 67]    616351.0
3    6457754     100     [14, 15, 16]         NaN
4    6566666     200          [1, 14]    616351.0
5  464664683     556          [14, 1]    616351.0

Method 2:
If your dataframe is big, comparing each sublist only to the part of links above it makes things faster. It is probably about 2x faster than method 1 on a big dataframe.

c95_list = counts95.links.tolist()
m = [[any(x in l2 for x in l1) for l2 in c95_list[:i]] for i,l1 in enumerate(c95_list)]
counts95['linkoflist'] = counts95.reindex([np.argmax(y) if any(y) else np.nan 
                                                   for y in m])['index'].to_numpy()

Step by Step(method 1)

m = np.tril([[any(x in l2 for x in l1) for l2 in c95_list] for l1 in c95_list],-1)

Out[353]:
array([[False, False, False, False, False, False],
       [ True, False, False, False, False, False],
       [ True, False, False, False, False, False],
       [False, False, False, False, False, False],
       [ True, False,  True,  True, False, False],
       [ True, False,  True,  True,  True, False]])

argmax returns the position of the first True, but for an all-False row it returns 0 as well.

In [354]: np.argmax(m, axis=1)
Out[354]: array([0, 0, 0, 0, 0, 0], dtype=int64)

Slicing using the result of argmax

counts95.loc[np.argmax(m, axis=1), 'index']

Out[355]:
0    616351
0    616351
0    616351
0    616351
0    616351
0    616351
Name: index, dtype: int64

Chain where to turn rows corresponding to all False from m to NaN

counts95.loc[np.argmax(m, axis=1), 'index'].where(m.any(1))

Out[356]:
0         NaN
0    616351.0
0    616351.0
0         NaN
0    616351.0
0    616351.0
Name: index, dtype: float64

Finally, the index of the output is different from the index of counts95, so just call to_numpy to get the ndarray to assign to the column linkoflist of counts95.

Andy L.
  • this got my loop down from 3 hours to 6 minutes for a total of 15000 rows. I know you have given a good explanation here, can you please add your overall logic for this to better understand the code, I would highly appreciate that and mark the answer as correct. – Rishi Mar 05 '20 at 16:58
  • @Rishi: I added detail logic to the answer. I hope it helps :) – Andy L. Mar 05 '20 at 18:31

A good pattern is to use the proper data structure for your task. The best choice for answering the question «is element X present in sequence Y» is the built-in set. When your sets are immutable, consider using frozenset.
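As a minimal sketch (using values borrowed from the question's frame), the overlap check this approach relies on is just a set intersection:

```python
# frozenset supports O(1) average membership tests and cheap intersections;
# being immutable, it is also hashable (usable as a dict key or set element)
a = frozenset([1, 2, 3, 4, 5])
b = frozenset([1, 19, 67])
c = frozenset([14, 15, 16])

print(bool(a & b))  # True  -> the rows share an element
print(bool(a & c))  # False -> no common element
```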

Solution

Here is how I would solve the problem in a pythonic way:

# necessary imports
from collections import defaultdict
from typing import DefaultDict, FrozenSet, List, Tuple

# initialise the links "mapping": for every index save frozenset of its links
links: Tuple[Tuple[int, FrozenSet[int]], ...] = (
    # a tuple of tuples is like a dict but also lets you iterate by index
    (616351, frozenset((1, 2, 3, 4, 5))),
    (616352, frozenset((23, 45, 2))),
    (616353, frozenset((1, 19, 67))),
    (6457754, frozenset((14, 15, 16))),
)

# defaultdict automatically creates new lists
#   as you access its keys which are not yet present
links_of_links: DefaultDict[int, List[int]] = defaultdict(list)

for i, item in enumerate(links):
    key, values = item  # split tuple into individual elements
    next_rows = links[i+1:]  # we will iterate over succeeding rows
    for next_key, next_values in next_rows:
        # here we check sets intersection:
        #   it is non-empty if any common elements are present
        if values & next_values:
            # though key might not be present in links_of_links,
            #   defaultdict will autocreate a new empty list
            links_of_links[key].append(next_key)

Contents of links_of_links: defaultdict(<class 'list'>, {616351: [616352, 616353]})

Complexity

Let's now compare the complexity of your solution and mine to show the latter is more efficient. Assume N is the number of rows and L is some measure of the length of the links lists (average or maximum, it does not really matter). Your solution compares roughly all pairs of rows, which gives us O(N²). This is then multiplied by the complexity of naively comparing two lists, O(L²). That gives O(N² · L²) in total.

The proposed solution still cross-joins all the rows, so the N² factor stays with us. But now we compare the sets themselves in a more efficient way: intersection costs O(min(|s|, |t|)) = O(L) on average, as the Python Time Complexity page says. So the overall complexity is divided by a single factor of L, giving O(N² · L) in total.
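To make the difference concrete, here is a small sketch contrasting the two overlap tests: both give the same boolean answer, but the naive version performs up to len(l1) * len(l2) element comparisons while the set version does a single hash-based intersection:

```python
l1 = [1, 2, 3, 4, 5]
l2 = [1, 19, 67]

# naive pairwise comparison: O(L * L) element checks
naive = any(x == y for x in l1 for y in l2)

# set intersection: O(min(len(l1), len(l2))) on average
fast = bool(set(l1) & set(l2))

print(naive, fast)  # True True
```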

Anton Bryzgalov
  • This gives the desired output but how can I change my preexisting Dataframe to use frozensets? – Rishi Mar 02 '20 at 22:37
  • @Rishi it seems to me that there is no problem using frozensets; pandas preserves the type: `pd.DataFrame({'x': frozenset([1, 2])}).iloc[0]['x']` is `frozenset({1, 2})` – Anton Bryzgalov Mar 02 '20 at 22:49
  • with the updated data set in the question, this is not giving the desired output, I have also added a desired output to the Question for reference – Rishi Mar 04 '20 at 20:39

Using explode, duplicated, and .map to assign the shared link values, but only to the later occurrences.

# one row per element of each list in 'links'
df = counts95.explode('links')

# for every link value that occurs more than once,
# remember the 'index' of its first occurrence
m = df[df.duplicated(subset=['links'], keep=False)].groupby('links')['index'].first()

# map that first-occurrence index onto the later duplicates only
df['link_above'] = df['links'].loc[df.duplicated(subset='links', keep='first')].map(m)

# collapse back to one row per original record
re_made_df = df.groupby(["index", "level0"]).agg(
    links=("links", list), linkoflist=("link_above", "first")).reset_index()

print(re_made_df)


     index  level0            links  linkoflist
0   616351      25  [1, 2, 3, 4, 5]         NaN
1   616352      30      [23, 45, 2]    616351.0
2   616353      35      [1, 19, 67]    616351.0
3  6457754     100     [14, 15, 16]         NaN
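In case a smaller example helps, here is a sketch of what the first two steps produce on a made-up two-row frame (not the question's data):

```python
import pandas as pd

tiny = pd.DataFrame({'index': [616351, 616352],
                     'links': [[1, 2], [2, 3]]})

# explode: one row per list element, the other columns repeated
df = tiny.explode('links')

# duplicated(keep=False) flags every occurrence of a shared link value;
# groupby(...).first() keeps the earliest 'index' for each shared value
m = df[df.duplicated(subset=['links'], keep=False)].groupby('links')['index'].first()
print(m)  # link value 2 -> 616351
```

From there, `map(m)` fills `link_above` only for the rows that `duplicated(keep='first')` marks as later occurrences.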
Umar.H
  • what is link_above here? – Rishi Mar 03 '20 at 02:24
  • It's your column, I just used a dif name before recreating it @rishi – Umar.H Mar 03 '20 at 06:33
  • Can you explain this code a bit please, like a step by step process of what you are doing here, that could help me a lot. I see this working but I have to understand it, rather than just copy-pasting it. – Rishi Mar 04 '20 at 18:39
  • @Rishi just in the middle of a project but will do, did it work with your solution? for now just print each line by line to see whats happening and read the docs for `explode` and `map` i'll get back to you soon – Umar.H Mar 04 '20 at 18:40
  • I changed the link_above to link, but the code doesn't work as reset_index() gives an error – Rishi Mar 04 '20 at 18:51
  • Sorry @Rishi like a fool I copied in the code incorrectly..! can you re-check? – Umar.H Mar 04 '20 at 19:19
  • thanks for your help, but there is one problem: if there are repeating lists with two different indexes it gives an error, "cannot reindex from a duplicate axis". I am editing the question to add 2 more rows so the error can be recreated. – Rishi Mar 04 '20 at 19:46

Just another alternative, one that lets you manipulate the data more along the way;

Code

import pandas as pd

counts95 = pd.DataFrame({'index': [616351, 616352, 616353,6457754,6566666,464664683], 
                   'level0': [25,30,35,100,200,556],
                   'links' : [[1,2,3,4,5],[23,45,2],[1,19,67],[14,15,16],[1,14],[14,1]],
                   'linksoflinks' : [None,None,None,None,None,None]})

def has_match(ar1, ar2):
    return bool(set(ar1).intersection(ar2))

def set_linksoflinks(df):
    for i, row in df.iterrows():
        j = i+1
        while j<df.shape[0]:
            check = has_match(row['links'], df.loc[j, 'links'])
            if check and not df.loc[j, 'linksoflinks']:
                df.loc[j, 'linksoflinks'] = row['index']
            j+=1
    return df.copy()

df = set_linksoflinks(counts95)

print(df)

Output

       index  level0            links linksoflinks
0     616351      25  [1, 2, 3, 4, 5]         None
1     616352      30      [23, 45, 2]       616351
2     616353      35      [1, 19, 67]       616351
3    6457754     100     [14, 15, 16]         None
4    6566666     200          [1, 14]       616351
5  464664683     556          [14, 1]       616351
null
  • this is not the intended output, as the row 4 and 5 should have 616351, this is where we are marking duplicates with one value only – Rishi Mar 05 '20 at 22:14
  • Sorry, forgot to put the null check :). See edited answer. – null Mar 06 '20 at 07:18