Your desired output uses different values and column names compared to your sample dataframe constructor, so I use your desired output dataframe for testing.
Logic:
For each sublist of links, we need to find the row index (I mean the index of the dataframe, NOT the column named index) of the first overlapped sublist. We will use these row indices to slice counts95 with .loc to get the corresponding values of column index. To achieve this goal we need several steps:
- Compare each sublist to all sublists in links. A list comprehension is a fast and efficient way to do this. We need to code a list comprehension to create a boolean 2D-mask array where each subarray contains True values for overlapped rows and False for non-overlapped ones (look at the step-by-step on this 2D-mask and check it against column links, and it will be clearer).
- We want to compare from the top down to the current sublist, i.e. standing at the current row, we only want to compare backward toward the top. Therefore, we need to set every forward comparison to False. This is what np.tril does (with k=-1 it also drops the comparison of a sublist with itself).
- Inside each subarray of this 2D-mask, the position/index of a True is the row index of a row that the current sublist overlaps. We need to find the position of the first True, which is what np.argmax does. np.argmax returns the position/index of the first max element of the array. True is treated as 1 and False as 0. Therefore, on any subarray having a True, it correctly returns the first overlapped row index. However, on an all-False subarray it returns 0. We will handle all-False subarrays later with where.
- After np.argmax, the 2D-mask is reduced to a 1D array. Each element of this 1D array is the row index of the overlapped sublist. Pass it to .loc to get the corresponding values of column index. However, the result also wrongly includes rows where the subarray of the 2D-mask is all False. We want these rows turned into NaN, which is what .where does.
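To make the steps above concrete, here is a minimal sketch on a toy list. The name toy_links and its values are made up for illustration and are not from your data; the calls are the same np.tril, np.argmax and any used in the steps.

```python
import numpy as np

# Hypothetical toy data standing in for the links column.
toy_links = [[1, 2], [3], [3, 4], [2, 9]]

# full[i][j] is True when sublist i shares at least one element with sublist j.
full = np.array([[any(x in l2 for x in l1) for l2 in toy_links]
                 for l1 in toy_links])

# Keep only comparisons against earlier rows (strictly below the diagonal).
mask = np.tril(full, -1)

# Position of the first True in each row; an all-False row also gives 0.
first_hit = np.argmax(mask, axis=1)

# True only for rows that really overlap an earlier row.
has_hit = mask.any(axis=1)

print(mask)
print(first_hit)  # [0 0 1 0]
print(has_hit)    # [False False  True  True]
```

Rows 0 and 1 get 0 from argmax only because they are all False, while row 3 gets 0 because it really overlaps row 0. That is why has_hit is needed to tell the two cases apart.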
Method 1:
Use a list comprehension to construct the boolean 2D-mask m between each list of links and all the lists in links. We only need backward comparisons, so use np.tril to set the upper-right triangle of the mask, which represents forward comparisons, to all False. Finally, call np.argmax to get the position of the first True in each row of m, and chain where to turn the all-False rows of m into NaN.
import numpy as np

c95_list = counts95.links.tolist()
# m[i][j] is True when sublist i shares an element with an earlier sublist j
m = np.tril([[any(x in l2 for x in l1) for l2 in c95_list] for l1 in c95_list], -1)
counts95['linkoflist'] = (counts95.loc[np.argmax(m, axis=1), 'index']
                                  .where(m.any(axis=1)).to_numpy())
Out[351]:
index level0 links linkoflist
0 616351 25 [1, 2, 3, 4, 5] NaN
1 616352 30 [23, 45, 2] 616351.0
2 616353 35 [1, 19, 67] 616351.0
3 6457754 100 [14, 15, 16] NaN
4 6566666 200 [1, 14] 616351.0
5 6457754 556 [14, 1] 616351.0
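If you want to run method 1 end to end, a self-contained version could look like the sketch below. The constructor is my reconstruction from the output table above (it is not shown in the answer), so treat the column dtypes as an assumption.

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the printed output above; linkoflist is the column we compute.
counts95 = pd.DataFrame({
    'index': [616351, 616352, 616353, 6457754, 6566666, 6457754],
    'level0': [25, 30, 35, 100, 200, 556],
    'links': [[1, 2, 3, 4, 5], [23, 45, 2], [1, 19, 67],
              [14, 15, 16], [1, 14], [14, 1]],
})

c95_list = counts95['links'].tolist()

# Backward-only overlap mask: row i vs. every earlier row j < i.
m = np.tril([[any(x in l2 for x in l1) for l2 in c95_list]
             for l1 in c95_list], -1)

# Look up the first overlapped row, then blank out rows with no overlap at all.
counts95['linkoflist'] = (counts95.loc[np.argmax(m, axis=1), 'index']
                                  .where(m.any(axis=1)).to_numpy())
print(counts95)
```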
Method 2:
If your dataframe is big, comparing each sublist only to the part of links above it makes it faster. It is probably about 2x faster than method 1 on a big dataframe.
c95_list = counts95.links.tolist()
m = [[any(x in l2 for x in l1) for l2 in c95_list[:i]] for i,l1 in enumerate(c95_list)]
counts95['linkoflist'] = counts95.reindex([np.argmax(y) if any(y) else np.nan
for y in m])['index'].to_numpy()
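To check that both methods agree and to time them on your own machine, something like the following sketch may help. The wrappers method1 and method2, the synthetic frame big and its size are all hypothetical names and values I picked for illustration; the timings will depend on your data, so none are shown here.

```python
import timeit
import numpy as np
import pandas as pd

def method1(df):
    # Full pairwise mask, then keep only the strictly lower triangle.
    lists = df['links'].tolist()
    m = np.tril([[any(x in l2 for x in l1) for l2 in lists] for l1 in lists], -1)
    return df.loc[np.argmax(m, axis=1), 'index'].where(m.any(axis=1)).to_numpy()

def method2(df):
    # Compare each sublist only against the sublists above it.
    lists = df['links'].tolist()
    m = [[any(x in l2 for x in l1) for l2 in lists[:i]]
         for i, l1 in enumerate(lists)]
    labels = [np.argmax(y) if any(y) else np.nan for y in m]
    return df.reindex(labels)['index'].to_numpy()

# Synthetic data (assumption): 300 rows of 3 random integers each.
rng = np.random.default_rng(0)
n = 300
big = pd.DataFrame({
    'index': np.arange(n),
    'links': [rng.integers(0, 5000, size=3).tolist() for _ in range(n)],
})

# Both methods should produce the same values, NaN positions included.
same = np.array_equal(method1(big), method2(big), equal_nan=True)
print('same result:', same)

print('method 1:', timeit.timeit(lambda: method1(big), number=3))
print('method 2:', timeit.timeit(lambda: method2(big), number=3))
```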
Step by step (method 1)
m = np.tril([[any(x in l2 for x in l1) for l2 in c95_list] for l1 in c95_list],-1)
Out[353]:
array([[False, False, False, False, False, False],
[ True, False, False, False, False, False],
[ True, False, False, False, False, False],
[False, False, False, False, False, False],
[ True, False, True, True, False, False],
[ True, False, True, True, True, False]])
argmax returns 0 both for a row whose first True is at position 0 and for an all-False row (where 0 is the position of its first False), so here every row gives 0.
In [354]: np.argmax(m, axis=1)
Out[354]: array([0, 0, 0, 0, 0, 0], dtype=int64)
Slicing using the result of argmax
counts95.loc[np.argmax(m, axis=1), 'index']
Out[355]:
0 616351
0 616351
0 616351
0 616351
0 616351
0 616351
Name: index, dtype: int64
Chain where to turn the rows corresponding to all-False rows of m into NaN
counts95.loc[np.argmax(m, axis=1), 'index'].where(m.any(1))
Out[356]:
0 NaN
0 616351.0
0 616351.0
0 NaN
0 616351.0
0 616351.0
Name: index, dtype: float64
Finally, the index of the output is different from the index of counts95, so just call to_numpy to get the ndarray to assign to the column linkoflist of counts95.
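To see why to_numpy matters here, consider this small sketch. The two-row frame df and the column names aligned and positional are hypothetical, used only to show the alignment behaviour. Assigning a Series makes pandas align on the index, and with duplicated labels that typically raises a ValueError, while an ndarray is assigned by position.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame, only to illustrate index alignment on assignment.
df = pd.DataFrame({'index': [616351, 616352]})

# Selecting label 0 twice gives a Series whose own index is [0, 0].
picked = df.loc[[0, 0], 'index']
print(picked.index.tolist())

try:
    # Series assignment aligns on the index; duplicated labels cannot be aligned.
    df['aligned'] = picked
except ValueError as exc:
    print('alignment failed:', exc)

# An ndarray carries no index, so values are assigned row by row.
df['positional'] = picked.to_numpy()
print(df)
```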