I have a DataFrame in which each row holds a list of values.

id     list_of_value
0      ['a','b','c']
1      ['d','b','c']
2      ['a','b','c']
3      ['a','b','c']

I have to calculate a score between one row and every other row.

For example:

Step 1: Take the value of id 0: ['a','b','c'].
Step 2: Find the intersection between id 0 and id 1:
        resultant = ['b','c']
Step 3: Score calculation => resultant.size / id.size

Repeat steps 2 and 3 between id 0 and ids 1, 2, 3, and similarly for all the other ids.

Then create an N x N DataFrame such as this:

-  0  1    2  3
0  1  0.6  1  1
1  1  1    1  1 
2  1  1    1  1
3  1  1    1  1
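
The score for a single pair of rows, as described in the steps above, could be sketched like this. The function name overlap_score is hypothetical, and the sketch assumes that the denominator is the size of the query list and that duplicates within a list are ignored:

```python
def overlap_score(query, other):
    """Share of the distinct values in `query` that also occur in `other`."""
    query_set, other_set = set(query), set(other)
    # Step 2: intersection; Step 3: divide by the size of the query list.
    return len(query_set & other_set) / len(query_set)


# id 0 against id 1 from the sample data: two of three values are shared.
score = overlap_score(['a', 'b', 'c'], ['d', 'b', 'c'])
print(round(score, 2))
```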

Right now my code has just one for loop:

import numpy as np
import pandas as pd

def scoreCalc(x, queryTData):
    # share of the query list's values that also appear in x
    commonTData = np.intersect1d(np.array(x), queryTData)
    return commonTData.size / queryTData.size

ids = list(df['id'])
dfSim = pd.DataFrame()

for indexQFID in range(len(ids)):
    # list belonging to the current query id
    queryTData = np.array(df.loc[df['id'] == ids[indexQFID], 'list_of_value'].values.tolist())

    # one column per query id, holding its score against every row
    dfSim[ids[indexQFID]] = df['list_of_value'].apply(scoreCalc, args=(queryTData,))

Is there a better way to do this? Can I write a single apply call instead of iterating with a for loop? Can I make it faster?

8 Answers

If your data is not too big, you can use get_dummies to encode the values and then do a matrix multiplication:

# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
s = pd.get_dummies(df.list_of_value.explode()).groupby(level=0).sum()
s.dot(s.T).div(s.sum(1))

Output:

          0         1         2         3
0  1.000000  0.666667  1.000000  1.000000
1  0.666667  1.000000  0.666667  0.666667
2  1.000000  0.666667  1.000000  1.000000
3  1.000000  0.666667  1.000000  1.000000

Update: Here's a short explanation of the code. The main idea is to turn the given lists into a one-hot encoding:

   a  b  c  d
0  1  1  1  0
1  0  1  1  1
2  1  1  1  0
3  1  1  1  0

Once we have that, the size of the intersection of two rows, say 0 and 1, is just their dot product, because a character belongs to both rows if and only if it is represented by 1 in both.
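
As a small check of the dot-product claim, here is a sketch using the encoded rows 0 and 1 from the table above. The variable names are only illustrative:

```python
import numpy as np

# Columns stand for the values a, b, c, d.
row0 = np.array([1, 1, 1, 0])  # encodes ['a', 'b', 'c']
row1 = np.array([0, 1, 1, 1])  # encodes ['d', 'b', 'c']

# A position contributes 1 to the dot product only when both rows have a 1.
shared = int(row0 @ row1)

# The same count obtained directly from the sets.
set_size = len({'a', 'b', 'c'} & {'d', 'b', 'c'})
print(shared, set_size)
```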

With that in mind, first use

df.list_of_value.explode()

to turn each cell into a series and concatenate all of those series. Output:

0    a
0    b
0    c
1    d
1    b
1    c
2    a
2    b
2    c
3    a
3    b
3    c
Name: list_of_value, dtype: object

Now, we use pd.get_dummies on that series to turn it into a one-hot-encoded dataframe:

   a  b  c  d
0  1  0  0  0
0  0  1  0  0
0  0  0  1  0
1  0  0  0  1
1  0  1  0  0
1  0  0  1  0
2  1  0  0  0
2  0  1  0  0
2  0  0  1  0
3  1  0  0  0
3  0  1  0  0
3  0  0  1  0

As you can see, each value has its own row. Since we want to combine the rows that belong to the same original row into one, we can just sum them by the original index. Thus

s = pd.get_dummies(df.list_of_value.explode()).groupby(level=0).sum()

gives the binary-encoded dataframe we want. The next line

s.dot(s.T).div(s.sum(1))

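follows your logic: s.dot(s.T) computes the dot product of every pair of rows, then .div(s.sum(1)) divides those counts by the list sizes. Note that div aligns a Series with the columns by default, so entry (i, j) is divided by the size of list j; this makes no difference here because all lists have the same length. Pass axis=0 to divide each row by its own list size instead.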
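
For reference, a self-contained sketch of the whole approach on lists of unequal length, where the direction of the division matters. It assumes a pandas version in which sum(level=0) is not available, so it uses groupby(level=0).sum() instead; the sample data and the names counts, sizes, by_row and by_col are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 2],
    'list_of_value': [['a', 'b', 'c'], ['b', 'c'], ['a', 'd']],
})

# One row per original row, one 0/1 column per distinct value.
s = pd.get_dummies(df['list_of_value'].explode(), dtype=int).groupby(level=0).sum()

counts = s.dot(s.T)      # intersection sizes for every pair of rows
sizes = s.sum(axis=1)    # number of distinct values per row

by_row = counts.div(sizes, axis=0)  # entry (i, j) = |i & j| / |i|
by_col = counts.div(sizes)          # entry (i, j) = |i & j| / |j|
print(by_row)
```

With equal-length lists, as in the question, both variants give the same matrix.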

Quang Hoang

Try this

ids = list(df['id'])
range_of_ids = range(len(ids))

def score_calculation(s_id1,s_id2):
    s1 = set(list(df.loc[df['id'] == ids[s_id1]]['list_of_value'])[0])
    s2 = set(list(df.loc[df['id'] == ids[s_id2]]['list_of_value'])[0])
    # Resultant calculation s1&s2
    return round(len(s1&s2)/len(s1) , 2)


dic = {indexQFID:  [score_calculation(indexQFID,ind) for ind in range_of_ids] for indexQFID in range_of_ids}
dfSim = pd.DataFrame(dic)
print(dfSim)

Output

     0        1      2       3
0   1.00    0.67    1.00    1.00
1   0.67    1.00    0.67    0.67
2   1.00    0.67    1.00    1.00
3   1.00    0.67    1.00    1.00

You can also do it as follows:

dic = {indexQFID:  [round(len(set(s1)&set(s2))/len(s1) , 2) for s2 in df['list_of_value']] for indexQFID,s1 in zip(df['id'],df['list_of_value']) }
dfSim = pd.DataFrame(dic)
print(dfSim)
FAHAD SIDDIQUI

Convert each list to a set (s_list) and use a nested list comprehension over those sets. Inside the comprehension, the intersection operator & finds the overlap of each pair, and len gives its size. Finally, construct the dataframe and divide each row by the length of the corresponding list in df.list_of_value.

s_list =  df.list_of_value.map(set)
overlap = [[len(s1 & s) for s1 in s_list] for s in s_list]

df_final = pd.DataFrame(overlap) / df.list_of_value.str.len().to_numpy()[:,None]

Out[76]:
          0         1         2         3
0  1.000000  0.666667  1.000000  1.000000
1  0.666667  1.000000  0.666667  0.666667
2  1.000000  0.666667  1.000000  1.000000
3  1.000000  0.666667  1.000000  1.000000

If the lists can contain duplicate values, use collections.Counter instead of set. For this example I changed the sample data for id=0 to ['a','a','c'] and for id=1 to ['d','b','a']:

sample df:
id     list_of_value
0      ['a','a','c'] #changed
1      ['d','b','a'] #changed
2      ['a','b','c']
3      ['a','b','c']

from collections import Counter

c_list =  df.list_of_value.map(Counter)
c_overlap = [[sum((c1 & c).values()) for c1 in c_list] for c in c_list]

df_final = pd.DataFrame(c_overlap) / df.list_of_value.str.len().to_numpy()[:,None]


 Out[208]:
          0         1         2         3
0  1.000000  0.333333  0.666667  0.666667
1  0.333333  1.000000  0.666667  0.666667
2  0.666667  0.666667  1.000000  1.000000
3  0.666667  0.666667  1.000000  1.000000
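
To see why Counter and set can differ, here is a minimal sketch with two made-up lists that share a repeated value. Counter's & keeps the minimum count of each common key, so summing the values counts repeated matches:

```python
from collections import Counter

a = ['a', 'a', 'c']
b = ['a', 'a', 'b']

# Set intersection only sees distinct values: {'a'}.
set_overlap = len(set(a) & set(b))

# Counter intersection keeps min(count_a, count_b) per key: {'a': 2}.
multiset_overlap = sum((Counter(a) & Counter(b)).values())

print(set_overlap, multiset_overlap)
```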
Andy L.

Updated

Since there are a lot of candidate solutions proposed, it seems like a good idea to do a timing analysis. I generated some random data with 12k rows as requested by the OP, keeping with the 3 elements per set but expanding the size of the alphabet available to populate the sets. This can be adjusted to match the actual data.

Let me know if you have a solution that you would like tested or updated.

Setup

import pandas as pd
import random

ALPHABET = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

def random_letters(n, n_letters=52):
    return random.sample(ALPHABET[:n_letters], n)

# Create 12k rows to test scaling.
df = pd.DataFrame([{'id': i, 'list_of_value': random_letters(3)} for i in range(12000)])

Current Winner

def method_quang(df):
    # groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0
    s = pd.get_dummies(df.list_of_value.explode()).groupby(level=0).sum()
    return s.dot(s.T).div(s.sum(1))

%time method_quang(df)
# CPU times: user 10.5 s, sys: 828 ms, total: 11.3 s
# Wall time: 11.3 s
# ...
# [12000 rows x 12000 columns]

Contenders

def method_mcskinner(df):
    explode_df = df.set_index('id').list_of_value.explode().reset_index() 
    explode_df = explode_df.rename(columns={'list_of_value': 'value'}) 
    denom_df = explode_df.groupby('id').size().reset_index(name='denom') 
    numer_df = explode_df.merge(explode_df, on='value', suffixes=['', '_y']) 
    numer_df = numer_df.groupby(['id', 'id_y']).size().reset_index(name='numer') 
    calc_df = numer_df.merge(denom_df, on='id') 
    calc_df['score'] = calc_df['numer'] / calc_df['denom'] 
    return calc_df.pivot(index='id', columns='id_y', values='score').fillna(0)

%time method_mcskinner(df)
# CPU times: user 29.2 s, sys: 9.66 s, total: 38.9 s
# Wall time: 29.6 s
# ...
# [12000 rows x 12000 columns]

def method_rishab(df):
    vals = [[len(set(val1) & set(val2)) / len(val1) for val2 in df['list_of_value']] for val1 in df['list_of_value']]
    return pd.DataFrame(columns=df['id'], data=vals)

%time method_rishab(df)
# CPU times: user 2min 12s, sys: 4.64 s, total: 2min 17s
# Wall time: 2min 18s
# ...
# [12000 rows x 12000 columns]

def method_fahad(df):
    ids = list(df['id']) 
    range_of_ids = range(len(ids)) 

    def score_calculation(s_id1,s_id2): 
        s1 = set(list(df.loc[df['id'] == ids[s_id1]]['list_of_value'])[0]) 
        s2 = set(list(df.loc[df['id'] == ids[s_id2]]['list_of_value'])[0]) 
        # Resultant calculation s1&s2 
        return round(len(s1&s2)/len(s1) , 2) 

    dic = {indexQFID:  [score_calculation(indexQFID,ind) for ind in range_of_ids] for indexQFID in range_of_ids} 
    return pd.DataFrame(dic) 

# Stopped manually after running for more than 10 minutes.
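
The %time lines above are IPython magics. Outside IPython, a comparable measurement could be sketched with time.perf_counter. The helper time_call and the pure-Python method_sets below are hypothetical names, and the data is much smaller than the 12k rows used above:

```python
import random
import time

ALPHABET = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def method_sets(rows):
    # Plain nested comprehension over sets, used only as something to time.
    sets = [set(r) for r in rows]
    return [[len(a & b) / len(a) for b in sets] for a in sets]


rows = [random.sample(ALPHABET, 3) for _ in range(200)]
scores, seconds = time_call(method_sets, rows)
print(f"{seconds:.4f} s for {len(rows)} rows")
```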

Original post with solution details

It is possible to do this in pandas with a self-join.

As other answers have pointed out, the first step is to unpack the data into a longer form.

explode_df = df.set_index('id').list_of_value.explode().reset_index()
explode_df = explode_df.rename(columns={'list_of_value': 'value'})
explode_df
#     id value
# 0    0     a
# 1    0     b
# 2    0     c
# 3    1     d
# 4    1     b
# ...

From this table it is possible to compute the per-ID counts.

denom_df = explode_df.groupby('id').size().reset_index(name='denom')
denom_df
#    id  denom
# 0   0      3
# 1   1      3
# 2   2      3
# 3   3      3

And then comes the self-join, which happens on the value column. This pairs IDs once for each intersecting value, so the paired IDs can be counted to get the intersection sizes.

numer_df = explode_df.merge(explode_df, on='value', suffixes=['', '_y'])
numer_df = numer_df.groupby(['id', 'id_y']).size().reset_index(name='numer')
numer_df
#     id  id_y  numer
# 0    0     0      3
# 1    0     1      2
# 2    0     2      3
# 3    0     3      3
# 4    1     0      2
# 5    1     1      3
# ...
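
The pairing behaviour of the self-join can be seen on a tiny long-form table. This is a sketch with made-up data (two ids that share only the value 'b'); note that pairs of ids with no shared value never appear in the result, which is why the matrix form needs fillna(0):

```python
import pandas as pd

# Long form: id 0 has values a, b; id 1 has values b, c.
explode_df = pd.DataFrame({'id': [0, 0, 1, 1],
                           'value': ['a', 'b', 'b', 'c']})

# Every pair of rows with the same value becomes one row of the join.
pairs = explode_df.merge(explode_df, on='value', suffixes=('', '_y'))

# Counting the (id, id_y) pairs gives the intersection sizes.
numer = pairs.groupby(['id', 'id_y']).size()
print(numer)
```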

These two can then be merged, and a score computed.

calc_df = numer_df.merge(denom_df, on='id')
calc_df['score'] = calc_df['numer'] / calc_df['denom']
calc_df
#     id  id_y  numer  denom     score
# 0    0     0      3      3  1.000000
# 1    0     1      2      3  0.666667
# 2    0     2      3      3  1.000000
# 3    0     3      3      3  1.000000
# 4    1     0      2      3  0.666667
# 5    1     1      3      3  1.000000
# ...

If you prefer the matrix form, that is possible with a pivot. This will be a much larger representation if the data is sparse.

calc_df.pivot(index='id', columns='id_y', values='score').fillna(0)
# id_y         0         1         2         3
# id                                          
# 0     1.000000  0.666667  1.000000  1.000000
# 1     0.666667  1.000000  0.666667  0.666667
# 2     1.000000  0.666667  1.000000  1.000000
# 3     1.000000  0.666667  1.000000  1.000000
mcskinner

You can convert each list to a set and use set intersection (&) to check for overlap:

(only 1 apply function is used as you asked :-) )

(
    df.assign(s = df.list_of_value.apply(set))
    .pipe(lambda x: pd.DataFrame([[len(e&f)/len(e) for f in x.s] for e in x.s]))
)

    0           1           2           3
0   1.000000    0.666667    1.000000    1.000000
1   0.666667    1.000000    0.666667    0.666667
2   1.000000    0.666667    1.000000    1.000000
3   1.000000    0.666667    1.000000    1.000000
Allen Qin

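I would use itertools.product to get all pairs of lists. Then we can test membership with numpy.isin and average with numpy.mean (this relies on every list having the same length, so that the results stack into a regular array):

import numpy as np
import pandas as pd
from itertools import product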
l = len(df)
new_df = pd.DataFrame(data = np.array(list(map(lambda arr: np.isin(*arr),
                                                product(df['list_of_value'],
                                                        repeat=2))))
                               .mean(axis=1).reshape(l,-1),
                      index = df['id'],
                      columns=df['id'])

id         0         1         2         3
id                                        
0   1.000000  0.666667  1.000000  1.000000
1   0.666667  1.000000  0.666667  0.666667
2   1.000000  0.666667  1.000000  1.000000
3   1.000000  0.666667  1.000000  1.000000

Time sample

%%timeit
l = len(df)
new_df = pd.DataFrame(data = np.array(list(map(lambda arr: np.isin(*arr),
                                                product(df['list_of_value'],
                                                        repeat=2))))
                               .mean(axis=1).reshape(l,-1),
                      index = df['id'],
                      columns=df['id'])
594 µs ± 5.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
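
If the lists do not all have the same length, the stacked array above is no longer rectangular. A variant that takes the mean per pair avoids this; this is a sketch on made-up data, not the timed code above:

```python
import numpy as np

lists = [['a', 'b', 'c'], ['b', 'c'], ['a', 'd']]

# np.isin(a, b) flags which elements of a occur in b; the mean of the
# flags is the share of a's elements that are found in b.
scores = np.array([[np.isin(a, b).mean() for b in lists] for a in lists])
print(scores)
```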
ansev

This should be fast, and it also takes duplicates in the lists into account:

import itertools
import numpy as np
import pandas as pd
from collections import Counter

a = df.list_of_value.tolist()
# sum the multiset-intersection counts so that repeated values are counted
l = np.array([sum((Counter(x[0]) & Counter(x[1])).values()) for x in itertools.product(a, a)]).reshape(len(df), -1)
out = pd.DataFrame(l / df.list_of_value.str.len().values[:, None], index=df.id, columns=df.id)
out
id         0         1         2         3
id                                        
0   1.000000  0.666667  1.000000  1.000000
1   0.666667  1.000000  0.666667  0.666667
2   1.000000  0.666667  1.000000  1.000000
3   1.000000  0.666667  1.000000  1.000000
BENY

Yes! What we need here is a Cartesian product, which is given in this answer. It can be achieved without a for loop or a list comprehension.

Let's add a constant key column to our data frame df (the list column is named list_of_values here) so that it looks like this:

df['key'] = np.repeat(1, df.shape[0])
df

  list_of_values  key
0      [a, b, c]    1
1      [d, b, c]    1
2      [a, b, c]    1
3      [a, b, c]    1

Next, merge the frame with itself:

merged = pd.merge(df, df, on='key')[['list_of_values_x', 'list_of_values_y']]

This is what the merged frame looks like:

   list_of_values_x list_of_values_y
0         [a, b, c]        [a, b, c]
1         [a, b, c]        [d, b, c]
2         [a, b, c]        [a, b, c]
3         [a, b, c]        [a, b, c]
4         [d, b, c]        [a, b, c]
5         [d, b, c]        [d, b, c]
6         [d, b, c]        [a, b, c]
7         [d, b, c]        [a, b, c]
8         [a, b, c]        [a, b, c]
9         [a, b, c]        [d, b, c]
10        [a, b, c]        [a, b, c]
11        [a, b, c]        [a, b, c]
12        [a, b, c]        [a, b, c]
13        [a, b, c]        [d, b, c]
14        [a, b, c]        [a, b, c]
15        [a, b, c]        [a, b, c]

Then we apply the desired function to each row using axis=1:

values = merged.apply(lambda x: np.intersect1d(x['list_of_values_x'], x['list_of_values_y']).shape[0] / len(x['list_of_values_y']), axis=1)

Reshape this to get the values in the desired format:

values.values.reshape(4, 4)
array([[1.        , 0.66666667, 1.        , 1.        ],
       [0.66666667, 1.        , 0.66666667, 0.66666667],
       [1.        , 0.66666667, 1.        , 1.        ],
       [1.        , 0.66666667, 1.        , 1.        ]])
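
On pandas 1.2 or later, the helper key column is not needed, because merge accepts how='cross'. A sketch under that assumption, using the question's column name list_of_value and a two-row sample; it divides by the length of the left-hand list and scores the pairs with a plain comprehension rather than apply:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [0, 1],
                   'list_of_value': [['a', 'b', 'c'], ['d', 'b', 'c']]})

# Cartesian product of the frame with itself (requires pandas >= 1.2).
pairs = df.merge(df, how='cross', suffixes=('_x', '_y'))

# Score every pair, then fold the flat result back into an N x N matrix.
scores = [len(set(x) & set(y)) / len(x)
          for x, y in zip(pairs['list_of_value_x'], pairs['list_of_value_y'])]
matrix = pd.DataFrame(np.array(scores).reshape(len(df), len(df)),
                      index=df['id'], columns=df['id'])
print(matrix)
```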

Hope this helps :)

Pushkar Nimkar