How to get a list of rows which exist in one dataframe and do not exist in the second one in Pandas

Question

I have two dataframes - one is database and another one is actual list. Database: list of skills

skills
skill1
skill2
skill3
skill4

List: list of pairs

worker	skills_of_worker
worker1	skill1
worker1	skill2
worker1	skill3
worker2	skill2
worker2	skill3
worker2	skill4
worker3	skill1
worker3	skill3
worker3	skill4

How can I get a list of pairs worker-skills, which are in Database, but not in list?:what I need

worker	skills_they_dont_have
worker1	skill4
worker2	skill1
worker3	skill2

I tried different ways of merging and dropind duplicates, but, because of actual list of workers il pretty long, the duplicates exists for skills.

I imagined there was a typo in your output – mozway Apr 12 '23 at 11:20 — mozway, Apr 12 '23 at 11:20

mozway · Answer 1 · 2023-04-12T11:20:06.357

I would use set operations (difference), then explode:

skills = {'skill1', 'skill2', 'skill3', 'skill4'}

out = (
  df.groupby('worker', as_index=False)['skills_of_worker']
   .agg(skills.difference)
   .explode('skills_of_worker')
   .rename(columns={'skills_of_worker': 'skills_they_dont_have'})
)

Output:

    worker skills_they_dont_have
0  worker1                skill4
1  worker2                skill1
2  worker3                skill2

score 0 · Accepted Answer · answered Apr 12 '23 at 11:42

One way to do this is to create a dataframe of all workers and skills (by cross joining them) and then left joining that to the workers df to find which combinations don't exist; that being the skills that each worker doesn't have:

workers = pd.DataFrame({
    'worker': ['worker1', 'worker1', 'worker1', 'worker2', 'worker2', 'worker2', 'worker3', 'worker3', 'worker3'],
    'skills_of_worker': ['skill1', 'skill2', 'skill3', 'skill2', 'skill3', 'skill4', 'skill1', 'skill3', 'skill4']
})
skills = pd.DataFrame({'skills': ['skill1', 'skill2', 'skill3', 'skill4']})

all_workers = pd.Series(workers['worker'].unique(), name='worker')
res = skills.merge(all_workers, how='cross').merge(workers, how='left', left_on=['worker', 'skills'], right_on=['worker', 'skills_of_worker'])
res = res[res['skills_of_worker'].isna()].drop('skills_of_worker', axis=1).rename(columns={'skills':'skills_they_dont_have'}).reset_index(drop=True)

Output:

  skills_they_dont_have   worker
0                skill1  worker2
1                skill2  worker3
2                skill4  worker1

How to get a list of rows which exist in one dataframe and do not exist in the second one in Pandas

2 Answers2