0

I have two dataframes - one is database and another one is actual list. Database: list of skills

skills
skill1
skill2
skill3
skill4

List: list of pairs

worker skills_of_worker
worker1 skill1
worker1 skill2
worker1 skill3
worker2 skill2
worker2 skill3
worker2 skill4
worker3 skill1
worker3 skill3
worker3 skill4

How can I get a list of pairs worker-skills, which are in Database, but not in list?:what I need

worker skills_they_dont_have
worker1 skill4
worker2 skill1
worker3 skill2

I tried different ways of merging and dropind duplicates, but, because of actual list of workers il pretty long, the duplicates exists for skills.

mozway
  • 194,879
  • 13
  • 39
  • 75
A Sm
  • 3
  • 2

2 Answers2

0

I would use set operations (difference), then explode:

skills = {'skill1', 'skill2', 'skill3', 'skill4'}

out = (
  df.groupby('worker', as_index=False)['skills_of_worker']
   .agg(skills.difference)
   .explode('skills_of_worker')
   .rename(columns={'skills_of_worker': 'skills_they_dont_have'})
)

Output:

    worker skills_they_dont_have
0  worker1                skill4
1  worker2                skill1
2  worker3                skill2
mozway
  • 194,879
  • 13
  • 39
  • 75
0

One way to do this is to create a dataframe of all workers and skills (by cross joining them) and then left joining that to the workers df to find which combinations don't exist; that being the skills that each worker doesn't have:

workers = pd.DataFrame({
    'worker': ['worker1', 'worker1', 'worker1', 'worker2', 'worker2', 'worker2', 'worker3', 'worker3', 'worker3'],
    'skills_of_worker': ['skill1', 'skill2', 'skill3', 'skill2', 'skill3', 'skill4', 'skill1', 'skill3', 'skill4']
})
skills = pd.DataFrame({'skills': ['skill1', 'skill2', 'skill3', 'skill4']})

all_workers = pd.Series(workers['worker'].unique(), name='worker')
res = skills.merge(all_workers, how='cross').merge(workers, how='left', left_on=['worker', 'skills'], right_on=['worker', 'skills_of_worker'])
res = res[res['skills_of_worker'].isna()].drop('skills_of_worker', axis=1).rename(columns={'skills':'skills_they_dont_have'}).reset_index(drop=True)

Output:

  skills_they_dont_have   worker
0                skill1  worker2
1                skill2  worker3
2                skill4  worker1
Nick
  • 138,499
  • 22
  • 57
  • 95