I would like to identify if a set of records is not represented by a distinct list of values; so in this example of:
raw_data = {
'subject_id': ['1', '2', '3', '4', '5'],
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches'],
'sport' : ['soccer','soccer','soccer','soccer','soccer']}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name','sport'])
raw_data = {
'subject_id': ['9', '5', '6', '7', '8'],
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan'],
'sport' : ['soccer','soccer','soccer','soccer','soccer']}
df_b = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name','sport'])
raw_data = {
'subject_id': ['9', '5', '6', '7'],
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice'],
'sport' : ['football','football','football','football']}
df_c = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name','sport'])
raw_data = {
'subject_id': ['1', '3', '5'],
'first_name': ['Alex', 'Allen', 'Ayoung'],
'last_name': ['Anderson', 'Ali', 'Atiches'],
'sport' : ['football','football','football']}
df_d = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name','sport'])
frames = [df_a,df_b,df_c,df_d]
frame = pd.concat(frames)
frame = frame.sort_values(by='subject_id')
raw_data = {
'sport':['soccer','football','softball']
}
sportlist = pd.DataFrame(raw_data,columns=['sport'])
Desired output: I would like to get a list of first_name and last_name pairs that do not play football. And also I would like be able to return a list of all the records since softball is not represented in the original list.
I tried using merge with how= outer, indicator=True options but since there is a record that plays soccer there is a match. And the '_right_only' yields no records since it was not populated in the original data.
Thanks, aem