finding non matching records in pandas

Question

I would like to identify if a set of records is not represented by a distinct list of values; so in this example of:

raw_data = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches'],
        'sport' : ['soccer','soccer','soccer','soccer','soccer']}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name','sport'])

raw_data = {
        'subject_id': ['9', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan'],
        'sport' : ['soccer','soccer','soccer','soccer','soccer']}
df_b = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name','sport'])

raw_data = {
        'subject_id': ['9', '5', '6', '7'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce'], 
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice'],
        'sport' : ['football','football','football','football']}
df_c = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name','sport'])

raw_data = {
        'subject_id': ['1', '3', '5'],
        'first_name': ['Alex', 'Allen', 'Ayoung'], 
        'last_name': ['Anderson', 'Ali',  'Atiches'],
       'sport' : ['football','football','football']}
df_d = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name','sport'])

frames = [df_a,df_b,df_c,df_d]

frame = pd.concat(frames)
frame = frame.sort_values(by='subject_id')

raw_data = {
        'sport':['soccer','football','softball']
        }
sportlist = pd.DataFrame(raw_data,columns=['sport'])

Desired output: I would like to get a list of first_name and last_name pairs that do not play football. And also I would like be able to return a list of all the records since softball is not represented in the original list.

I tried using merge with how= outer, indicator=True options but since there is a record that plays soccer there is a match. And the '_right_only' yields no records since it was not populated in the original data.

Thanks, aem

You should really try to minimize all this code to a simple demonstration of what you are trying to do. — user3483203, May 02 '18 at 18:11
Is there a chance that all you want is `frame[frame.sport != 'football']` ? — Stefan Falk, May 02 '18 at 18:16
Possible duplicate of [Select rows from a DataFrame based on values in a column in pandas](https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas) — Stefan Falk, May 02 '18 at 18:20

score 0 · Answer 1 · answered May 02 '18 at 18:18

If you only want to get the names of people who do not play football all you need to do is:

frame[frame.sport != 'football']

Which would select only those persons who are not playing football.

If it has to be a list you can further call to_records(index=False)

frame[frame.sport != 'football'][['first_name', 'last_name']].to_records(index=False)

which returns a list of tuples:

[('Alex', 'Anderson'), ('Amy', 'Ackerman'), ('Allen', 'Ali'),
 ('Alice', 'Aoni'), ('Brian', 'Black'), ('Ayoung', 'Atiches'),
 ('Bran', 'Balwner'), ('Bryce', 'Brice'), ('Betty', 'Btisan'),
 ('Billy', 'Bonder')]

score 0 · Answer 2 · answered May 02 '18 at 18:20

You can also use .loc indexer in pandas

frame.loc[frame['sport'].ne('football'), ['first_name','last_name']].values.tolist()

[['Alex', 'Anderson'],
 ['Amy', 'Ackerman'],
 ['Allen', 'Ali'],
 ['Alice', 'Aoni'],
 ['Brian', 'Black'],
 ['Ayoung', 'Atiches'],
 ['Bran', 'Balwner'],
 ['Bryce', 'Brice'],
 ['Betty', 'Btisan'],
 ['Billy', 'Bonder']]

finding non matching records in pandas

2 Answers2