-1

I'm trying to compare 4 columns in a pandas dataframe and populate 5'th column based on the result. In normal SQL it would be something like this:

if speciality_new is null and location_new is null then 'No match found'
elif specialty <> specialty_new and location <> location_new then 'both are different'
elif specialty_new is null then 'specialty not found'
elif location_new is null then 'location not found'
else 'true'

I read that this could be achieved using np.where but my code is failing. Can someone please advice me what I'm doing wrong. This is what I wrote:

masterDf['Match'] = np.where(
    masterDf[speciality_new].isnull() & masterDf[location_new].isnull(), 'No match found',
    masterDf[speciality] != masterDf[speciality_new] & masterDf[location] != masterDf[location_new], 'Both specialty and location didnt match',
    masterDf[speciality] != masterDf[speciality_new], 'Specialty didnt match',
    masterDf[location] != masterDf[location_new], 'Location didnt match',
    True)

The error message is TypeError: unsupported operand type(s) for &: 'str' and 'str' which doesn't make any sense since '&' is the syntax for 'and'

dfsample is what I have and dfFinal is what I want

dfsample = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
       'speciality': ['doctor', 'nurse', 'patient', 'driver', 'director'],
       'location': ['texas', 'dc', 'georgia', '', 'florida'],
       'speciality_new' : ['doctor', 'nurse', 'director', 'nurse', ''],
       'location_new': ['texas', 'alaska', 'georgia', 'maryland', 'florida']})

dfFinal = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
       'speciality': ['doctor', 'nurse', 'patient', 'driver', 'director'],
       'location': ['texas', 'dc', 'georgia', '', 'florida'],
       'speciality_new' : ['doctor', 'nurse', 'director', 'nurse', ''],
       'location_new': ['texas', 'alaska', 'georgia', 'maryland', 'florida'],
       'match': ['TRUE', 'location didn’t match', 'specialty didn’t match', 'both specialty and location didn’t match', 'specialty didn’t match']})
Joe Ferndz
  • 8,417
  • 2
  • 13
  • 33
fellowCoder
  • 69
  • 10
  • 'and' is the syntax for and. – pakpe Jan 19 '21 at 23:03
  • Even 'and' didnt work. this is what is said in stack overflow: https://stackoverflow.com/questions/36921951/truth-value-of-a-series-is-ambiguous-use-a-empty-a-bool-a-item-a-any-o – fellowCoder Jan 19 '21 at 23:09
  • kindly share sample dataframe, with expected output – sammywemmy Jan 20 '21 at 01:07
  • You need to use the open and close parenthesis to ensure you got the query correctly mapped. – Joe Ferndz Jan 20 '21 at 03:22
  • @sammywemmy added the sample dataframe and also the final output that I want – fellowCoder Jan 20 '21 at 03:23
  • @JoeFerndz I tried adding paranthesis but still getting the same error. Also, thank you for the edits. – fellowCoder Jan 20 '21 at 03:38
  • You're using multiple conditions in the `numpy.where` function, but it expects just one, alongside its output as parameters. You should use `numpy.select`, a nested `numpy.where` (with calls of `numpy.where` as the last argument from the function up to the default desired output), or simply use a Python built-in if-elif-else statement with `map` or `apply`, as provided in the answers. – Cainã Max Couto-Silva Jan 20 '21 at 07:53

2 Answers2

2

For analyzing multiple conditions using numpy, you're better using numpy.select, where you should specify the conditions, the expected output for each condition, and a default output, just like an if-elif-else statement:

import numpy as np

condlist = [
    dfsample['speciality_new'].isnull() & dfsample['location_new'].isnull(),
    dfsample['speciality'].ne(dfsample['speciality_new']) & 
    dfsample['location'].ne(dfsample['location_new']),
    dfsample['speciality'].ne(dfsample['speciality_new']),
    dfsample['location'].ne(dfsample['location_new']),
]

choicelist = [
    'No match found',
    'Both specialty and location didnt match',
    'Specialty didnt match',
    'Location didnt match'
]

dfsample['match'] = np.select(condlist, choicelist, default=True)
print(dfsample)

where ne stands by "not equal" (you can simply use !=).


Output:

   ID speciality location speciality_new location_new                                    match
0   1     doctor    texas         doctor        texas                                     True
1   2      nurse       dc          nurse       alaska                     Location didnt match
2   3    patient  georgia       director      georgia                    Specialty didnt match
3   4     driver                   nurse     maryland  Both specialty and location didnt match
4   5   director  florida                     florida                    Specialty didnt match
Cainã Max Couto-Silva
  • 4,839
  • 1
  • 11
  • 35
  • Hi @caina, thank you for helping out with this problem. I'm using the same code - with slight modification - but getting an error ```ValueError: Wrong number of items passed 63311, placement implies 1``` Can you please give any hint what can be the issue. 63311 is the number of rows in dataframe. – fellowCoder Feb 23 '21 at 15:04
  • The code is ```condList = [ dfsample['Address'].str.extract('(\d+)').isna(), dfsample['Address'].str.extract('(\d+)').replace(np.nan, -1, regex=True).astype(int)[0].eq(dfsample['street']), ] choiceList = [False, True] dfsample['match'] = np.select(condList, choiceList, default=False)``` – fellowCoder Feb 23 '21 at 15:06
  • Your first condition from `condList` returns a data.frame, where it's expected a Series or one-dimensional numpy array. If you place a `[0]` after it (like `dfsample['Address'].str.extract('(\d+)').isna()[0])` it may work. – Cainã Max Couto-Silva Feb 23 '21 at 23:36
1

Here an alternate way to solve for it without the np.where. I am using apply function.

import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
       'speciality': ['doctor', 'nurse', 'patient', 'driver', 'director'],
       'location': ['texas', 'dc', 'georgia', '', 'florida'],
       'speciality_new' : ['doctor', 'nurse', 'director', 'nurse', np.NaN],
       'location_new': ['texas', 'alaska', 'georgia', 'maryland', 'florida']})

print (df)

def master_check(x):
    #print (x)
    if    (pd.isnull(x['speciality_new'])) & (pd.isnull(x['location_new'])): return 'No match found'
    elif  (x['speciality'] != x['speciality_new']) & (x['location'] != x['location_new']): return 'Both specialty and location didnt match'
    elif  x['speciality'] != x['speciality_new']: return 'Specialty didnt match'
    elif  x['location'] != x['location_new']: return 'Location didnt match'
    else: return True

df['Match'] = df.apply(master_check,axis=1)

The output will be:

ID speciality location speciality_new location_new
0   1     doctor    texas         doctor        texas
1   2      nurse       dc          nurse       alaska
2   3    patient  georgia       director      georgia
3   4     driver                   nurse     maryland
4   5   director  florida            NaN      florida


ID speciality  ... location_new                                    Match
0   1     doctor  ...        texas                                     True
1   2      nurse  ...       alaska                     Location didnt match
2   3    patient  ...      georgia                    Specialty didnt match
3   4     driver  ...     maryland  Both specialty and location didnt match
4   5   director  ...      florida                    Specialty didnt match

If you do want to use numpy.where(), then you have to consider each False statement as a separate numpy.where(). To implement it using numpy.where(), you have to do it this way.

import pandas as pd
import numpy as np

masterDf = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
       'speciality': ['doctor', 'nurse', 'patient', 'driver', 'director'],
       'location': ['texas', 'dc', 'georgia', '', 'florida'],
       'speciality_new' : ['doctor', 'nurse', 'director', 'nurse', ''],
       'location_new': ['texas', 'alaska', 'georgia', 'maryland', 'florida']})


masterDf['Match'] = np.where(
    ((masterDf.speciality_new.isnull()) & (masterDf.location_new.isnull())), 'No match found',
    np.where(((masterDf.speciality != masterDf.speciality_new) & (masterDf.location != masterDf.location_new)), 'Both specialty and location didnt match',
    np.where((masterDf.speciality != masterDf.speciality_new), 'Specialty didnt match',
    np.where((masterDf.location != masterDf.location_new), 'Location didnt match',
    True))))

print (masterDf)

The output will be:

   ID speciality  ... location_new                                    Match
0   1     doctor  ...        texas                                     True
1   2      nurse  ...       alaska                     Location didnt match
2   3    patient  ...      georgia                    Specialty didnt match
3   4     driver  ...     maryland  Both specialty and location didnt match
4   5   director  ...      florida                    Specialty didnt match
Joe Ferndz
  • 8,417
  • 2
  • 13
  • 33