Below is my code. I am trying to parse a DataFrame and store company matches. However, the if statement always evaluates to true and every row is saved in the DataFrame current_customers, even though only about 10 of my 150 rows have a value > 97. A sample of my data is below my code.

current_customers = pandas.DataFrame()
potential_customers = pandas.DataFrame()
for i in range(0, len(FDA_useful_companies_bing)):
    if combined_data['match token sort'].iloc[i] or combined_data['match ratio'].iloc[i] or combined_data['match partial ratio'].iloc[i] > 97:
        current_customers = current_customers.append(combined_data.ix[i,4::])
    else:
        potential_customers = potential_customers.append(combined_data.ix[i,4::])

Sample of my data

Company                             City            State       ZIP     FDA Company                 FDA City            FDA State   FDA ZIP Token sort ratio              match token sort  Ratio                           match ratio    Partial Ratio            match partial ratio
NOVARTIS                            Larchwood       IA          51241   HELGET GAS PRODUCTS INC     Kansas City         MO          64116   AIR PRODUCTS  CHEMICALS INC   73                OCEANIC MEDICAL PRODUCTS INC    59             LUCAS INC                78
BOEHRINGER INGELHEIM VETMEDICA INC  Sioux Center    IA          51250   SOUTHWEST TECHNOLOGIES INC  North Kansas City   MO          64116   SOUTHWEST TECHNOLOGIES        100               SOUTHWEST TECHNOLOGIES          92             SOUTHWEST TECHNOLOGIES   100

EDIT: Additionally, if there is a more efficient way to do this, I would love to hear.

Jstuff

2 Answers

2

IIUC you can just do:

current_customers = combined_data[(combined_data[['match token sort','match ratio','match partial ratio']] > 97).any(axis=1)]

potential_customers = combined_data[(combined_data[['match token sort','match ratio','match partial ratio']] <= 97).all(axis=1)]

What you tried always takes the first branch because or short-circuits: any non-zero value evaluates to True, and only the last term is compared against 97, not all three terms as you expected:

if combined_data['match token sort'].iloc[i] or combined_data['match ratio'].iloc[i] or combined_data['match partial ratio'].iloc[i] > 97:

So this is equivalent to:

if some_val or another_val or last_val > 97:

so here, if some_val is non-zero or another_val is non-zero, then the entire statement evaluates to True without the comparison ever being checked.

You can see this in a simplified case:

In [83]:
x = 1
if 5 or x > 95:
    print('True')
else:
    print('False')

this outputs:

True

With just a single comparison:

In [85]:
if 5 > 95:
    print('True')
else:
    print('False')

outputs:

False

but with each value compared with the target value:

In [87]:
x=1
if 5 > 95 or x > 95:
    print('True')
else:
    print('False')

this now prints:

False

But the real point here is to not loop at all. You can sub-select from your df by passing a list of the cols of interest, then compare that entire sub-df against your scalar value. Calling any(axis=1) on the result generates a boolean mask that is True for each row where at least one col exceeds the value, and you use this to mask the df to return the current customers. You then invert the comparison and use all(axis=1) to find the rows where none of the cols satisfy the previous comparison, which filters the df for the potential customers.
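As a minimal sketch of the masking approach described above: the frame below is a hypothetical stand-in for combined_data, using the scores from the two sample rows in the question plus one made-up row ("EXAMPLE CO"), and the names score_cols and mask are introduced here for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for combined_data; the last row is made up.
combined_data = pd.DataFrame({
    "Company": ["NOVARTIS", "BOEHRINGER INGELHEIM VETMEDICA INC", "EXAMPLE CO"],
    "match token sort": [73, 100, 98],
    "match ratio": [59, 92, 40],
    "match partial ratio": [78, 100, 97],
})

score_cols = ["match token sort", "match ratio", "match partial ratio"]

# Compare every score against the scalar, then reduce across each row:
# True where at least one of the three scores is above 97.
mask = (combined_data[score_cols] > 97).any(axis=1)

current_customers = combined_data[mask]
# With no missing scores, ~mask selects the same rows as (<= 97).all(axis=1).
potential_customers = combined_data[~mask]

print(current_customers["Company"].tolist())
print(potential_customers["Company"].tolist())
```

If the score columns can contain NaN, ~mask and (<= 97).all(axis=1) can differ, because both comparisons are False for NaN; which one you want depends on how such rows should be classified.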

EdChum
  • A ha I knew there had to be a simpler way to do this. I appreciate taking the time to explain why it always evaluates as true with your example! – Jstuff Jul 26 '16 at 16:04
  • The `axis=1` command is hard to understand though. – Jstuff Jul 26 '16 at 16:06
  • The `axis=1` param indicates that we want the comparison done row-wise instead of column-wise, which would be `axis=0`. You should try changing it from `1` to `0` to see the difference. – EdChum Jul 26 '16 at 16:30
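To make the axis comment concrete, here is a small sketch on a made-up two-by-two frame (the column names a and b are hypothetical). With axis=1 you get one result per row; with axis=0 you get one result per column.

```python
import pandas as pd

# Made-up data purely to show the shape of each result.
df = pd.DataFrame({"a": [100, 10], "b": [20, 30]})

above = df > 97  # element-wise comparison, same shape as df

row_wise = above.any(axis=1)  # indexed by row label: is any column in this row > 97?
col_wise = above.any(axis=0)  # indexed by column name: is any row in this column > 97?

print(row_wise)
print(col_wise)
```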
0

Your problem is the if statement, as you suspected:

if combined_data['match token sort'].iloc[i] or combined_data['match ratio'].iloc[i] or combined_data['match partial ratio'].iloc[i] > 97:

You're asking whether the expression "combined_data['match token sort'].iloc[i]" is true. It is a number > 0, so Python treats it as a truthy value, and the entire expression evaluates to True.

I'll add parentheses to make it clearer how Python interprets this line of code:

if ((combined_data['match token sort'].iloc[i]) or
        (combined_data['match ratio'].iloc[i]) or
        (combined_data['match partial ratio'].iloc[i] > 97)):

Python evaluates each parenthesized expression separately. It considers any non-zero number to be a "truthy" value, so a bare number used as a condition counts as True. Here's a corrected expression:

if ((combined_data['match token sort'].iloc[i] > 97) or
        (combined_data['match ratio'].iloc[i] > 97) or
        (combined_data['match partial ratio'].iloc[i] > 97)):

Now Python will treat each operation as a comparison, as you intended.
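The difference can be checked without pandas. This is a small sketch using the three scores from the first sample row of the question; the names scores, buggy and fixed are made up for illustration.

```python
# Scores from the first sample row: none of them is above 97.
scores = (73, 59, 78)

# Original form: only the last term is compared. The first operand (73) is
# truthy, so "or" returns it immediately and the comparison never runs.
buggy = bool(scores[0] or scores[1] or scores[2] > 97)

# Corrected form: every score is compared against the threshold.
fixed = any(score > 97 for score in scores)

print(buggy)
print(fixed)
```

Using any() with a generator is just a compact way to write the three explicit comparisons joined by or.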

James