3

I m trying to select rows from a pandas dataframe by applying condition to a column (in form of logical expression).

Sample data frame looks like:

    id             userid                code
    0            645382311          12324234234
    1            645382311          -2434234242
    2            645382312          32536365654
    3            645382312          12324234234

...

For example, I expect next result by applying logical expressions for column 'code':

    case 1: (12324234234 OR -2434234242) AND NOT 32536365654
    case 2: (12324234234 AND -2434234242) OR NOT 32536365654
    must give a result for both cases:
    userid: 645382311

The logic above says: For case 1 - give me only those userid who has at least one of the values (12324234234 OR -2434234242) and doesn't have 32536365654 in the whole data frame. For case 2 - I need only those userid who has either both codes in data frame (12324234234 AND -2434234242) or any codes but not 32536365654.

The statement like below returns empty DataFrame:

    flt = df[(df.code == 12324234234) & (df.code == -2434234242)]
    print("flt: ", flt)

Result (and it make sens):

    flt:  Empty DataFrame

Would be appreciate for any hints to handle such cases.

Alex
  • 533
  • 4
  • 12
  • The conditions seem a little strange: for case 1, ``AND NOT 32536365654`` is redundant; for case 2, ``(12324234234 AND -2434234242)`` is redundant. – YS-L Sep 18 '15 at 08:11
  • Everything is ok in conditions. For case 1 I need only those userid who has at least one of the values (12324234234 OR -2434234242) and doesn't have 32536365654 in the whole data frame. For case 2 everything is ok, I need only those userid who has both codes in data frame or any codes but not 32536365654. – Alex Sep 18 '15 at 08:43
  • You could find the users with code 32536365654, then use the df[~df.userid.isin( )] method to select users that don't have that code. See http://stackoverflow.com/questions/19960077/how-to-implement-in-and-not-in-for-pandas-dataframe – matt_s Sep 18 '15 at 09:51

1 Answers1

1

As a simple approach, I would transform your sample table into a boolean presence matrix, which would then allow you to perform the logic you need:

import pandas
sample = pandas.DataFrame([[645382311, 12324234234], [645382311, -2434234242], [645382312, 32536365654], [645382312, 12324234234]], columns=['userid', 'code'])
# Add a column of True values
sample['value'] = True
# Pivot to boolean presence matrix and remove MultiIndex
presence = sample.pivot(index='userid', columns='code').fillna(False)['value']
# Perform desired boolean tests
case1 = (presence[12324234234] | presence[-2434234242]) & ~(presence[32536365654])
case2 = (presence[12324234234] & presence[-2434234242]) | ~(presence[32536365654])

The case variables will contain the boolean test result for each userid.

pbarber
  • 139
  • 2
  • 8
  • Thank you, @pbarber! It works good. The only thing is that having cases definition in forms like `(-1111111AND2222222)NOT(-3333333)` I'm using **regexp** to convert them to the appropriate form `case = '((presence[-1111111] & presence[2222222]) &~ (presence[-3333333]))'` so to perform boolean tests. And since this case has type of **String**, the only way to evaluate it is to use `eval(case)`. Do you have any thoughts or maybe other availavble options to evaluate such cases? – Alex Sep 21 '15 at 07:34