2

I have the below data frame. The status column stores the value as a list.

df

   STATUS     
1 [REQUESTED, RECEIVED]
2 [XYZ]
3 [RECEIVED]

When I try the below logic:

df['STATUS'].str.upper().isin(['RECEIVED'])

It gives me

1 False
2 False
3 False

But I am expecting

1 True
2 False
3 True

as we have the value RECEIVED at rows 1 and 3

Tom J Muthirenthi
  • 3,028
  • 7
  • 40
  • 60

4 Answers4

1

It's possible you mean something like

>>> df.STATUS.astype(str).str.upper().str.contains('RECEIVED')
1 True
2 False
3 False

(Your example has a typo, incidentally - 1. has RECEIVED and 3. has RECIEVED.)

as isin is the opposite of what is meant by your example.

Ami Tavory
  • 74,578
  • 11
  • 141
  • 185
  • it gives me the error `Series object has no attribute contains` – Tom J Muthirenthi Apr 06 '18 at 12:50
  • Are you sure you placed `.str` after `upper()`? – Ami Tavory Apr 06 '18 at 12:53
  • but in this question i saw something similar about isin https://stackoverflow.com/questions/19960077/how-to-implement-in-and-not-in-for-pandas-dataframe – Tom J Muthirenthi Apr 06 '18 at 15:54
  • @TomJMuthirenthi Yes, that is the exact opposite than this question, no? In that question, each of the series elements is a single string, and the point is to check if this string is in a list. Here each series element is a list, and the point is to check if the list *contains* a string. LMK if I'm missing something. – Ami Tavory Apr 06 '18 at 16:02
  • So, how can we write the equivalent for a df `select * from df where col in (5,6)` ? Please help. – Tom J Muthirenthi Apr 06 '18 at 16:06
  • @TomJMuthirenthi That would indeed be using `isin`, but it is the opposite case of your question here, AFAIU. In `SELECT * .. WHERE col IN (5, 6)`, `col` is implicitly a column of scalars. – Ami Tavory Apr 06 '18 at 16:14
  • So you mean to say, since STATUS column holds list, it is causing the confusion? – Tom J Muthirenthi Apr 06 '18 at 16:35
  • @TomJMuthirenthi Yes. In this case `STATUS` column holds lists, and you're checking if each one *contains* something, not if it's *contained in* something, AFAIU. – Ami Tavory Apr 06 '18 at 17:11
  • In that case should it print true for case 3, right? – Tom J Muthirenthi Apr 07 '18 at 05:07
1

It's hard to operate directly with list values. You can concatenate the strings into one, using some separator character, and then check the condition:

import pandas as pd

df = pd.DataFrame({'STATUS': [['REQUESTED', 'RECEIVED'], ['XYZ'], ['RECEIVED']]},
                  index=[1, 2, 3])
print(df['STATUS'].str.join('|').str.contains('RECEIVED'))

Output:

1     True
2    False
3     True
Name: STATUS, dtype: bool

A more efficient option would be to replace the strings with numerical flags. This can be done really nicely since Python 3.6 using enum.Flag.

import enum
import pandas as pd

class Status(enum.Flag):
    REQUESTED = enum.auto()
    RECEIVED = enum.auto()
    XYZ = enum.auto()

df = pd.DataFrame({'STATUS': [Status.REQUESTED | Status.RECEIVED, Status.XYZ, Status.RECEIVED]}, index=[1, 2, 3])
print(df['STATUS'] & Status.RECEIVED)

Or, if you already have a data frame with strings:

import enum
import pandas as pd
from functools import reduce

class Status(enum.Flag):
    REQUESTED = enum.auto()
    RECEIVED = enum.auto()
    XYZ = enum.auto()

df = pd.DataFrame({'STATUS': [['REQUESTED', 'RECEIVED'], ['XYZ'], ['RECEIVED']]}, index=[1, 2, 3])
df['STATUS_ENUM'] = df['STATUS'].apply(lambda v: reduce(lambda a, b: a | Status[b], v, Status(0)))
print(df['STATUS_ENUM'] & Status.RECEIVED)
jdehesa
  • 58,456
  • 7
  • 77
  • 121
  • I am reconsidering the usage of `contains` as it fails to distinguish between `RECEIVED` and `RECEIVED CASH` – Tom J Muthirenthi Apr 06 '18 at 18:26
  • @TomJMuthirenthi Yes, this approach has that kind of limitations... You could do something like `contains('RECEIVED|')`, but you would need to append a `|` to each row after `join`... – jdehesa Apr 06 '18 at 18:37
1

For a simple check like this, you can join the list of strings and use contains.

EDIT: To account for the difference between RECEIVED and RECEIVED CASH, you can join the lists with a unique character (such as '=') AND surround the resulting string with the same character, and then check for =RECEIVED=.

('=' + df['STATUS'].str.join('=') + '=').str.contains('=RECEIVED=')
James
  • 32,991
  • 4
  • 47
  • 70
1

Data from jde

df = pd.DataFrame({'STATUS': [['REQUESTED', 'RECEIVED'], ['XYZ'], ['RECEIVED']]},
                  index=[1, 2, 3])
df.STATUS.apply(lambda x : 'RECEIVED' in x)
Out[11]: 
1     True
2    False
3     True
Name: STATUS, dtype: bool
BENY
  • 317,841
  • 20
  • 164
  • 234