Pandas: Check value exist in a Column, which is stored as list

Question

I have the below data frame. The status column stores the value as a list.

df

   STATUS     
1 [REQUESTED, RECEIVED]
2 [XYZ]
3 [RECEIVED]

When I try the below logic:

df['STATUS'].str.upper().isin(['RECEIVED'])

It gives me

1 False
2 False
3 False

But I am expecting

1 True
2 False
3 True

as we have the value RECEIVED at rows 1 and 3

It looks like you have a typo in 3. - it should be RECEIVED and not RECIEVED. — Ami Tavory, Apr 06 '18 at 12:51

Ami Tavory · Answer 1 · 2018-04-06T13:02:33.877

1

It's possible you mean something like

>>> df.STATUS.astype(str).str.upper().str.contains('RECEIVED')
1 True
2 False
3 False

(Your example has a typo, incidentally - 1. has RECEIVED and 3. has RECIEVED.)

as isin is the opposite of what is meant by your example.

edited Apr 06 '18 at 13:02

answered Apr 06 '18 at 12:45

Ami Tavory

74,578
11
141
185

it gives me the error `Series object has no attribute contains` – Tom J Muthirenthi Apr 06 '18 at 12:50
Are you sure you placed `.str` after `upper()`? – Ami Tavory Apr 06 '18 at 12:53
but in this question i saw something similar about isin https://stackoverflow.com/questions/19960077/how-to-implement-in-and-not-in-for-pandas-dataframe – Tom J Muthirenthi Apr 06 '18 at 15:54
@TomJMuthirenthi Yes, that is the exact opposite than this question, no? In that question, each of the series elements is a single string, and the point is to check if this string is in a list. Here each series element is a list, and the point is to check if the list *contains* a string. LMK if I'm missing something. – Ami Tavory Apr 06 '18 at 16:02
So, how can we write the equivalent for a df `select * from df where col in (5,6)` ? Please help. – Tom J Muthirenthi Apr 06 '18 at 16:06
@TomJMuthirenthi That would indeed be using `isin`, but it is the opposite case of your question here, AFAIU. In `SELECT * .. WHERE col IN (5, 6)`, `col` is implicitly a column of scalars. – Ami Tavory Apr 06 '18 at 16:14
So you mean to say, since STATUS column holds list, it is causing the confusion? – Tom J Muthirenthi Apr 06 '18 at 16:35
@TomJMuthirenthi Yes. In this case `STATUS` column holds lists, and you're checking if each one *contains* something, not if it's *contained in* something, AFAIU. – Ami Tavory Apr 06 '18 at 17:11
In that case should it print true for case 3, right? – Tom J Muthirenthi Apr 07 '18 at 05:07

jdehesa · Answer 2 · 2018-04-06T13:09:08.803

It's hard to operate directly with list values. You can concatenate the strings into one, using some separator character, and then check the condition:

import pandas as pd

df = pd.DataFrame({'STATUS': [['REQUESTED', 'RECEIVED'], ['XYZ'], ['RECEIVED']]},
                  index=[1, 2, 3])
print(df['STATUS'].str.join('|').str.contains('RECEIVED'))

Output:

1     True
2    False
3     True
Name: STATUS, dtype: bool

A more efficient option would be to replace the strings with numerical flags. This can be done really nicely since Python 3.6 using enum.Flag.

import enum
import pandas as pd

class Status(enum.Flag):
    REQUESTED = enum.auto()
    RECEIVED = enum.auto()
    XYZ = enum.auto()

df = pd.DataFrame({'STATUS': [Status.REQUESTED | Status.RECEIVED, Status.XYZ, Status.RECEIVED]}, index=[1, 2, 3])
print(df['STATUS'] & Status.RECEIVED)

Or, if you already have a data frame with strings:

import enum
import pandas as pd
from functools import reduce

class Status(enum.Flag):
    REQUESTED = enum.auto()
    RECEIVED = enum.auto()
    XYZ = enum.auto()

df = pd.DataFrame({'STATUS': [['REQUESTED', 'RECEIVED'], ['XYZ'], ['RECEIVED']]}, index=[1, 2, 3])
df['STATUS_ENUM'] = df['STATUS'].apply(lambda v: reduce(lambda a, b: a | Status[b], v, Status(0)))
print(df['STATUS_ENUM'] & Status.RECEIVED)

I am reconsidering the usage of `contains` as it fails to distinguish between `RECEIVED` and `RECEIVED CASH` — Tom J Muthirenthi, Apr 06 '18 at 18:26
@TomJMuthirenthi Yes, this approach has that kind of limitations... You could do something like `contains('RECEIVED|')`, but you would need to append a `|` to each row after `join`... — jdehesa, Apr 06 '18 at 18:37

James · Accepted Answer · 2018-04-06T19:23:35.517

1

For a simple check like this, you can join the list of strings and use contains.

EDIT: To account for the difference between RECEIVED and RECEIVED CASH, you can join the lists with a unique character (such as '=') AND surround the resulting string with the same character, and then check for =RECEIVED=.

('=' + df['STATUS'].str.join('=') + '=').str.contains('=RECEIVED=')

edited Apr 06 '18 at 19:23

answered Apr 06 '18 at 12:50

James

32,991
4
47
70

I am reconsidering the usage of contains as it fails to distinguish between ~RECEIVED~ and ~RECEIVED CASH~ – Tom J Muthirenthi Apr 06 '18 at 16:33

score 1 · Answer 4 · answered Apr 06 '18 at 13:07

1

Data from jde

df = pd.DataFrame({'STATUS': [['REQUESTED', 'RECEIVED'], ['XYZ'], ['RECEIVED']]},
                  index=[1, 2, 3])
df.STATUS.apply(lambda x : 'RECEIVED' in x)
Out[11]: 
1     True
2    False
3     True
Name: STATUS, dtype: bool

answered Apr 06 '18 at 13:07

BENY

317,841
20
164
234

Pandas: Check value exist in a Column, which is stored as list

4 Answers4