Pandas - Check if column of strings in column of lists row wise

Question

I have a dataframe which you can build with this:

dflist=[['123',['abc','qw3','123']],
        ['ab12',['3e4r5','12we3','asd23','q2w3']]]
df=pd.DataFrame(dflist,columns=['check','checklist'])

And looks like this:

  check                    checklist
0   123              [abc, qw3, 123]
1  ab12  [3e4r5, 12we3, asd23, q2w3]

I want to check if the item in column "check" is in the list in column "checklist". So I want the resulting dataframe to look like:

  check                    checklist checkisin
0   123              [abc, qw3, 123]      True
1  ab12  [3e4r5, 12we3, asd23, q2w3]     False

I have tried several things including using .isin in various forms including apply/lambda. and directly.

This:

df['checkisin']=df.check.isin(df.checklist)

produces:

  check                    checklist  checkisin
0   123              [abc, qw3, 123]      False
1  ab12  [3e4r5, 12we3, asd23, q2w3]      False

which has two Falses.

Trying this: df['checkisin']=df.apply(lambda x:x.check.isin(x.checklist)) gives this error:

AttributeError: ("'Series' object has no attribute 'check'", 'occurred at index check')

Trying this:

df['checkisin']=df.apply(lambda x:x['check'] in x.checklist)

gives this error:

KeyError: ('check', 'occurred at index check')

I'm sure I'm missing something simple here. I know I could loop this, but looking for a Pandas Dataframe column wise solution as the DF I have is very large and trying to "most" efficiently handle.

Thanks!

cs95 · Accepted Answer · 2019-06-30T20:52:19.213

7

You have a column of lists, of course pandas doesn't have any functions that natively support operations on data structured this badly. If you want the most performance, I'd recommend a list comprehension:

df['checkisin'] = [c in l for c, l in zip(df['check'], df['checklist'])]
df
  check                    checklist  checkisin
0   123              [abc, qw3, 123]       True
1  ab12  [3e4r5, 12we3, asd23, q2w3]      False

You might consider implementing try-except error handling if you're worried about NaNs and type mismatches:

def check_isin(check, checklist):
    try:
        return check in checklist
    except TypeError:
        return np.NaN

df['checkisin'] = [
    check_isin(c, l) for c, l in zip(df['check'], df['checklist'])
]

Evidence suggests list comprehensions are the most desirable option for operations that cannot be vectorized.

PS, consider converting your column of lists into a column of sets if you plan to do membership tests a lot.

Here's an example of how this operation can be vectorized.

from itertools import chain

cl = df.pop('checklist')
df = (pd.DataFrame(df.reset_index().values.repeat(cl.str.len(), axis=0), 
                   columns=['group', *df.columns])
        .assign(checklist=list(chain.from_iterable(cl))))

df

   group check checklist
0      0   123       abc
1      0   123       qw3
2      0   123       123
3      1  ab12     3e4r5
4      1  ab12     12we3
5      1  ab12     asd23
6      1  ab12      q2w3
7      1  ab12       123

(df['check'] == df['checklist']).groupby(df.group).any()

group
0     True
1    False
dtype: bool

edited Jun 30 '19 at 20:52

answered Jun 30 '19 at 01:57

cs95

379,657
97
704
746

there are many use cases for having lists in a column. In my case, the lists can be very long so putting each item in its own column would not work. In addition the lists can be very different sizes. How would you recommend structuring the data in this case? Would it be better to just make each list a string and use .contains? – clg4 Jun 30 '19 at 02:48
@clg4 I don't think that'd work either, because you would need to do a row-wise comparison (str.contains just can't do that). I've edited my answer with my recommendation on how your data should be structured and how you can vectorise this with groupby. PS, you will need to add an extra column to identify groups. See above. – cs95 Jun 30 '19 at 02:53
I am on the road. I will look at these when I get back to the lab. Stay tuned and thx – clg4 Jun 30 '19 at 02:56
tried this and it seems to fail see edit in your answer – clg4 Jun 30 '19 at 14:14
@clg4 Can you explain please? – cs95 Jun 30 '19 at 14:16
1

@clg4 If you are trying to assign back you should use `transform`: `df['checkisin'] = (df['check'] == df['checklist']).groupby(df['group']).transform('any')` – cs95 Jun 30 '19 at 14:23
1

OK, all work. added timeits to all answers for 10,000 row df. answer 1)62.2u,2 with error check) 84.2u, 3 vectorized) 1200u. So for yours the first 2 look best. Thx and stay tuned for winner. – clg4 Jun 30 '19 at 14:41

BENY · Answer 2 · 2019-06-30T02:05:47.250

4

I will using isin with Series, isin with Series will match the index first which is different from list

pd.DataFrame(df.checklist.tolist(),index=df.index).isin(df.check).any(1)
Out[496]: 
0     True
1    False
dtype: bool

Or

pd.DataFrame(df.checklist.tolist(),index=df.index).eq(df.check,0).any(1)

edited Jun 30 '19 at 02:05

answered Jun 30 '19 at 02:02

BENY

317,841
20
164
234

@cs95 I check it it is same as eq ...so should be work – BENY Jun 30 '19 at 02:05
I didn't know df.isin only checks row-wise... TIL! – cs95 Jun 30 '19 at 02:05
1

@cs95 only when pass the series :-) – BENY Jun 30 '19 at 02:06
Interesting, from my tests it seems eq is slightly faster than isin, but both are slightly slower than list comps. – cs95 Jun 30 '19 at 02:08
@cs95 that is true match the index in the back end is almost equal to for loop – BENY Jun 30 '19 at 02:10
All work, added timeit for 10000 row df was 124u, so slower than @cs95s best options. Good job though thanks – clg4 Jun 30 '19 at 14:43

score 4 · Answer 3 · answered Jun 30 '19 at 03:30

4

`map`

df.assign(checkisin=[*map(lambda s, x: s in x, *map(df.get, df))])

  check                    checklist  checkisin
0   123              [abc, qw3, 123]       True
1  ab12  [3e4r5, 12we3, asd23, q2w3]      False

If your dataframe has more columns, you can be more explicit

cols = ['check', 'checklist']
df.assign(checkisin=[*map(lambda s, x: s in x, *map(df.get, cols))])

answered Jun 30 '19 at 03:30

piRSquared

285,575
57
475
624

timeit of 146u so slower but works! upvote but have to give @cs95 the win. Thanks . – clg4 Jun 30 '19 at 14:50
1

But I win on style, right?! Tell me I win on style, or my ego can’t stand it. I would have won the speed contest too if it wasn’t for that meddling @cs95 /drats – piRSquared Jun 30 '19 at 15:10

René · Answer 4 · 2019-06-30T09:09:11.187

1

You could try:

df['checkisin'] = [v in df.checklist[i] for i, v in enumerate(df.check)]

or:

df['checkisin'] = [i in j for i, j in zip(df.check, df.checklist)]

or:

df['checkisin'] = list(map(lambda i, j: i in j, df.check, df.checklist))

or (when you prefer df.assign):

df.assign(checkisin=[*map(lambda i, j: i in j, df.check, df.checklist)])

Result:

  check                    checklist  checkisin
0   123              [abc, qw3, 123]       True
1  ab12  [3e4r5, 12we3, asd23, q2w3]      False

edited Jun 30 '19 at 09:09

answered Jun 30 '19 at 07:44

René

4,594
5
23
52

1

added timeits for your options on a 10000 row dataframe. All options work. times 1 enumerate) 628u, 2 zip) 125u, 3 map) 93.4u 4 assign) 238u. All work but all more time than @cs95. So upvote and thanks but cs95 the winner... – clg4 Jun 30 '19 at 14:46

Pandas - Check if column of strings in column of lists row wise

4 Answers4

map

`map`