7

I have a dataframe which you can build with this:

dflist=[['123',['abc','qw3','123']],
        ['ab12',['3e4r5','12we3','asd23','q2w3']]]
df=pd.DataFrame(dflist,columns=['check','checklist'])

And looks like this:

  check                    checklist
0   123              [abc, qw3, 123]
1  ab12  [3e4r5, 12we3, asd23, q2w3]

I want to check if the item in column "check" is in the list in column "checklist". So I want the resulting dataframe to look like:

  check                    checklist checkisin
0   123              [abc, qw3, 123]      True
1  ab12  [3e4r5, 12we3, asd23, q2w3]     False

I have tried several things including using .isin in various forms including apply/lambda. and directly.

This:

df['checkisin']=df.check.isin(df.checklist)

produces:

  check                    checklist  checkisin
0   123              [abc, qw3, 123]      False
1  ab12  [3e4r5, 12we3, asd23, q2w3]      False

which has two Falses.

Trying this: df['checkisin']=df.apply(lambda x:x.check.isin(x.checklist)) gives this error:

AttributeError: ("'Series' object has no attribute 'check'", 'occurred at index check')

Trying this:

df['checkisin']=df.apply(lambda x:x['check'] in x.checklist)

gives this error:

KeyError: ('check', 'occurred at index check')

I'm sure I'm missing something simple here. I know I could loop this, but looking for a Pandas Dataframe column wise solution as the DF I have is very large and trying to "most" efficiently handle.

Thanks!

clg4
  • 2,863
  • 6
  • 27
  • 32

4 Answers4

7

You have a column of lists, of course pandas doesn't have any functions that natively support operations on data structured this badly. If you want the most performance, I'd recommend a list comprehension:

df['checkisin'] = [c in l for c, l in zip(df['check'], df['checklist'])]
df
  check                    checklist  checkisin
0   123              [abc, qw3, 123]       True
1  ab12  [3e4r5, 12we3, asd23, q2w3]      False

You might consider implementing try-except error handling if you're worried about NaNs and type mismatches:

def check_isin(check, checklist):
    try:
        return check in checklist
    except TypeError:
        return np.NaN

df['checkisin'] = [
    check_isin(c, l) for c, l in zip(df['check'], df['checklist'])
]

Evidence suggests list comprehensions are the most desirable option for operations that cannot be vectorized.

PS, consider converting your column of lists into a column of sets if you plan to do membership tests a lot.


Here's an example of how this operation can be vectorized.

from itertools import chain

cl = df.pop('checklist')
df = (pd.DataFrame(df.reset_index().values.repeat(cl.str.len(), axis=0), 
                   columns=['group', *df.columns])
        .assign(checklist=list(chain.from_iterable(cl))))

df

   group check checklist
0      0   123       abc
1      0   123       qw3
2      0   123       123
3      1  ab12     3e4r5
4      1  ab12     12we3
5      1  ab12     asd23
6      1  ab12      q2w3
7      1  ab12       123

(df['check'] == df['checklist']).groupby(df.group).any()

group
0     True
1    False
dtype: bool
cs95
  • 379,657
  • 97
  • 704
  • 746
  • there are many use cases for having lists in a column. In my case, the lists can be very long so putting each item in its own column would not work. In addition the lists can be very different sizes. How would you recommend structuring the data in this case? Would it be better to just make each list a string and use .contains? – clg4 Jun 30 '19 at 02:48
  • @clg4 I don't think that'd work either, because you would need to do a row-wise comparison (str.contains just can't do that). I've edited my answer with my recommendation on how your data should be structured and how you can vectorise this with groupby. PS, you will need to add an extra column to identify groups. See above. – cs95 Jun 30 '19 at 02:53
  • I am on the road. I will look at these when I get back to the lab. Stay tuned and thx – clg4 Jun 30 '19 at 02:56
  • tried this and it seems to fail see edit in your answer – clg4 Jun 30 '19 at 14:14
  • @clg4 Can you explain please? – cs95 Jun 30 '19 at 14:16
  • 1
    @clg4 If you are trying to assign back you should use `transform`: `df['checkisin'] = (df['check'] == df['checklist']).groupby(df['group']).transform('any')` – cs95 Jun 30 '19 at 14:23
  • 1
    OK, all work. added timeits to all answers for 10,000 row df. answer 1)62.2u,2 with error check) 84.2u, 3 vectorized) 1200u. So for yours the first 2 look best. Thx and stay tuned for winner. – clg4 Jun 30 '19 at 14:41
4

I will using isin with Series, isin with Series will match the index first which is different from list

pd.DataFrame(df.checklist.tolist(),index=df.index).isin(df.check).any(1)
Out[496]: 
0     True
1    False
dtype: bool

Or

pd.DataFrame(df.checklist.tolist(),index=df.index).eq(df.check,0).any(1)
BENY
  • 317,841
  • 20
  • 164
  • 234
4

map

df.assign(checkisin=[*map(lambda s, x: s in x, *map(df.get, df))])

  check                    checklist  checkisin
0   123              [abc, qw3, 123]       True
1  ab12  [3e4r5, 12we3, asd23, q2w3]      False

If your dataframe has more columns, you can be more explicit

cols = ['check', 'checklist']
df.assign(checkisin=[*map(lambda s, x: s in x, *map(df.get, cols))])
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • timeit of 146u so slower but works! upvote but have to give @cs95 the win. Thanks . – clg4 Jun 30 '19 at 14:50
  • 1
    But I win on style, right?! Tell me I win on style, or my ego can’t stand it. I would have won the speed contest too if it wasn’t for that meddling @cs95 /drats – piRSquared Jun 30 '19 at 15:10
1

You could try:

df['checkisin'] = [v in df.checklist[i] for i, v in enumerate(df.check)]

or:

df['checkisin'] = [i in j for i, j in zip(df.check, df.checklist)]

or:

df['checkisin'] = list(map(lambda i, j: i in j, df.check, df.checklist))

or (when you prefer df.assign):

df.assign(checkisin=[*map(lambda i, j: i in j, df.check, df.checklist)])

Result:

  check                    checklist  checkisin
0   123              [abc, qw3, 123]       True
1  ab12  [3e4r5, 12we3, asd23, q2w3]      False
René
  • 4,594
  • 5
  • 23
  • 52
  • 1
    added timeits for your options on a 10000 row dataframe. All options work. times 1 enumerate) 628u, 2 zip) 125u, 3 map) 93.4u 4 assign) 238u. All work but all more time than @cs95. So upvote and thanks but cs95 the winner... – clg4 Jun 30 '19 at 14:46