How to Retain Columns of lists in a dataframe with a specific value?

Question

Hey, I have a dataframe as shown:

id     A      B
1      2     ['a', 'c', 'd']
3      4     ['s', 'z', 'a', 'e']
5      6     ['b', 'z', 'd']
7      8     ['a', 'g']

Now, I would like to extract all rows that have 'a' in column "B". Desired output:

id     A      B
1      2     ['a', 'c', 'd']
3      4     ['s', 'z', 'a', 'e']
7      8     ['a', 'g']

Any help accomplishing the above in Python using pandas would be appreciated :)

Thank you in advance for the help :)

Just FYI, apply is a time-costly function; use it carefully. – BENY Nov 01 '19 at 02:43

score 1 · Answer 1 · answered Nov 01 '19 at 02:22

We can do:

df[pd.DataFrame(df.B.tolist()).eq('a').any(axis=1).values]
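To make the one-liner above easier to follow, here is a minimal sketch that splits it into steps. The frame construction is an assumption reconstructed from the table in the question (the question itself gives no code), and the intermediate names `wide` and `mask` are only illustrative.

```python
import pandas as pd

# Sample frame reconstructed from the question (assumed construction).
df = pd.DataFrame({
    "id": [1, 3, 5, 7],
    "A": [2, 4, 6, 8],
    "B": [["a", "c", "d"], ["s", "z", "a", "e"], ["b", "z", "d"], ["a", "g"]],
})

# One column per list position; shorter lists are padded with None.
wide = pd.DataFrame(df["B"].tolist())

# True where an element equals 'a' exactly, then reduce across each row.
mask = wide.eq("a").any(axis=1).to_numpy()

# Boolean indexing keeps only the rows whose list contains 'a'.
result = df[mask]
print(result)
```

Converting the mask to a NumPy array sidesteps index alignment, which matters if `df` does not have a default 0..n-1 index.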

BENY


[How to Answer](https://stackoverflow.com/help/how-to-answer) strongly recommends only answering well-asked questions. – Andreas Nov 01 '19 at 02:27
@Andreas I think the answer is good enough to explain itself, since there is no logic here – BENY Nov 01 '19 at 02:36
The answer is good enough, but the question lacks the code attempts from OP. Hence, there is no [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve). – Andreas Nov 01 '19 at 02:57
@Andreas aha, I got you. Question quality is a problem for new users. – BENY Nov 01 '19 at 03:33
Yes, hence it's recommended to guide them to improve the question before providing an answer – Andreas Nov 01 '19 at 04:30

ansev · Accepted Answer · 2019-11-01T04:02:50.833

Use Series.apply to build a boolean mask and index with it:

new_df = df[df['B'].apply(lambda x: 'a' in x)]
print(new_df)

   id  A             B
0   1  2     [a, c, d]
1   3  4  [s, z, a, e]
3   7  8        [a, g]

Detail:

df['B'].apply(lambda x: 'a' in x)
0     True
1     True
2    False
3     True
Name: B, dtype: bool

You can also pass a callable to df.loc:

df.loc[lambda x: x.B.str.join(',').str.contains('a')]
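One caveat with the joined-string approach: `str.contains` matches substrings, so it agrees with the membership test only when no list element merely contains the search text. A small sketch of the difference, using hypothetical data (`df2` is not from the question):

```python
import pandas as pd

# Hypothetical data with a multi-character element (not from the question).
df2 = pd.DataFrame({"id": [1, 2, 3], "B": [["a", "c"], ["banana"], ["b", "d"]]})

# Joining and searching matches substrings, so 'banana' counts as a hit.
joined_mask = df2["B"].str.join(",").str.contains("a")

# The membership test compares whole list elements.
exact_mask = df2["B"].apply(lambda items: "a" in items)

print(joined_mask.tolist())
print(exact_mask.tolist())
```

For the question's data, where every element is a single character, both masks select the same rows.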

Timings for 400 rows:

%%timeit
df[pd.DataFrame(df.B.tolist()).eq('a').any(axis=1).values]
3.72 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.loc[lambda x: x.B.str.join(',').str.contains('a')]
1.33 ms ± 90.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df[df['B'].apply(lambda x: 'a' in x)]
786 µs ± 9.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
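For anyone who wants to rerun the comparison outside IPython, here is a rough sketch using the standard `timeit` module. How the 400-row frame was built is not stated in the answer, so repeating the question's four rows 100 times is an assumption, and the labels in `candidates` are only illustrative. Absolute numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Four rows from the question, repeated to 400 rows (assumed test setup).
base = pd.DataFrame({
    "id": [1, 3, 5, 7],
    "A": [2, 4, 6, 8],
    "B": [["a", "c", "d"], ["s", "z", "a", "e"], ["b", "z", "d"], ["a", "g"]],
})
df = pd.concat([base] * 100, ignore_index=True)

# The three approaches from this thread, wrapped as zero-argument callables.
candidates = {
    "expand": lambda: df[pd.DataFrame(df["B"].tolist()).eq("a").any(axis=1).to_numpy()],
    "join": lambda: df.loc[lambda x: x["B"].str.join(",").str.contains("a")],
    "apply": lambda: df[df["B"].apply(lambda x: "a" in x)],
}

# Run each once to check that they all select the same rows.
results = {name: fn() for name, fn in candidates.items()}

# Average wall-clock time per call over 100 runs.
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=100) / 100
    print(f"{name}: {seconds * 1e6:.0f} us per loop")
```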

edited Nov 01 '19 at 04:02

answered Nov 01 '19 at 02:31

ansev


Do not use apply for this case; a for loop is even better – BENY Nov 01 '19 at 02:42
FYI, https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code – BENY Nov 01 '19 at 03:30
Thanks for the info. I already proposed an alternative solution – ansev Nov 01 '19 at 03:34
You can test the timing; I think str.join will be a little better – BENY Nov 01 '19 at 03:44
As you can see, both of my methods are similar, whereas yours is significantly worse. But obviously this might not be repeatable @WeNYoBen. – ansev Nov 01 '19 at 03:51
What sample size are you using? – BENY Nov 01 '19 at 03:58
400 rows, and the same columns as the OP – ansev Nov 01 '19 at 04:03
For 40000 rows, the apply method is three times faster, but the callable (with str.contains and join) is similar to your solution – ansev Nov 01 '19 at 04:06

score 1 · Answer 3 · answered Nov 01 '19 at 02:35

You can do it like this:

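# Start with an empty frame that has the same columns.
new_df = pd.DataFrame(columns=["id", "A", "B"])

# Copy over each row whose list in column B contains "a".
i = 0
for index, row in df.iterrows():
    if "a" in row["B"]:
        new_df.loc[i] = row
        i += 1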
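A variation on the loop above, offered as a sketch rather than a replacement: a plain list comprehension can build the boolean mask in one pass, which avoids growing `new_df` one row at a time and keeps the original index and dtypes. The frame construction is again an assumption based on the question's table.

```python
import pandas as pd

# Sample frame reconstructed from the question (assumed construction).
df = pd.DataFrame({
    "id": [1, 3, 5, 7],
    "A": [2, 4, 6, 8],
    "B": [["a", "c", "d"], ["s", "z", "a", "e"], ["b", "z", "d"], ["a", "g"]],
})

# One membership test per row's list, collected into a boolean mask.
mask = ["a" in items for items in df["B"]]

# A list of booleans works directly as a row filter.
new_df = df[mask]
print(new_df)
```

Unlike the `iterrows` version, this keeps the original row labels (0, 1, 3) instead of renumbering them.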

Eltay
