0

How can I create new flag columns in a pandas DataFrame that indicate whether each element of a list appears in another column? Update: the list will be updated frequently and can be very long, so the flags need to be generated from the list dynamically rather than hard-coded.

Thank you so much.

list =['apple', 'banana', 'peach']

Input dataframe:

[image not included]

Output dataframe:

[image not included]

Corralien
lionking19063
  • The list will be updated frequently and can be very dynamic and long. Is there any way to create dynamic flags based on a dynamic list? Thank you. – lionking19063 Feb 01 '22 at 18:33

4 Answers

4

Explode the `fruit` column into one row per fruit name, then pivot the dataframe:

out = df.join(df['fruit'].str.split().explode().reset_index().assign(count=1)
                         .pivot_table('count', 'index', 'fruit', fill_value=0)
                         .add_prefix('flag_'))

Output:

>>> out
                fruit  flag_apple  flag_banana  flag_peach
0        apple banana           1            1           0
1         apple peach           1            0           1
2               peach           0            0           1
3              banana           0            1           0
4               apple           1            0           0
5  apple banana peach           1            1           1
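
Since the list in the question changes often, one possible extension is to reindex the pivoted columns against that list, so that every listed item gets a flag column even when it never occurs in the data. This is only a sketch: `fruits`, the extra `'mango'` entry, and the `.astype(int)` cast are assumptions added for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical dynamic list; 'mango' never occurs in the data below.
fruits = ['apple', 'banana', 'peach', 'mango']
df = pd.DataFrame({'fruit': ['apple banana', 'apple peach', 'peach',
                             'banana', 'apple', 'apple banana peach']})

flags = (df['fruit'].str.split().explode().reset_index().assign(count=1)
           .pivot_table(values='count', index='index', columns='fruit',
                        fill_value=0)
           # Force exactly one column per list entry, in list order;
           # entries that never appear become all-zero columns.
           .reindex(columns=fruits, fill_value=0)
           .astype(int)
           .add_prefix('flag_'))

out = df.join(flags)
print(out)
```

Fruit names found in the data but missing from `fruits` are dropped by the `reindex` step, which may or may not be what you want.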
Corralien
2

Here's a quick implementation of what I think you're trying to do.

import pandas as pd

fruits = ['apple','banana','peach'] # list of fruit
df = pd.DataFrame(                  # build dataframe
    {'fruit':[
        'apple banana',
        'apple peach',
        'peach',
        'banana',
        'apple',
        'apple banana peach']})

for f in fruits:
    df[f'flag_{f}'] = df['fruit'].str.count(f)
print(df)

Resulting output:

                fruit  flag_apple  flag_banana  flag_peach
0        apple banana           1            1           0
1         apple peach           1            0           1
2               peach           0            0           1
3              banana           0            1           0
4               apple           1            0           0
5  apple banana peach           1            1           1
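
One caveat with `str.count`: it treats each list entry as a regular expression and counts substring matches, so `'pineapple'` would be counted for `'apple'`, and a repeated fruit yields 2 rather than 1. If whole-word 0/1 flags are needed, a variant along these lines might work. It swaps in `str.contains` with word boundaries and `re.escape`; the sample rows are made up to show the difference.

```python
import re
import pandas as pd

fruits = ['apple', 'banana']                      # hypothetical list
df = pd.DataFrame({'fruit': ['apple banana',      # made-up rows chosen to
                             'pineapple',         # show substring and
                             'apple apple']})     # repeat behaviour

for f in fruits:
    # \b...\b matches whole words only; re.escape guards against list
    # entries that contain regex metacharacters.
    pattern = rf'\b{re.escape(f)}\b'
    df[f'flag_{f}'] = df['fruit'].str.contains(pattern).astype(int)

# For comparison, the raw substring count on the same column.
raw_counts = df['fruit'].str.count('apple').tolist()
print(df)
print(raw_counts)
```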
Ben Grossmann
1

Here is my attempt:

import pandas as pd


fruits = ['apple','banana','peach']
d = {"fruit" : ["apple banana", "apple peach", "peach","banana", "apple","apple banana peach"]}

df = pd.DataFrame(d)
x=[]
for elem in d['fruit']:
    x.append(elem.split(" "))

for f in fruits:
    df[f'flag_{f}'] = list(map(lambda e: int(f in e), x))
print(df)

I break the strings up into lists first and then check for membership using a lambda to create the new flag columns.

Output:

                fruit  flag_apple  flag_banana  flag_peach
0        apple banana           1            1           0
1         apple peach           1            0           1
2               peach           0            0           1
3              banana           0            1           0
4               apple           1            0           0
5  apple banana peach           1            1           1
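
Note that the code above calls the built-in `list(...)`. If the name `list` has been rebound earlier in the same session, as with `list = [...]` in the question, that call raises `TypeError: 'list' object is not callable`. A sketch of the same membership check that avoids calling `list` at all, using a list comprehension over the split tokens (the variable name `tokens` is my own):

```python
import pandas as pd

fruits = ['apple', 'banana', 'peach']
df = pd.DataFrame({'fruit': ['apple banana', 'apple peach', 'peach',
                             'banana', 'apple', 'apple banana peach']})

# Split each string into a Python list of words once, up front.
tokens = df['fruit'].str.split()

for f in fruits:
    # Exact word membership per row; no call to the built-in list().
    df[f'flag_{f}'] = [int(f in words) for words in tokens]

print(df)
```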
Richard K Yu
  • Two points: first, you should generally avoid looping through the rows of a data frame; see [this post](https://stackoverflow.com/a/55557758/2476977) or [this article](https://towardsdatascience.com/you-dont-always-have-to-loop-through-rows-in-pandas-22a970b347ac) for details on that. Second, there is no need to split the elements of the fruit column; `a in b` checks whether string `a` is a substring of string `b`. – Ben Grossmann Feb 01 '22 at 18:27
  • @BenGrossmann Thanks for taking the time to look through my solution and reply. I will read through these articles - I always wondered why I don't see iterative solutions for questions involving pandas! Turns out there was a reason all along – Richard K Yu Feb 01 '22 at 18:33
  • Awesome! works quite well. Thank you. – lionking19063 Feb 01 '22 at 18:45
  • In the first time, it runs great. Now it has an error "TypeError: 'list' object is not callable". Any insight? Thanks. – lionking19063 Feb 01 '22 at 19:52
  • @lionking19063 Are you running the same code exactly or is it using a different input that gives the TypeError? – Richard K Yu Feb 01 '22 at 20:08
  • @Richard K Yu After restarting the session, it works fine. Thank you. – lionking19063 Feb 01 '22 at 22:49
1

Use explode and unstack:

(df.assign(f = df['fruit'].str.split())
   .explode('f')
   .assign(v=1)
   .set_index(['fruit','f'])
   .unstack(fill_value=0)
   .droplevel(level=0,axis=1)
   .rename(columns = lambda c : f'flag_{c}')
   .reset_index()
)

Output:

    fruit                 flag_apple    flag_banana    flag_peach
--  ------------------  ------------  -------------  ------------
 0  apple                          1              0             0
 1  apple banana                   1              1             0
 2  apple banana peach             1              1             1
 3  apple peach                    1              0             1
 4  banana                         0              1             0
 5  peach                          0              0             1
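
Because `fruit` is used as the index here, the result comes back sorted by that string rather than in the original row order, and `unstack` will raise on duplicate index entries if two rows hold the same `fruit` string. A possible variant keeps the original row index as part of the MultiIndex and joins the flags back. This is a sketch with made-up data containing a repeated row, and it still assumes no fruit is repeated within a single row.

```python
import pandas as pd

# Made-up data: rows 1 and 2 are identical on purpose.
df = pd.DataFrame({'fruit': ['apple banana', 'apple', 'apple']})

flags = (df['fruit'].str.split()
           .explode()                    # keeps the original row index
           .rename('f')
           .to_frame()
           .assign(v=1)
           .set_index('f', append=True)['v']   # index = (row, fruit name)
           .unstack(fill_value=0)              # fruit names become columns
           .add_prefix('flag_'))

out = df.join(flags)                     # original order and rows preserved
print(out)
```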
piterbarg
  • I suggest you: 1. Replace `.unstack().fillna` by `unstack(fill_value=0)`, 2. Replace `.rename(...)` by `.add_prefix('flag_')`. 3. Remove `.astype(int)`. – Corralien Feb 01 '22 at 17:35
  • excellent tips, will do. actually will keep `rename` as is to show that there are different options – piterbarg Feb 01 '22 at 17:37