
I am trying to add a column to the dataframe below that tells me whether a person belongs to the category green. It should just show Y or N, depending on whether that person's category column contains it. The problem is that the category column holds a plain string in some rows, a list of strings in others, and in some even a list of lists.


import pandas as pd

df = pd.DataFrame({'user': ['Bob', 'Jane','Theresa', 'Alice'], 
                   'category': [[['green'],['red']],'blue',['green'],[['yellow','purple'],'green','brown']]})

How can I check, for each row, whether the column contains the specific string 'green'?

Thank you.

cs95
Rodrigo

3 Answers


I would not bother flattening the list, just use basic string matching:

df['category'].astype(str).str.contains(r'\bgreen\b')

0     True
1    False
2     True
3     True
Name: category, dtype: bool

Add the word boundary check \b so we don't accidentally match words like "greenery" or "greenwich" which have "green" as part of a larger word.
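To see why the word boundary matters, here is a small sketch that applies the same pattern with plain `re` to the stringified values, which is roughly what `astype(str).str.contains(...)` does for each row. The sample values `'greenery'` and `None` are made up for illustration and are not in the question's data.

```python
import re

# Hypothetical sample values; 'greenery' and None are not in the question's data.
values = [[['green'], ['red']], 'blue', ['greenery'], None]

# Converting to str gives the printed form of each cell, e.g. "[['green'], ['red']]",
# so the nesting depth no longer matters to the regex.
as_text = [str(v) for v in values]

loose = [bool(re.search(r'green', s)) for s in as_text]
strict = [bool(re.search(r'\bgreen\b', s)) for s in as_text]

print(as_text)
print(loose)   # [True, False, True, False]
print(strict)  # [True, False, False, False]
```

Without `\b`, the `'greenery'` row is reported as a match; with it, only the exact word counts.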


df.assign(has_green=df['category'].astype(str)
                                  .str.contains(r'\bgreen\b')
                                  .map({True: 'Y', False: 'N'}))

      user                          category has_green
0      Bob                  [[green], [red]]         Y
1     Jane                              blue         N
2  Theresa                           [green]         Y
3    Alice  [[yellow, purple], green, brown]         Y
cs95
  • Great answer, but I have a question about the `str` accessor. I generally know what `str` does, but in situations like this I get confused. For instance, the first row holds a list of lists; I guess `str` accesses the values inside the outer list, but `green` is nested in another list. So shouldn't we add another `str` to reach the nested lists? I'd appreciate it if you could recommend a source for fully grasping how `str` works on a Series, or explain here what is happening. Thanks! – ashkangh Mar 08 '21 at 20:05
  • @ashkangh I converted the list to a string so it doesn't matter what's inside the string anymore - it's just letters. – cs95 Mar 08 '21 at 22:37

You need to use a recursive flatten.

import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane','Theresa', 'Alice'], 'category': [[['green'],['red']],'blue',['green'],[['yellow','purple'],'green','brown']]})

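def flatten(x):
    # A bare string (or any other non-list value) counts as a single item;
    # iterating over it would split it into characters, so 'green' would be missed.
    if not isinstance(x, list):
        return [x]
    rt = []
    for i in x:
        if isinstance(i, list):
            rt.extend(flatten(i))
        else:
            rt.append(i)
    return rt

def is_green(x):
    # True if 'green' is one of the flattened items
    return "green" in flatten(x)

df["has_green"] = df["category"].apply(is_green)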

print(df)
      user                          category  has_green
0      Bob                  [[green], [red]]       True
1     Jane                              blue      False
2  Theresa                           [green]       True
3    Alice  [[yellow, purple], green, brown]       True
Avi Thaker
  • I am trying the solution now. The only problem, which I didn't mention before, is that some rows have None. Where would you add an else condition, or what would you do if, for example, Jane's category was None? Thank you! – Rodrigo Mar 08 '21 at 19:43
  • @Rodrigo In is_green(), you can add a check for None: if x is None: ... Please let me know if this helps, and accept the answer if it does. – Avi Thaker Mar 08 '21 at 23:47
  • It does, thank you! How can I accept two answers as correct? – Rodrigo Mar 11 '21 at 20:21
  • You cannot; you can only choose one. – Avi Thaker Mar 11 '21 at 23:20
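The None check discussed in the comments could look something like the sketch below. It is only one possible reading of that suggestion: the name `green_flag` and the sample values are hypothetical, and it returns 'Y'/'N' as the question asked rather than True/False.

```python
def flatten(x):
    """Return the leaf values of an arbitrarily nested list as a flat list."""
    if not isinstance(x, list):
        return [x]  # a bare string (or any non-list value) counts as one leaf
    flat = []
    for item in x:
        flat.extend(flatten(item))
    return flat

def green_flag(x):
    """Return 'Y' if 'green' is one of the leaves, else 'N'. None gives 'N'."""
    if x is None:
        return 'N'
    return 'Y' if 'green' in flatten(x) else 'N'

# Hypothetical sample values covering nested lists, a bare string and None.
samples = [[['green'], ['red']], 'blue', None,
           [['yellow', 'purple'], 'green', 'brown'], 'green']
print([green_flag(s) for s in samples])  # ['Y', 'N', 'N', 'Y', 'Y']
```

On the dataframe this would be applied as `df['has_green'] = df['category'].apply(green_flag)`.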

Although I would agree that basic string matching serves the purpose of the question, I would like to draw attention to the fact that flattening lists can be achieved quite easily with pd.core.common.flatten (an internal pandas helper, so it may change between versions):

import pandas as pd
import ast

df = pd.DataFrame({'user': ['Bob', 'Jane','Theresa', 'Alice', 'John'], 
                   'category': [[['green'],['red']],'blue',['green'],[['yellow','purple'],'green','brown'], None]})

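def fix_list(text):
    try:
        if '[' in text:
            # Stringified list such as "['green', 'red']": parse it safely
            text = ast.literal_eval(text)
        else:
            # Bare string or real list: wrap it so flatten sees one container
            text = [text]
    except (TypeError, ValueError, SyntaxError):
        # None and other unparseable values become an empty list
        text = []
    return list(pd.core.common.flatten(text))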
    
df['category'] = df['category'].apply(fix_list)
df['green'] = df['category'].apply(lambda x: 'green' in x)

Result:

      user                                category  green
0      Bob                        ['green', 'red']   True
1     Jane                                ['blue']  False
2  Theresa                               ['green']   True
3    Alice  ['yellow', 'purple', 'green', 'brown']   True
4     John                                      []  False
RJ Adriaansen