
I have a pandas DataFrame with a column of lists, in the format below.

[apple]
[banana]
[apple, orange]

I would like to convert this so that it contains only the unique values, with one value per row:

apple
banana
orange
anky
scott martin

2 Answers

First unnest your list to rows, then use drop_duplicates:

import numpy as np
import pandas as pd

# Make example dataframe
df = pd.DataFrame({'Col1':[['apple'], ['banana'], ['apple', 'orange']]})

              Col1
0          [apple]
1         [banana]
2  [apple, orange]

df = explode_list(df, 'Col1').drop_duplicates()

Output

     Col1
0   apple
1  banana
2  orange

Function used, taken from the linked answer (define it before running the call above):

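def explode_list(df, col):
    s = df[col]
    # Repeat each row position once per element in that row's list
    i = np.arange(len(s)).repeat(s.str.len())
    # Select the repeated rows, then overwrite col with the flattened values
    return df.iloc[i].assign(**{col: np.concatenate(s)})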
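
As an alternative to the helper above, pandas 0.25 and later ship a built-in DataFrame.explode that does the unnesting step. The following is a minimal sketch under that version assumption, reusing the sample column name Col1; the reset_index call is an optional extra to renumber the rows.

```python
import pandas as pd

df = pd.DataFrame({'Col1': [['apple'], ['banana'], ['apple', 'orange']]})

# explode() gives each list element its own row (the original index is repeated),
# drop_duplicates() keeps the first occurrence of each value,
# reset_index(drop=True) renumbers the remaining rows from 0.
result = df.explode('Col1').drop_duplicates().reset_index(drop=True)
print(result)
```

This avoids the NumPy helper entirely, at the cost of requiring a newer pandas.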
Erfan

You can use itertools.chain.from_iterable() to flatten the list of lists, and OrderedDict.fromkeys() to drop duplicates while maintaining order:

from collections import OrderedDict
import itertools

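# Assigning the result back as a column only works when the number of
# unique values equals the number of rows, as it does in this sample
df['col2'] = list(OrderedDict.fromkeys(itertools.chain.from_iterable(df.col)))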
print(df)

               col    col2
0          [apple]   apple
1         [banana]  banana
2  [apple, orange]  orange
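
If the number of unique values differs from the number of rows, assigning them back as a column fails with a length mismatch. A minimal sketch of one way around that, assuming it is acceptable to put the result in a new DataFrame: the fourth sample row and the names unique_values and result are made up for illustration, and a plain dict stands in for OrderedDict since dicts preserve insertion order on Python 3.7+.

```python
from itertools import chain

import pandas as pd

# Four rows but only three distinct values (the extra row is illustrative)
df = pd.DataFrame({'col': [['apple'], ['banana'], ['apple', 'orange'], ['banana']]})

# Flatten the lists, then keep the first occurrence of each value in order
unique_values = list(dict.fromkeys(chain.from_iterable(df['col'])))

# Build a separate frame so its length does not have to match df
result = pd.DataFrame({'col': unique_values})
print(result)
```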
anky
  • Thank you for that, but I am getting an error, `TypeError: 'float' object is not iterable`, even though the column we are iterating over has dtype `object`. – scott martin Jun 28 '19 at 09:08
  • @scottmartin Hmm, it works for me on the sample. Are you using this independently, or are you integrating this line with some other code? You will have to see why it fails. Not sure. – anky Jun 28 '19 at 09:11
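
One plausible cause of the `TypeError` in the comment above is missing values: an `object` column can hold NaN, which is a float and cannot be iterated. This is an assumption about the asker's data, which is not shown in the thread. Under that assumption, a minimal sketch that drops the missing rows before flattening could look like this (the sample data and the name unique_values are illustrative):

```python
from itertools import chain

import numpy as np
import pandas as pd

# Hypothetical data: the middle row is NaN instead of a list
df = pd.DataFrame({'col': [['apple'], np.nan, ['apple', 'orange']]})

# dropna() removes the NaN rows so that only real lists are flattened;
# dict.fromkeys() then keeps the first occurrence of each value in order
unique_values = list(dict.fromkeys(chain.from_iterable(df['col'].dropna())))
print(unique_values)
```

Checking `df['col'].isna().sum()` first would confirm whether missing values are really the problem.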