I have a pandas DataFrame in which one of the columns holds a list of strings in each row. E.g.:

df.head()

Col1    Col2  
1       ["name1","name2","name3"]  
1       ["name3","name2","name4"]  
2       ["name1","name2","name3"] 
2       ["name1","name5","name6"] 

I need to concatenate the Col2 lists grouped by Col1, keeping only the unique names. I want something like:

Col1    Col2  
1       ["name1","name2","name3","name4"]  
2       ["name1","name2","name3","name5","name6"]

I tried using a groupby as

df.groupby("Col1").agg({"Col2": lambda x: pd.Series.append(x)})

But this throws an error saying two parameters are required. I also tried using sum in the agg function; that fails with a "does not reduce" error.

How do I do this?


2 Answers

Yeah, you wouldn't be able to use .agg() on list-valued data like this. Anyway, here's my stab at the problem, with the help of numpy (commented for clarity):

import numpy as np
import pandas as pd

# Get the unique values of the grouping column ("Col1")
groups = df["Col1"].unique()

# Create an empty dict to collect values on each iteration
d = {}

for val in groups:

    # Record the unique "Col1" value (e.g., 1)
    d.setdefault("Col1", []).append(val)

    # Create a sub-DataFrame for this "Col1" value
    sdf = df.loc[df["Col1"] == val]

    # Flatten the lists in "Col2" with a list comprehension
    col2_unis = [name for row in sdf["Col2"].values for name in row]

    # np.unique returns the sorted unique values
    d.setdefault("Col2", []).append(np.unique(col2_unis))

new_df = pd.DataFrame(d)

print(new_df)

A more condensed version would look like:

d = {}
for val in df["Col1"].unique():
    d.setdefault("Col1", []).append(val)
    d.setdefault("Col2", []).append(
        np.unique([name for row in df.loc[df["Col1"] == val, "Col2"] for name in row]))
new_df = pd.DataFrame(d)
print(new_df)

Learn more about Python's .setdefault() method for dictionaries by checking out this related SO question.
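
For anyone unfamiliar with it, here is a minimal sketch of how .setdefault() builds up the lists under each key (the values are made up for illustration):

d = {}

# If the key is missing, insert it with [] first; either way, return the list
d.setdefault("Col1", []).append(1)
d.setdefault("Col1", []).append(2)

print(d)  # {'Col1': [1, 2]}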


You can use groupby with apply and a custom function: first flatten the nested lists with chain (the fastest solution), then remove duplicates with set, convert back to a list, and finally sort:

import pandas as pd
from itertools import chain

df = pd.DataFrame({'Col1':[1,1,2,2],
                   'Col2':[["name1","name2","name3"],
                           ["name3","name2","name4"],
                           ["name1","name2","name3"],
                           ["name1","name5","name6"]]})

print (df)
   Col1                   Col2
0     1  [name1, name2, name3]
1     1  [name3, name2, name4]
2     2  [name1, name2, name3]
3     2  [name1, name5, name6]
print (df.groupby('Col1')['Col2']
         .apply(lambda x: sorted(list(set(list(chain.from_iterable(x))))))
         .reset_index())
   Col1                                 Col2
0     1         [name1, name2, name3, name4]
1     2  [name1, name2, name3, name5, name6]

The solution can be even simpler; only chain, set and sorted are necessary:

print (df.groupby('Col1')['Col2']
         .apply(lambda x: sorted(set(chain.from_iterable(x))))
         .reset_index())

   Col1                                 Col2
0     1         [name1, name2, name3, name4]
1     2  [name1, name2, name3, name5, name6]
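
For reference, here is a minimal standalone sketch of what the lambda does to one group's lists (the input below is just the two Col1 == 1 rows from the sample data):

from itertools import chain

group = [["name1", "name2", "name3"], ["name3", "name2", "name4"]]

# chain.from_iterable flattens exactly one level of nesting
flat = list(chain.from_iterable(group))
print(flat)  # ['name1', 'name2', 'name3', 'name3', 'name2', 'name4']

# set drops the duplicates, sorted turns the set back into a sorted list
print(sorted(set(flat)))  # ['name1', 'name2', 'name3', 'name4']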