
I have a dataframe with two columns of string entries, e.g.:

Col1  Col2
E1    K1
E2    K2
E3    K3
...

I want to create an index on Col2, but the entries there are not unique, so I would like to merge all rows that share an entry in Col2. Say K2 and K3 are identical; then after the merge I would like the corresponding Col1 values collected into a list, forming just one row, i.e.:

Col1      Col2
E1        K1
[E2, E3]  K2
...

What would be the cleanest way of achieving this?
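For reference, the example frame can be reconstructed like this (a minimal sketch assuming pandas; the column names and values are taken from the tables above):

import pandas as pd

# Minimal reconstruction of the example above; K2 and K3 stand for
# entries that should be treated as identical.
df = pd.DataFrame({'Col1': ['E1', 'E2', 'E3'],
                   'Col2': ['K1', 'K2', 'K3']})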

gen
  • how do you determine that K2 and K3 are identical? – Onyambu Jul 19 '18 at 23:25
  • @Onyambu k2 == k3; the entries are all strings; sorry for the lack of clarity – gen Jul 19 '18 at 23:41
  • Then you should just use K2 in both cases; there's no need for K3. Replace K3 with K2. – Onyambu Jul 19 '18 at 23:42
  • _why_ do you want to do this? is this for reporting/human viewability? if so, that's ok, but if it's for data analysis, it's almost certainly the wrong approach. – acushner Jul 19 '18 at 23:42
  • @acushner what approach would you suggest instead? – gen Jul 20 '18 at 02:35
  • Possible duplicate of [Pandas groupby: How to get a union of strings](https://stackoverflow.com/questions/17841149/pandas-groupby-how-to-get-a-union-of-strings) – gen Jul 20 '18 at 17:59

3 Answers


First change K3 into K2, then group by Col2 and aggregate on it:

df[df == 'K3'] = 'K2'
df.groupby('Col2').agg(list)
          Col1
Col2          
K1        [E1]
K2    [E2, E3]
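A self-contained sketch of the same steps, assuming the example frame from the question:

import pandas as pd

df = pd.DataFrame({'Col1': ['E1', 'E2', 'E3'],
                   'Col2': ['K1', 'K2', 'K3']})

# Treat K3 as K2, then collect the Col1 values for each Col2 key.
df[df == 'K3'] = 'K2'
result = df.groupby('Col2').agg(list)

# Chain .reset_index() if Col2 should stay a column rather than the index.
print(result.reset_index())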
Onyambu

You can use a mapper with groupby:

df.set_index('Col2').groupby(by={'K1':'K1','K2':'K2','K3':'K2'},axis=0).Col1.apply(list)
Out[116]: 
K1        [E1]
K2    [E2, E3]
Name: Col1, dtype: object
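If you want a two-column frame like in the question rather than a Series, you can convert the result back (a small sketch, assuming the df from the question; rename_axis just guards against the grouped index losing its name):

out = (df.set_index('Col2')
         .groupby(by={'K1': 'K1', 'K2': 'K2', 'K3': 'K2'}, axis=0)
         .Col1.apply(list)
         .rename_axis('Col2')
         .reset_index())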
BENY
  • Perhaps my original post was misleading, but the table above was just an example. A priori I have no way of knowing which indices occur repeatedly, and in particular the dataframe has a variable number of rows. – gen Jul 20 '18 at 04:04

Since you don't know a priori what the dupes are, you can do:

df.groupby('Col2').agg(lambda s: list(s))

For some reason .agg(list) doesn't work for me...
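A quick usage sketch, assuming a frame where the duplicate key already appears in Col2:

import pandas as pd

# Duplicate key already present in Col2 (no K3 placeholder needed).
df = pd.DataFrame({'Col1': ['E1', 'E2', 'E3'],
                   'Col2': ['K1', 'K2', 'K2']})

print(df.groupby('Col2').agg(lambda s: list(s)))
#           Col1
# Col2
# K1        [E1]
# K2    [E2, E3]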

acushner