13
Dataframe:
  one two
a  1  x
b  1  y
c  2  y
d  2  z
e  3  z

grp = df.groupby('one')
grp.agg(lambda x: ???)  # or an equivalent function

Desired output from grp.agg:

one two
1   x|y
2   y|z
3   z

Before I moved to DataFrames, my agg function was "|".join(sorted(set(x))). Ideally I want to support any number of columns in the group, with agg returning "|".join(sorted(set(x))) for each column, as shown for column two above. I also tried np.char.join().

Love Pandas. It has taken me from an 800-line complicated program to a 400-line walk in the park that zooms. Thank you :)

Owen
brian_the_bungler

3 Answers

16

You were so close:

In [1]: df.groupby('one').agg(lambda x: "|".join(x.tolist()))
Out[1]:
     two
one
1    x|y
2    y|z
3      z

Expanded answer to handle sorting and take only the set:

In [1]: df = DataFrame({'one':[1,1,2,2,3], 'two':list('xyyzz'), 'three':list('eecba')}, index=list('abcde'), columns=['one','two','three'])

In [2]: df
Out[2]:
   one two three
a    1   x     e
b    1   y     e
c    2   y     c
d    2   z     b
e    3   z     a

In [3]: df.groupby('one').agg(lambda x: "|".join(x.sort_values().unique().tolist()))
Out[3]:
     two three
one
1    x|y     e
2    y|z   b|c
3      z     a
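
For readers on a recent pandas release, the session above can be reproduced as a standalone script. This is a minimal sketch, assuming pandas is installed; the helper name `join_unique` is hypothetical and does not come from the original answer.

```python
import pandas as pd

# Same sample data as in the session above.
df = pd.DataFrame(
    {"one": [1, 1, 2, 2, 3], "two": list("xyyzz"), "three": list("eecba")},
    index=list("abcde"),
)


def join_unique(values: pd.Series) -> str:
    """Join the sorted, de-duplicated values of one column with '|'.

    Hypothetical helper name; assumes the column holds strings.
    """
    return "|".join(values.sort_values().unique().tolist())


# agg calls join_unique once per column within each group.
result = df.groupby("one").agg(join_unique)
print(result)
```

Because the function is applied column by column, it should handle any number of string columns without changes.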
Zelazny7
  • Awesome. I was hacking out the awful `grp2.agg(lambda x: u"|".join(sorted(set(map(str, x.tolist())))))`. Thanks for showing me the ropes on using arrays for real! Where is a good reference? Thanks again. – brian_the_bungler Jan 09 '13 at 22:48
  • Honestly, IPython and experimenting with code snippets have done more for my understanding than any one resource. But Wes McKinney's Python for Data Analysis is a great reference. – Zelazny7 Jan 09 '13 at 23:03
  • I have been reading the book since Dec but still have lots to practice. FYI, I took a look at some of your HDF5 store questions; I ran into the same flexibility problems with it. I work with 3-million-row data sets with 60 columns and lots of text, and MongoDB has been a champ. – brian_the_bungler Jan 10 '13 at 03:59
  • Would you mind sharing some of your MongoDB code and how you use it with pandas? I am trying to nail down a consistent workflow for using pandas with very large datasets (but not 'big' data). I can ask a proper SE question too if you like. I also thought of one more resource: Wes's 2012 PyCon tutorial. It was very thorough and helped cement several concepts for me. – Zelazny7 Jan 10 '13 at 12:51
  • I would be glad to post it but I think a question format is the way to go. It would be neat to see what others have to say too. I will have time this weekend to do it justice. – brian_the_bungler Jan 10 '13 at 15:18
  • Thanks, I created a question here: http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas – Zelazny7 Jan 10 '13 at 16:23
  • in pandas version 1.3.1, .sort() should be replaced with .sort_values() – KH Kim Aug 01 '21 at 06:19
2

Just an elaboration on the accepted answer:

df.groupby('one').agg(lambda x: "|".join(x.tolist()))

Note that df.groupby('one') returns a DataFrameGroupBy object, and agg is defined on that type. When agg is given a function, it applies that function to each column of each group, and each column is passed in as a Series. This means that x in the lambda above is a Series.

Also note that the aggregation function does not have to be a lambda. If the aggregation is complex, it can be defined separately as a regular function, as below. The only constraint is that x must be a Series (or something compatible with it):

def myfun1(x):
    return "|".join(x.tolist())

and then:

df.groupby('one').agg(myfun1)
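
A named function also makes it easier to fold in the sorting and de-duplication from the question. The following is only a sketch: the name `join_sorted_unique` and the numeric column `num` are hypothetical, and the `map(str, ...)` step is borrowed from the asker's comment so that non-string columns can be joined as well.

```python
import pandas as pd


def join_sorted_unique(x: pd.Series) -> str:
    """Return the distinct values of x as a sorted, '|'-separated string.

    Values are converted with str() first, so sorting is lexicographic
    (string order), not numeric.
    """
    return "|".join(sorted(set(map(str, x.tolist()))))


df = pd.DataFrame(
    {
        "one": [1, 1, 2, 2, 3],
        "two": list("xyyzz"),
        "num": [7, 7, 3, 5, 9],  # hypothetical numeric column for illustration
    }
)

result = df.groupby("one").agg(join_sorted_unique)
print(result)
```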
qartal
1

The pandas documentation describes a better way to concatenate strings: Series.str.cat.
So I prefer this approach:

In [1]: df.groupby('one').agg(lambda x: x.str.cat(sep='|'))
Out[1]:
     two
one
1    x|y
2    y|z
3      z
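
One thing to keep in mind is that `str.cat` on its own joins values in their existing order and keeps duplicates. A rough sketch of combining it with `drop_duplicates` and `sort_values` to get the sorted-set behaviour the question asks for is shown here; the variable names `plain` and `deduped` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 1, 2, 2, 3], "three": list("eecba")})

# str.cat alone: duplicates and the original row order are preserved.
plain = df.groupby("one").agg(lambda x: x.str.cat(sep="|"))

# Drop duplicates and sort before concatenating.
deduped = df.groupby("one").agg(
    lambda x: x.drop_duplicates().sort_values().str.cat(sep="|")
)

print(plain)
print(deduped)
```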
Lahiru Karunaratne