
I have some rows in df: 7 days of data with a few (maybe 3-5) features per day, and I want to merge each feature's 7-day values into a single list.

Right now I loop over the unique column names and apply a list function to each group, but it is not efficient.

If you load df directly from a file, pandas automatically adds a numeric suffix to duplicate column names, but concat does not:
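For example, a minimal illustration of that difference (the inline CSV here is made up for the demo):

import io
import pandas as pd

# read_csv renames duplicate headers with a numeric suffix
print(pd.read_csv(io.StringIO("a,a\n1,2\n")).columns.tolist())  # ['a', 'a.1']

# concat keeps the duplicate names as-is
part = pd.DataFrame({"a": [1]})
print(pd.concat([part, part], axis=1).columns.tolist())  # ['a', 'a']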

import pandas as pd

df1 = pd.DataFrame({"userId":["u1", "u2", "u3", "u4"], "a":[1,2,3,4], "b":[2,3,4,5], "c":[3,4,5,6], "d":[4,5,6,7]}).set_index('userId')
df2 = pd.DataFrame({"userId":["u1", "u2", "u3", "u4"], "a":[4,0,1,1], "b":[2,4,4,5], "c":[3,6,5,6], "d":[4,5,6,9]}).set_index('userId')
df3 = pd.DataFrame({"userId":["u1", "u2", "u3", "u4"], "a":[1,2,5,4], "b":[2,1,4,5], "c":[3,2,5,6], "d":[4,3,4,7]}).set_index('userId')

# concat keeps the duplicate column names (a, b, c, d repeated per day)
df = pd.concat([df1, df2, df3], axis=1, sort=False)

# current approach: loop over the unique column names and collect each
# row's duplicate-column values into a list
df_new = pd.DataFrame()
for column_name in df.columns.unique():
    df_new[column_name] = df[column_name].apply(lambda x: x.tolist(), axis=1)
print(df_new)
                a          b          c          d
userId
u1      [1, 4, 1]  [2, 2, 2]  [3, 3, 3]  [4, 4, 4]
u2      [2, 0, 2]  [3, 4, 1]  [4, 6, 2]  [5, 5, 3]
u3      [3, 1, 5]  [4, 4, 4]  [5, 5, 5]  [6, 6, 4]
u4      [4, 1, 4]  [5, 5, 5]  [6, 6, 6]  [7, 9, 7]

My only change so far was to apply it through a for loop, and I wanted to find something more efficient, maybe groupby, eval, applymap, or something else.

Johnny

1 Answer


Use GroupBy.agg, grouping by column names along axis=1:

# group duplicate columns by name and collect each row's values into a list
df1 = df.groupby(level=0, axis=1).agg(lambda x: x.tolist())
print(df1)
                a          b          c          d
userId                                            
u1      [1, 4, 1]  [2, 2, 2]  [3, 3, 3]  [4, 4, 4]
u2      [2, 0, 2]  [3, 4, 1]  [4, 6, 2]  [5, 5, 3]
u3      [3, 1, 5]  [4, 4, 4]  [5, 5, 5]  [6, 6, 4]
u4      [4, 1, 4]  [5, 5, 5]  [6, 6, 6]  [7, 9, 7]
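On newer pandas versions, where the axis=1 argument to groupby is deprecated, a possible equivalent (my own sketch, not part of the original answer) is to transpose, group by the row index, and transpose back:

# transpose so the duplicate column names become the row index,
# group on that index, then transpose back
df1 = df.T.groupby(level=0).agg(lambda x: x.tolist()).T
print(df1)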
jezrael
  • Very useful. Thank you very much. This question has been bothering me for a long time! – Johnny Jun 05 '19 at 09:38
  • I have about 500x35 columns of test data, and this line of code takes about 2.6s to execute; the actual environment has about 10000x35 columns of data, with a maximum of 100000x35 columns. I think that is still a little slow; is there anything with better performance? – Johnny Jun 05 '19 at 10:01
  • @Johnny - The problem is that lists in pandas are not a [good idea](https://stackoverflow.com/a/52563718/2901002), so it is better to create a 2d numpy array if possible to improve performance. – jezrael Jun 05 '19 at 10:06
  • @Johnny - Are the lists necessary? – jezrael Jun 05 '19 at 10:09
  • Not necessary; however, to_dict('records') is used later to generate the final result. No problem at all, actually, but I'm trying to make my code more efficient. – Johnny Jun 05 '19 at 10:11
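Following up on the 2d numpy array suggestion above, a minimal sketch (my own illustration, assuming every day-frame shares the same column order a, b, c, d as in the question, and using df from the question):

# after the concat, columns repeat as a, b, c, d once per day, so the
# values can be reshaped to (n_users, n_days, n_features) with no lists
n_days, features = 3, ["a", "b", "c", "d"]
arr = df.to_numpy().reshape(len(df), n_days, len(features))
# arr[u, d, f] is feature f on day d for user u, e.g. all 'a' values:
print(arr[:, :, 0])  # [[1 4 1] [2 0 2] [3 1 5] [4 1 4]]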