
I have some rows in df: 7 days of data with a few (maybe 3-5) features per day, and I want to merge each feature's 7-day values into a single list.

Right now I loop over the unique column names and apply a list function to each group, but it is not efficient.

If you load df directly from a file, pandas automatically adds a numeric suffix to duplicate column names, but concat does not:
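For example, a minimal illustration of that difference (the inline CSV here is made up for the demo):

import io
import pandas as pd

# read_csv renames duplicate headers with a numeric suffix
print(pd.read_csv(io.StringIO("a,a\n1,2\n")).columns.tolist())  # ['a', 'a.1']

# concat keeps the duplicate names as-is
part = pd.DataFrame({"a": [1]})
print(pd.concat([part, part], axis=1).columns.tolist())  # ['a', 'a']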

import pandas as pd

df1 = pd.DataFrame({"userId":["u1", "u2", "u3", "u4"], "a":[1,2,3,4], "b":[2,3,4,5], "c":[3,4,5,6], "d":[4,5,6,7]}).set_index('userId')
df2 = pd.DataFrame({"userId":["u1", "u2", "u3", "u4"], "a":[4,0,1,1], "b":[2,4,4,5], "c":[3,6,5,6], "d":[4,5,6,9]}).set_index('userId')
df3 = pd.DataFrame({"userId":["u1", "u2", "u3", "u4"], "a":[1,2,5,4], "b":[2,1,4,5], "c":[3,2,5,6], "d":[4,3,4,7]}).set_index('userId')

# concat keeps the duplicate column names (a, b, c, d repeated per day)
df = pd.concat([df1, df2, df3], axis=1, sort=False)

# current approach: loop over the unique column names and collect each
# row's duplicate-column values into a list
df_new = pd.DataFrame()
for column_name in df.columns.unique():
    df_new[column_name] = df[column_name].apply(lambda x: x.tolist(), axis=1)
print(df_new)
                a          b          c          d
userId
u1      [1, 4, 1]  [2, 2, 2]  [3, 3, 3]  [4, 4, 4]
u2      [2, 0, 2]  [3, 4, 1]  [4, 6, 2]  [5, 5, 3]
u3      [3, 1, 5]  [4, 4, 4]  [5, 5, 5]  [6, 6, 4]
u4      [4, 1, 4]  [5, 5, 5]  [6, 6, 6]  [7, 9, 7]

My only change so far was to apply it through a for loop, and I wanted to find something more efficient, maybe groupby, eval, applymap, or something else.

Johnny

1 Answer


Use GroupBy.agg, grouping by column names along axis=1:

# group duplicate columns by name and collect each row's values into a list
df1 = df.groupby(level=0, axis=1).agg(lambda x: x.tolist())
print(df1)
                a          b          c          d
userId                                            
u1      [1, 4, 1]  [2, 2, 2]  [3, 3, 3]  [4, 4, 4]
u2      [2, 0, 2]  [3, 4, 1]  [4, 6, 2]  [5, 5, 3]
u3      [3, 1, 5]  [4, 4, 4]  [5, 5, 5]  [6, 6, 4]
u4      [4, 1, 4]  [5, 5, 5]  [6, 6, 6]  [7, 9, 7]
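On newer pandas versions, where the axis=1 argument to groupby is deprecated, a possible equivalent (my own sketch, not part of the original answer) is to transpose, group by the row index, and transpose back:

# transpose so the duplicate column names become the row index,
# group on that index, then transpose back
df1 = df.T.groupby(level=0).agg(lambda x: x.tolist()).T
print(df1)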
jezrael
  • Very useful. Thank you very much. This question has been bothering me for a long time! – Johnny Jun 05 '19 at 09:38
  • I have about 500x35 columns of test data, and this line of code takes about 2.6s to execute; the actual environment has about 10000x35 columns of data, with a maximum of 100000x35 columns. I think that is still a little slow; is there anything with better performance? – Johnny Jun 05 '19 at 10:01
  • @Johnny - The problem is that lists in pandas are not a [good idea](https://stackoverflow.com/a/52563718/2901002), so it is better to create a 2d numpy array if possible to improve performance. – jezrael Jun 05 '19 at 10:06
  • @Johnny - Are the lists necessary? – jezrael Jun 05 '19 at 10:09
  • Not necessary; however, to_dict('records') is used later to generate the final result. No problem at all, actually, but I'm trying to make my code more efficient. – Johnny Jun 05 '19 at 10:11
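Following up on the 2d numpy array suggestion above, a minimal sketch (my own illustration, assuming every day-frame shares the same column order a, b, c, d as in the question, and using df from the question):

# after the concat, columns repeat as a, b, c, d once per day, so the
# values can be reshaped to (n_users, n_days, n_features) with no lists
n_days, features = 3, ["a", "b", "c", "d"]
arr = df.to_numpy().reshape(len(df), n_days, len(features))
# arr[u, d, f] is feature f on day d for user u, e.g. all 'a' values:
print(arr[:, :, 0])  # [[1 4 1] [2 0 2] [3 1 5] [4 1 4]]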