
I am having trouble making pandas return multiple columns when using apply.

Example:

import pandas as pd
import numpy as np
np.random.seed(1)

df = pd.DataFrame(index=range(2), columns=['a', 'b'])
df.loc[0] = [np.array((1,2,3))], 1
df.loc[1] = [np.array((4,5,6))], 1
df

             a  b
0  [[1, 2, 3]]  1
1  [[4, 5, 6]]  1

df2 = np.random.randint(1,9, size=(3,2))
df2

array([[4, 6],
       [8, 1],
       [1, 2]])

def example(x):
    return np.transpose(df2) @ x[0]

df3 = df['a'].apply(example)
df3

0    [23, 14]
1    [62, 41]

I want df3 to have two columns, with one element per column in each row, rather than a single column holding both elements in each row.

So I want something like

df3Wanted
         col1  col2
    0    23    14
    1    62    41

Does anybody know how to fix this?

KJA

2 Answers


A couple of changes are required to achieve this.

Update the function as below:

def example(x):
    return [np.transpose(df2) @ x[0]]

and perform the following operation on df3:

wantedDF3 = pd.concat(df3.apply(pd.DataFrame, columns=['col1','col2']).tolist())

print(wantedDF3) then gives the desired output:

   col1  col2
0    23    14
0    62    41
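
If you also want the 0, 1 row index shown in df3Wanted instead of the repeated 0 that the concat produces, passing ignore_index=True should do it (an optional tweak, assuming the same df3 as above):

wantedDF3 = pd.concat(df3.apply(pd.DataFrame, columns=['col1', 'col2']).tolist(),
                      ignore_index=True)  # renumber the rows 0, 1, ...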

Edit: Here is another way to do the same thing, which avoids the memory error issues. Keep your example function and df3 as they are (the same as in the question). Then, on top of that, use the code below to generate wantedDF3:

col1df = pd.DataFrame(df3.apply(lambda x: x[0]).values, columns=['col1'])
col2df = pd.DataFrame(df3.apply(lambda x: x[1]).values,  columns=['col2'])
wantedDF3 = col1df.join(col2df)
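
For more than two output columns, the same per-column idea can be written as a loop. This is only a sketch on top of the approach above, assuming every element of df3 has the same length; the colN names are made up:

# number of elements per row of df3 (2 in this example)
nOut = len(df3.iloc[0])
# build one single-column DataFrame per output element; i=i binds the
# current column index into the lambda
colDFs = [pd.DataFrame(df3.apply(lambda x, i=i: x[i]).values, columns=['col%d' % (i + 1)])
          for i in range(nOut)]
wantedDF3 = pd.concat(colDFs, axis=1)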
Parth
  • When applying the above to my real data, I get a "memory error" when running the line corresponding to wantedDF3 = pd.concat(df3.apply(pd.DataFrame, columns=['col1','col2']).tolist()). Any tips on how to easily avoid this issue? – KJA Oct 16 '19 at 13:33
  • There are multiple things to check: the size of your real data and the RAM of your machine. Also, please post the full error stack trace. – Parth Oct 17 '19 at 11:57
  • Is there perhaps a way to create df3 without creating lists? Another post suggests that the lists are what cause the memory error in this case. The post where the list type is suggested as the problem source is https://stackoverflow.com/questions/58444745/pandas-memory-error-when-using-apply-to-split-single-column-array-into-columns – KJA Oct 18 '19 at 08:55
  • Okay, but your `df` also holds a `list` in the first place. That could also be the reason for the memory error. – Parth Oct 18 '19 at 09:43
  • @Parth: Your suggestion works. However, sadly it seems to be very slow in my case: I have 25,000 rows and 15,000 columns. – KJA Oct 18 '19 at 13:09
  • The following line, in combination with my original creation of df3, seems to be faster than the previous alternatives: wantedDF3 = pd.DataFrame(df3.tolist(), index=df3.index) – KJA Oct 18 '19 at 17:55
  • My last suggestion also gives a memory error for large dataframes. I think I have found a solution to the problem by 1) splitting the operations into chunks, 2) changing the dtype of each resulting chunk DF to float16, and 3) concatenating the chunk DFs into one DF. – KJA Oct 19 '19 at 04:48

This is an answer to the comments on the first answer and concerns the memory error issue. The following example uses data that gives a memory error on my computer with all of the methods suggested so far (the first answer and its comments), but it works with the code below:

import pandas as pd
import numpy as np
np.random.seed(1)

nRows = 25000
nCols = 10000
numberOfChunks = 5

# Dummy data: a Series whose elements are arrays of nCols random numbers
df = pd.DataFrame(index=range(nRows), columns=range(1))
df2 = df.apply(lambda row: np.random.rand(nCols), axis=1)

# Expand the arrays into columns one chunk at a time, casting each chunk
# to float16 before concatenating it, to keep memory usage down
chunkSize = int(round(nRows / float(numberOfChunks)))
for start in range(0, nRows, chunkSize):
    df2tmp = df2.iloc[start:start + chunkSize]
    df3tmp = pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16')
    if start == 0:
        df3 = df3tmp
    else:
        df3 = pd.concat([df3, df3tmp])
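
As a side note, the repeated pd.concat inside the loop can become slow as df3 grows. A variant that collects the chunks in a list and concatenates once at the end might look like this (just a sketch, reusing df2 and numberOfChunks from above):

# np.array_split gives numberOfChunks roughly equal blocks of row positions
chunks = []
for idx in np.array_split(np.arange(len(df2)), numberOfChunks):
    part = df2.iloc[idx]
    chunks.append(pd.DataFrame(part.tolist(), index=part.index).astype('float16'))
df3 = pd.concat(chunks)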
KJA