1

I am wondering if anybody has a quick fix for a memory error that appears when doing the same thing as in the below example on larger data?

Example:

import pandas as pd
import numpy as np

nRows = 2
nCols = 3

df = pd.DataFrame(index=range(nRows ), columns=range(1))

df2 = df.apply(lambda row: [np.random.rand(nCols)], axis=1)

df3 = pd.concat(df2.apply(pd.DataFrame, columns=range(nCols)).tolist())

It is when creating df3 I get memory error.

The DF's in the example:

df
     0
0  NaN
1  NaN

df2
0    [[0.6704675101784022, 0.41730480236712697, 0.5...
1    [[0.14038693859523377, 0.1981014890848788, 0.8...
dtype: object

df3
          0         1         2
0  0.670468  0.417305  0.558690
0  0.140387  0.198101  0.800745
KJA
  • 313
  • 3
  • 14

2 Answers2

1

First I think working with lists in pandas is not good idea, if possible, you can avoid it.

So I believe you can simplify your code a lot:

nRows = 2
nCols = 3

np.random.seed(2019)
df3 = pd.DataFrame(np.random.rand(nRows, nCols))
print (df3)
          0         1         2
0  0.903482  0.393081  0.623970
1  0.637877  0.880499  0.299172
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • The problem is that I already have those lists. I created those lists based on the answer in the following link: https://stackoverflow.com/questions/58392974/pandas-apply-multiple-columns-per-row-instead-of-list If you know a way to avoid the creation of the list and still get multiple columns in the example in the link, please let me know! – KJA Oct 18 '19 at 08:41
0

Here's an example with a solution of the problem (note that in this example lists are not used in the columns, but arrays instead. This I cannot avoid, since my original problem comes with lists or array in a column).

import pandas as pd
import numpy as np
import time
np.random.seed(1)

nRows = 25000
nCols = 10000
numberOfChunks = 5

df = pd.DataFrame(index=range(nRows ), columns=range(1))

df2 = df.apply(lambda row: np.random.rand(nCols), axis=1)

for start, stop in zip(np.arange(0, nRows , int(round(nRows/float(numberOfChunks)))), 
                       np.arange(int(round(nRows/float(numberOfChunks))), nRows +  int(round(nRows/float(numberOfChunks))), int(round(nRows/float(numberOfChunks))))):
    df2tmp = df2.iloc[start:stop]
    if start == 0:
        df3 = pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16')
        continue
    df3tmp =  pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16')
    df3 = pd.concat([df3, df3tmp])
KJA
  • 313
  • 3
  • 14