How to improve the speed of concat in pandas

Question

I want to expand my dataframe by repeating its rows in a regular pattern.

import pandas as pd 
import numpy as np 
def expandData(data, timeStep=2, sampleLen= 5):
    dataEp = pd.DataFrame()
    for epoch in range(int(len(data)/sampleLen)):
        dataSample = data.iloc[epoch*sampleLen:(epoch+1)*sampleLen, :]
        for num in range(int(sampleLen-timeStep +1)):
            tempDf = dataSample.iloc[num:timeStep+num,:]
            dataEp = pd.concat([dataEp, tempDf],axis= 0)
    return dataEp

df = pd.DataFrame({'a':list(np.arange(5))+list(np.arange(15,20)),
'other':list(np.arange(100,110))})
dfEp = expandData(df, 3, 5)

Output:

df
     a  other
0   0    100
1   1    101
2   2    102
3   3    103
4   4    104
5  15    105
6  16    106
7  17    107
8  18    108
9  19    109

dfEp
     a  other
0   0    100
1   1    101
2   2    102
1   1    101
2   2    102
3   3    103
2   2    102
3   3    103
4   4    104
5  15    105
6  16    106
7  17    107
6  16    106
7  17    107
8  18    108
7  17    107
8  18    108
9  19    109

Expected:

I am looking for a faster way to achieve this. If the dataframe is large, e.g. 40 thousand rows, my code runs for about 20 minutes.

Edit:

Actually, I want to repeat small sequences of length timeStep. I have also changed expandData(df, 2, 5) to expandData(df, 3, 5).

Indices 0, 4, 15, 19 are not repeated; can you show the constraint you have incorporated? — Naga kiran, Oct 28 '18 at 08:03
It seems you are trying to split continuous intervals into stepwise intervals. Is the step in `a` always 1? And are you sure you need this? Sounds like an XY problem. What is the next calculation you want to perform on each of these newly defined intervals? — Mr. T, Oct 28 '18 at 08:16
I use it for an `RNN` model. You can regard the numbers from `0` to `4` as one sample and the numbers from `15` to `19` as another sample. Sorry for the confusion. — rosefun, Oct 28 '18 at 09:32

Mr. T · Answer 1 · 2018-10-28T09:04:54.310

If your `a` values are evenly spaced, you can test for breaks in the series and then replicate the rows that lie inside each consecutive run, according to this answer:

df = pd.DataFrame({'a':list(np.arange(5))+list(np.arange(15,20)),
'other':list(np.arange(100,110))})
#equally spaced rows have value zero, start/stop rows not
df["start/stop"] = df.a.diff().shift(-1) - df.a.diff()
#repeat rows with value zero in the new column
repeat = [2 if val == 0 else 1 for val in df["start/stop"]]
df = df.loc[np.repeat(df.index.values, repeat)]
print(df)

Sample output:

    a  other  start/stop
0   0    100         NaN
1   1    101         0.0
1   1    101         0.0
2   2    102         0.0
2   2    102         0.0
3   3    103         0.0
3   3    103         0.0
4   4    104        10.0
5  15    105       -10.0
6  16    106         0.0
6  16    106         0.0
7  17    107         0.0
7  17    107         0.0
8  18    108         0.0
8  18    108         0.0
9  19    109         NaN

If it is just about the epoch length (you do not clearly specify the rules), then it is even simpler:

df = pd.DataFrame({'a':list(np.arange(5))+list(np.arange(15,20)),
'other':list(np.arange(100,110))})

sampleLen = 5
repeat = np.repeat([2], sampleLen)
repeat[0] = repeat[-1] = 1
repeat = np.tile(repeat, len(df)//sampleLen)

df = df.loc[np.repeat(df.index.values, repeat)]

Hi, thanks! This may not give what I expect when `timeStep` is not equal to `2`. — rosefun, Oct 28 '18 at 09:25
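
For a general `timeStep`, one option is to compute all row positions up front and index the dataframe once. This avoids calling pd.concat inside the loop, where every call copies everything accumulated so far. The following is only a sketch under the assumption stated in the question, namely that samples are consecutive blocks of sampleLen rows; the name expand_data and its snake_case parameters are hypothetical and not part of the original code.

```python
import numpy as np
import pandas as pd


def expand_data(data, time_step=2, sample_len=5):
    """Return every sliding window of length time_step inside each sample.

    Assumes samples are consecutive blocks of sample_len rows; trailing rows
    that do not fill a whole sample are dropped, as in the original loop.
    """
    n_samples = len(data) // sample_len
    n_windows = sample_len - time_step + 1
    # Row offsets of each window inside one sample, shape (n_windows, time_step)
    window = np.arange(n_windows)[:, None] + np.arange(time_step)
    # First row position of each sample, shape (n_samples, 1, 1)
    starts = np.arange(n_samples)[:, None, None] * sample_len
    # Broadcast to (n_samples, n_windows, time_step) and flatten in loop order
    positions = (starts + window).ravel()
    # A single positional lookup replaces the repeated pd.concat calls
    return data.iloc[positions]


df = pd.DataFrame({'a': list(np.arange(5)) + list(np.arange(15, 20)),
                   'other': list(np.arange(100, 110))})
df_ep = expand_data(df, 3, 5)  # same rows and order as dfEp in the question
print(df_ep)
```

If the loop structure has to stay, collecting the slices in a list and calling pd.concat once after the loop should also remove most of the cost.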
