
I have a large data-frame, about 160k rows by 24 columns. I also have a pandas Series of length 26 that I would like to append row-wise to every row of my data-frame, giving a final data-frame that is 160k rows by 50 columns, but my code is painfully slow.

Specifically, this is slow, but it works:

final = df.apply(lambda x: x.append(my_series), axis=1)

Which yields the correct final shape: Out[49]: (163008, 50)

Where df.shape is Out[48]: (163008, 24) and my_series.shape is Out[47]: (26,)

This method performs fine for smaller dataframes in the <50k rows range, but clearly it is not ideal.
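For reference, here is a minimal runnable sketch of the slow baseline on a small made-up frame (the column names and values below are invented for illustration, not my actual data):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 3), columns=['a', 'b', 'c'])
my_series = pd.Series([1, 2, 3], index=['x', 'y', 'z'])

# slow: apply builds one intermediate Series per row
final = df.apply(lambda x: x.append(my_series), axis=1)
print(final.shape)  # (1000, 6)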

Update: Added Benchmarks For the Solutions Below

Did a few tests using %timeit with a test dataframe and a test series of the following sizes:

test_df.shape

Out[18]: (156108, 24)

test_series.shape

Out[20]: (26,)

Where both the data-frame and the series contain a mix of strings, floats, integers, objects, etc.

Accepted Solution Using Numpy:

%timeit test_df.join(pd.DataFrame(np.tile(test_series.values, len(test_df.index)).reshape(-1, len(test_series)), index=test_df.index, columns=test_series.index))

10 loops, best of 3: 220 ms per loop

Using assign: I keep receiving ValueError: Length of values does not match length of index with my test series, though it works when I use the simpler series provided. Not sure what is going on here.
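My best guess (unconfirmed, since I have not narrowed down my test series) is that one of its values is itself list-like, and assign then treats that value as a full column that must match the frame's length. A toy reproduction of the same error, with made-up data:

df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
bad_series = pd.Series({'X': 10, 'Y': [1, 2]})  # hypothetical series where one value is list-like

# assign interprets the list-like value as a column of length 2,
# which does not match the 5-row index
df.assign(**bad_series)
# ValueError: Length of values does not match length of index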

Using Custom Function by @Divakar

%timeit rowwise_concat_df_series(test_df, test_series)

1 loop, best of 3: 424 ms per loop

guy
    I think you should add numpy in your tags to get a really good numpy person to review this question. – Scott Boston Jul 19 '17 at 12:36
  • What do you have in the dataframes? Is it all numbers (numerals)? – Divakar Jul 19 '17 at 12:59
  • @Divakar it's a mix of floats, integers (64bit and 8bit), strings, objects for the dataframe and the Series contains object dtypes – guy Jul 19 '17 at 13:11
  • So, did you get a chance to time the posted solutions on your actual dataset? – Divakar Jul 25 '17 at 11:58
  • @Divakar Apologies, I got too deep into optimizing other stuff I forgot about benchmarking this. I will add this to my to-do list and post the results today. Thanks for the reminder. – guy Jul 25 '17 at 12:07
  • @Divakar See my updated question for benchmarks – guy Jul 25 '17 at 13:04

3 Answers


We can use the DataFrame.assign() method:

Setup:

In [37]: df = pd.DataFrame(np.random.randn(5, 3), columns=['A','B','C'])

In [38]: my_series = pd.Series([10,11,12], index=['X','Y','Z'])

In [39]: df
Out[39]:
          A         B         C
0  1.129066  0.975453 -0.737507
1 -0.347736 -1.469583 -0.727113
2  1.158480  0.933604 -1.219617
3 -0.689830  3.063868  0.345233
4  0.184248  0.920349 -0.852213

In [40]: my_series
Out[40]:
X    10
Y    11
Z    12
dtype: int64

Solution:

In [41]: df = df.assign(**my_series)

Result:

In [42]: df
Out[42]:
          A         B         C   X   Y   Z
0  1.129066  0.975453 -0.737507  10  11  12
1 -0.347736 -1.469583 -0.727113  10  11  12
2  1.158480  0.933604 -1.219617  10  11  12
3 -0.689830  3.063868  0.345233  10  11  12
4  0.184248  0.920349 -0.852213  10  11  12

NOTE: the series must have string index elements, as they are passed to assign() as keyword-argument names.

PS: **my_series unpacks the series into keyword arguments, one column_name=value pair per index element.
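For clarity, a small sketch of what the unpacking expands to with the toy my_series above:

# this call
df = df.assign(**my_series)
# is equivalent to
df = df.assign(X=10, Y=11, Z=12)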

MaxU - stand with Ukraine

I think you need numpy.tile with numpy.ndarray.reshape to build a new DataFrame from the Series values, and then join it to the original:

df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})

print (df)
   A  B  C  D  E  F
0  a  4  7  1  5  a
1  b  5  8  3  3  a
2  c  4  9  5  6  a
3  d  5  4  7  9  b
4  e  5  2  1  2  b
5  f  4  3  0  4  b

s = pd.Series([1,5,6,7], index=list('abcd'))
print (s)
a    1
b    5
c    6
d    7
dtype: int64

df1 = pd.DataFrame(np.tile(s.values, len(df.index)).reshape(-1,len(s)), 
                   index=df.index, 
                   columns=s.index)
print (df1)
   a  b  c  d
0  1  5  6  7
1  1  5  6  7
2  1  5  6  7
3  1  5  6  7
4  1  5  6  7
5  1  5  6  7

df = df.join(df1)
print (df)
   A  B  C  D  E  F  a  b  c  d
0  a  4  7  1  5  a  1  5  6  7
1  b  5  8  3  3  a  1  5  6  7
2  c  4  9  5  6  a  1  5  6  7
3  d  5  4  7  9  b  1  5  6  7
4  e  5  2  1  2  b  1  5  6  7
5  f  4  3  0  4  b  1  5  6  7
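Applied to the frames from the question (assuming df and my_series are defined as described there), it is a one-liner:

final = df.join(pd.DataFrame(np.tile(my_series.values, len(df.index)).reshape(-1, len(my_series)),
                             index=df.index,
                             columns=my_series.index))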
jezrael

See if this works -

def rowwise_concat_df_series(df, s):
    a = df.values   # (m, n) array of DataFrame values
    b = s.values    # (r,) array of Series values

    m, n, r = a.shape + (b.size,)
    out_dtype = np.result_type(a.dtype, b.dtype)  # common dtype that can hold both
    out_arr = np.empty((m, n + r), dtype=out_dtype)
    out_arr[:, :n] = a   # copy the original columns
    out_arr[:, n:] = b   # broadcast the series across every row
    df_out = pd.DataFrame(out_arr)
    return df_out

Sample run -

In [284]: df
Out[284]: 
   0  1
0  4  1
1  0  1
2  8  2
3  1  8
4  3  3

In [285]: s
Out[285]: 
0    5
1    8
2    2
dtype: int64

In [286]: rowwise_concat_df_series(df, s)
Out[286]: 
   0  1  2  3  4
0  4  1  5  8  2
1  0  1  5  8  2
2  8  2  5  8  2
3  1  8  5  8  2
4  3  3  5  8  2
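Note that the output frame gets default integer labels. If the original column names, series index and row index should be preserved (an addition on my part, not required by the question), build the final DataFrame like this instead of plain pd.DataFrame(out_arr):

df_out = pd.DataFrame(out_arr,
                      index=df.index,
                      columns=list(df.columns) + list(s.index))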
Divakar