
I want to create an empty DataFrame to which I will append single-row DataFrames as new data arrives. I am trying to use pandas' "setting with enlargement" for efficient appending.

import numpy as np
import pandas as pd
from datetime import datetime
from pandas import DataFrame

df = DataFrame(columns=["open","high","low","close","volume","open_interest"])

row_one = DataFrame({"open":10,"high":11,"low":9,"close":10,"volume":100,"open_interest":np.NAN}, index = [datetime(2017,1,1)])
row_two = DataFrame({"open":9,"high":12,"low":8,"close":10.50,"volume":500,"open_interest":np.NAN}, index = [datetime(2017,1,2)])

Now, when I try to append the new row following the setting with enlargement rules:

df[row_one.index] = row_one.columns

I get this error:

"DatetimeIndex(['2017-01-01'], dtype='datetime64[ns]', freq=None) not in index"

I thought the row should be automatically added because it is not in the DataFrame. What am I doing wrong?

nico9T
  • You need to use `.loc`: `df.loc[row_one.index]`. – IanS Jul 26 '17 at 08:42
  • Are you trying to add a column or a row? – ayhan Jul 26 '17 at 08:43
  • I am trying to add rows that will always have the same columns but a different datetime as their index – nico9T Jul 26 '17 at 08:44
  • 1
    You'll be better off using `append`, or better yet (if you have all rows available at once) `concat`. – IanS Jul 26 '17 at 08:44
  • Best of all would be to avoid re-sizing a dataframe at all. If you are accumulating data in an iterative fashion then dictionaries are much more efficient (speed-wise). When you change the length of a dataframe, I believe it has to re-allocate the entire dataframe in memory (someone correct me if I'm wrong). – Bill Oct 23 '21 at 19:21
  • This [answer to a similar question](https://stackoverflow.com/a/47979665/1609514) has some timing results for the various methods. – Bill Oct 24 '21 at 00:46

2 Answers


You need `loc` for setting with enlargement. Select the index value with `[0]` to get a scalar, and 'convert' `row_one` to a Series by selecting its first row with `iloc`:

df.loc[row_one.index[0]] = row_one.iloc[0]
print(df)
            open  high  low  close  volume  open_interest
2017-01-01  10.0  11.0  9.0   10.0   100.0            NaN
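The same pattern appends each later row in place, for example with the second row from the question:

df.loc[row_two.index[0]] = row_two.iloc[0]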

But it is better to use `concat`, especially with multiple DataFrames:

df = pd.concat([row_one, row_two])
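
If the single-row DataFrames arrive one at a time, a common compromise is to collect them in a plain Python list and call `concat` once at the end (a sketch; here the list is filled by hand, but in practice it would be appended to inside the event loop):

rows = []             # collect the single-row DataFrames here
rows.append(row_one)
rows.append(row_two)

df = pd.concat(rows)  # build the full DataFrame in one call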
jezrael
  • Is `concat` more efficient than setting with enlargement? – nico9T Jul 26 '17 at 08:50
  • If you have multiple dataframes and need one, `concat` is better. – jezrael Jul 26 '17 at 08:51
  • I get new data from an event so I only have to add a one-row dataframe each time. Using timeit I get 1000 loops, best of 3: 511 µs per loop for setting with enlargement using loc, and 1000 loops, best of 3: 1.29 ms per loop for concat, so it seems that setting using loc is actually faster than pd.concat. – nico9T Jul 26 '17 at 09:26
  • `concat` copies the data to a new dataframe whereas setting with enlargement appends to the current dataframe. This is why you see that setting with enlargement is more efficient. – MoustafaAAtta Jul 31 '17 at 08:36
  • @MoustafaAAtta - Thank you for comment, I think you are right. – jezrael Jul 31 '17 at 08:38
  • Please note that setting with enlargement still has to copy some data under the hood. Hence, I would recommend preallocating a dataframe or buffering the rows to be appended whenever possible. I don't know how setting with enlargement works exactly, but my reference was this issue: https://github.com/pandas-dev/pandas/issues/10692 – MoustafaAAtta Jul 31 '17 at 09:14
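
For reference, nico9T's timing comparison from the comments can be reproduced along these lines (a minimal sketch reusing the question's imports; the helper function names are illustrative and the absolute numbers will depend on the pandas version and hardware):

import timeit

cols = ["open", "high", "low", "close", "volume", "open_interest"]
row = DataFrame({"open": 10, "high": 11, "low": 9, "close": 10,
                 "volume": 100, "open_interest": np.nan},
                index=[datetime(2017, 1, 1)])

def with_enlargement():
    # one setting-with-enlargement assignment on an empty frame
    df = DataFrame(columns=cols)
    df.loc[row.index[0]] = row.iloc[0]
    return df

def with_concat():
    # one concat of the empty frame with the one-row frame
    df = DataFrame(columns=cols)
    return pd.concat([df, row])

print(timeit.timeit(with_enlargement, number=1000))
print(timeit.timeit(with_concat, number=1000))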

Since you say in the comments

I get new data from an event so I only have to add a one row dataframe each time.

I think you are much better off pre-allocating memory in blocks or using a buffering system (as pointed out by @MoustafaAAtta in the comments).

Do you need the full, updated dataframe each iteration?

If not, do this:

new_row_data = {'open': 10.0,
 'high': 11.0,
 'low': 9.0,
 'close': 10.0,
 'volume': 100.0,
 'open_interest': np.nan}
new_row_index = pd.Timestamp('2017-01-01 00:00:00')

index = []
records = []
for _ in range(500):
    index.append(new_row_index)
    records.append(new_row_data)  # add new data here

# Create dataframe at the end
df = pd.DataFrame.from_records(records, index=index)

(Code above takes about 2.4 ms).
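
In practice each event would supply its own timestamp and values rather than the same row 500 times; the loop looks the same, for example (a sketch using the two rows from the question as stand-ins for real events):

events = [
    (pd.Timestamp('2017-01-01'), {'open': 10.0, 'high': 11.0, 'low': 9.0,
                                  'close': 10.0, 'volume': 100.0,
                                  'open_interest': np.nan}),
    (pd.Timestamp('2017-01-02'), {'open': 9.0, 'high': 12.0, 'low': 8.0,
                                  'close': 10.5, 'volume': 500.0,
                                  'open_interest': np.nan}),
]

index = []
records = []
for ts, data in events:   # in real code this loop is driven by the events
    index.append(ts)
    records.append(data)

df = pd.DataFrame.from_records(records, index=index)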

If you need the dataframe each iteration:

buffer_size = 100  # adjust to your needs
data_columns = ["open","high","low","close","volume","open_interest"]
all_columns = ['DateTime'] + data_columns  # Add column for datetimes
df_empty = pd.DataFrame(None, index=range(buffer_size),
                        columns=all_columns)
# Note: you might want to specify column dtypes above rather than leaving everything as NaN

df = df_empty.copy()
index = 0
for _ in range(500):
    df.loc[index, 'DateTime'] = new_row_index
    df.loc[index, data_columns] = [new_row_data[col] for col in data_columns]  # add new data here
    # Updated dataframe if you need it:
    #print(df.loc[:index])

    index += 1
    while index >= len(df):
        df = pd.concat([df, df_empty.reindex(range(index, index + buffer_size))])

# To remove the integer index use:
df = df.loc[:index-1].set_index('DateTime', drop=True)

(Code above takes about 540 ms).

You will find both of these are much faster overall than using concat or append each iteration (not what DataFrames were designed for).
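
If you want to check this on your own data, one way is to time the per-iteration concat baseline and compare it with the timings quoted above (a sketch; it reuses new_row_data, new_row_index and data_columns from the code above, and the 500-row count matches the examples):

import time

def concat_each_iteration():
    # grow the DataFrame one row at a time with concat (the slow approach)
    df = pd.DataFrame(columns=data_columns)
    for _ in range(500):
        row = pd.DataFrame(new_row_data, index=[new_row_index])
        df = pd.concat([df, row])
    return df

start = time.perf_counter()
concat_each_iteration()
print(f"concat per iteration: {time.perf_counter() - start:.3f} s")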

Bill