
I want to create an empty DataFrame to which I will append single-row DataFrames as new data arrives. I am trying to use pandas' "setting with enlargement" for efficient appending.

import numpy as np
import pandas as pd
from datetime import datetime
from pandas import DataFrame

df = DataFrame(columns=["open","high","low","close","volume","open_interest"])

row_one = DataFrame({"open":10,"high":11,"low":9,"close":10,"volume":100,"open_interest":np.NAN}, index = [datetime(2017,1,1)])
row_two = DataFrame({"open":9,"high":12,"low":8,"close":10.50,"volume":500,"open_interest":np.NAN}, index = [datetime(2017,1,2)])

Now, when I try to append the new row following the setting with enlargement rules:

df[row_one.index] = row_one.columns

I get this error:

"DatetimeIndex(['2017-01-01'], dtype='datetime64[ns]', freq=None) not in index"

I thought the row should be automatically added because it is not in the DataFrame. What am I doing wrong?

nico9T
  • You need to use `.loc`: `df.loc[row_one.index]`. – IanS Jul 26 '17 at 08:42
  • Are you trying to add a column or a row? – ayhan Jul 26 '17 at 08:43
  • I am trying to add rows that will always have the same columns but a different datetime as their index – nico9T Jul 26 '17 at 08:44
  • 1
    You'll be better off using `append`, or better yet (if you have all rows available at once) `concat`. – IanS Jul 26 '17 at 08:44
  • Best of all would be to avoid re-sizing a dataframe at all. If you are accumulating data in an iterative fashion then dictionaries are much more efficient (speed-wise). When you change the length of a dataframe, I believe it has to re-allocate the entire dataframe in memory (someone correct me if I'm wrong). – Bill Oct 23 '21 at 19:21
  • This [answer to a similar question](https://stackoverflow.com/a/47979665/1609514) has some timing results for the various methods. – Bill Oct 24 '21 at 00:46

2 Answers


You need `loc` for setting with enlargement. Select the index value with `[0]` to get a scalar, and 'convert' `row_one` to a Series by selecting its first row with `iloc`:

df.loc[row_one.index[0]] = row_one.iloc[0]
print(df)
            open  high  low  close  volume  open_interest
2017-01-01  10.0  11.0  9.0   10.0   100.0            NaN
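The same pattern appends each later row in place, for example with the second row from the question:

df.loc[row_two.index[0]] = row_two.iloc[0]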

But it is better to use `concat`, especially with multiple DataFrames:

df = pd.concat([row_one, row_two])
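
If the single-row DataFrames arrive one at a time, a common compromise is to collect them in a plain Python list and call `concat` once at the end (a sketch; here the list is filled by hand, but in practice it would be appended to inside the event loop):

rows = []             # collect the single-row DataFrames here
rows.append(row_one)
rows.append(row_two)

df = pd.concat(rows)  # build the full DataFrame in one call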
jezrael
  • Is `concat` more efficient than setting with enlargement? – nico9T Jul 26 '17 at 08:50
  • If you have multiple dataframes and need one, `concat` is better. – jezrael Jul 26 '17 at 08:51
  • I get new data from an event so I only have to add a one-row dataframe each time. Using timeit I get 1000 loops, best of 3: 511 µs per loop for setting with enlargement using loc, and 1000 loops, best of 3: 1.29 ms per loop for concat, so it seems that setting using loc is actually faster than pd.concat. – nico9T Jul 26 '17 at 09:26
  • `concat` copies the data to a new dataframe whereas setting with enlargement appends to the current dataframe. This is why you see that setting with enlargement is more efficient. – MoustafaAAtta Jul 31 '17 at 08:36
  • @MoustafaAAtta - Thank you for comment, I think you are right. – jezrael Jul 31 '17 at 08:38
  • Please note that setting with enlargement still has to copy some data under the hood. Hence, I would recommend preallocating a dataframe or buffering the rows to be appended whenever possible. I don't know how setting with enlargement works exactly, but my reference was this issue: https://github.com/pandas-dev/pandas/issues/10692 – MoustafaAAtta Jul 31 '17 at 09:14
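
For reference, nico9T's timing comparison from the comments can be reproduced along these lines (a minimal sketch reusing the question's imports; the helper function names are illustrative and the absolute numbers will depend on the pandas version and hardware):

import timeit

cols = ["open", "high", "low", "close", "volume", "open_interest"]
row = DataFrame({"open": 10, "high": 11, "low": 9, "close": 10,
                 "volume": 100, "open_interest": np.nan},
                index=[datetime(2017, 1, 1)])

def with_enlargement():
    # one setting-with-enlargement assignment on an empty frame
    df = DataFrame(columns=cols)
    df.loc[row.index[0]] = row.iloc[0]
    return df

def with_concat():
    # one concat of the empty frame with the one-row frame
    df = DataFrame(columns=cols)
    return pd.concat([df, row])

print(timeit.timeit(with_enlargement, number=1000))
print(timeit.timeit(with_concat, number=1000))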

Since you say in the comments

I get new data from an event so I only have to add a one row dataframe each time.

I think you are much better off pre-allocating memory in blocks or using a buffering system (as pointed out by @MoustafaAAtta in the comments).

Do you need the full, updated dataframe each iteration?

If not, do this:

new_row_data = {'open': 10.0,
 'high': 11.0,
 'low': 9.0,
 'close': 10.0,
 'volume': 100.0,
 'open_interest': np.nan}
new_row_index = pd.Timestamp('2017-01-01 00:00:00')

index = []
records = []
for _ in range(500):
    index.append(new_row_index)
    records.append(new_row_data)  # add new data here

# Create dataframe at the end
df = pd.DataFrame.from_records(records, index=index)

(Code above takes about 2.4 ms).
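
In practice each event would supply its own timestamp and values rather than the same row 500 times; the loop looks the same, for example (a sketch using the two rows from the question as stand-ins for real events):

events = [
    (pd.Timestamp('2017-01-01'), {'open': 10.0, 'high': 11.0, 'low': 9.0,
                                  'close': 10.0, 'volume': 100.0,
                                  'open_interest': np.nan}),
    (pd.Timestamp('2017-01-02'), {'open': 9.0, 'high': 12.0, 'low': 8.0,
                                  'close': 10.5, 'volume': 500.0,
                                  'open_interest': np.nan}),
]

index = []
records = []
for ts, data in events:   # in real code this loop is driven by the events
    index.append(ts)
    records.append(data)

df = pd.DataFrame.from_records(records, index=index)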

If you need the dataframe each iteration:

buffer_size = 100  # adjust to your needs
data_columns = ["open","high","low","close","volume","open_interest"]
all_columns = ['DateTime'] + data_columns  # Add column for datetimes
df_empty = pd.DataFrame(None, index=range(buffer_size),
                        columns=all_columns)
# Note: you might want to specify column dtypes above rather than leaving everything as NaN

df = df_empty.copy()
index = 0
for _ in range(500):
    df.loc[index, 'DateTime'] = new_row_index
    df.loc[index, data_columns] = [new_row_data[col] for col in data_columns]  # add new data here
    # Updated dataframe if you need it:
    #print(df.loc[:index])

    index += 1
    while index >= len(df):
        df = pd.concat([df, df_empty.reindex(range(index, index + buffer_size))])

# To remove the integer index use:
df = df.loc[:index-1].set_index('DateTime', drop=True)

(Code above takes about 540 ms).

You will find both of these are much faster overall than using concat or append each iteration (not what DataFrames were designed for).
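
If you want to check this on your own data, one way is to time the per-iteration concat baseline and compare it with the timings quoted above (a sketch; it reuses new_row_data, new_row_index and data_columns from the code above, and the 500-row count matches the examples):

import time

def concat_each_iteration():
    # grow the DataFrame one row at a time with concat (the slow approach)
    df = pd.DataFrame(columns=data_columns)
    for _ in range(500):
        row = pd.DataFrame(new_row_data, index=[new_row_index])
        df = pd.concat([df, row])
    return df

start = time.perf_counter()
concat_each_iteration()
print(f"concat per iteration: {time.perf_counter() - start:.3f} s")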

Bill