I want to change the target value in a data set within a certain interval. With 500 data points it takes about 1.5 seconds, but I have around 100,000 data points. Most of the execution time is spent in this step, and I want to speed it up.

What is the fastest and most efficient way to append rows to a DataFrame? I tried the solution in this link and tried building a dictionary, but I couldn't get it to work.

Here is the code, which takes around 1.5 seconds for 500 data points.

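import numpy as np
import pandas as pd

def add_new(df,base,interval):
    df_appended = pd.DataFrame()
    np.random.seed(5)
    # four samples around base, plus base itself
    s = np.random.normal(base,interval/3,4)
    s = np.append(s,base)
    for i in range(0,5):
        df_new = df  # note: an alias of df, not a copy
        df_new["DeltaG"] = s[i]
        # append copies the whole accumulated frame on every call (the slow step)
        df_appended = df_appended.append(df_new)
    return df_appended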
kukuro

2 Answers

1

DataFrames in pandas store their data in contiguous blocks of memory, so appending to or concatenating DataFrames is very inefficient: each such operation creates a new DataFrame and copies all the data from the old ones. Basic Python structures such as lists and dicts do not work this way; when you append a new element, Python just stores a reference to it.

So my advice: do all your data processing on lists or dicts and convert them to a DataFrame at the end.
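A minimal sketch of that advice applied to the function from the question, assuming the goal is one copy of the input per sampled DeltaG value. The name add_new_concat, the n_draws and seed parameters, and the sample data are hypothetical; the idea is to collect the pieces in a list and call pd.concat once instead of appending inside the loop.

```python
import numpy as np
import pandas as pd


def add_new_concat(df, base, interval, n_draws=4, seed=5):
    """Return one copy of df per DeltaG value, stacked with a single concat."""
    rng = np.random.RandomState(seed)  # local generator, global seed untouched
    values = np.append(rng.normal(base, interval / 3, n_draws), base)
    # assign() returns a new frame each time, so the caller's df is not modified.
    pieces = [df.assign(DeltaG=v) for v in values]
    # One concat at the end copies the data once instead of once per iteration.
    return pd.concat(pieces)


# Hypothetical sample input with two rows.
df = pd.DataFrame({"x": [1.0, 2.0], "DeltaG": [0.0, 0.0]})
out = add_new_concat(df, base=10.0, interval=3.0)
```

Here out holds five stacked copies of df (ten rows), and the last copy carries base itself as its DeltaG value.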

Another option is to create a preallocated DataFrame of the final size and just change values in it using .iloc. This only works if you know the final size of the resulting DataFrame in advance.
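The preallocation idea could look roughly like this. The sizes, column names, and target values below are made up for illustration; the only assumption that matters is that the final number of rows is known before the loop starts.

```python
import numpy as np
import pandas as pd

n_rows, n_copies = 3, 5                      # hypothetical sizes
source = pd.DataFrame({"x": np.arange(n_rows, dtype=float)})
targets = np.linspace(1.0, 5.0, n_copies)    # stand-in for the sampled values

# Allocate the final frame once, filled with NaN.
result = pd.DataFrame(
    np.nan, index=range(n_rows * n_copies), columns=["x", "DeltaG"]
)

x_col = result.columns.get_loc("x")
dg_col = result.columns.get_loc("DeltaG")
for i, target in enumerate(targets):
    start, stop = i * n_rows, (i + 1) * n_rows
    # Write each block into its slot; no new DataFrame is created here.
    result.iloc[start:stop, x_col] = source["x"].to_numpy()
    result.iloc[start:stop, dg_col] = target
```

Whether this beats a single pd.concat depends on the data; it is worth timing both on a realistic sample.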

Good examples with code: Add one row to pandas DataFrame

If you need more code examples - let me know.

Mikhail_Sam
  • Well, thanks for your answer. Normally the "df" argument is a single row of the DataFrame. Could you please tell me how to easily get a single row of a DataFrame? I use df[i:i+1] for each row. I think that's inefficient as well. – kukuro Jan 20 '21 at 15:46
  • @kukuro I suppose the best way to select one row from a DataFrame is to use `.loc` for selecting by label (the index value for that row) or `.iloc` for selecting by integer position. Read more about this here, for example: https://stackoverflow.com/questions/16096627/selecting-a-row-of-pandas-series-dataframe-by-integer-index – Mikhail_Sam Jan 27 '21 at 08:56
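To illustrate the `.loc` / `.iloc` suggestion from the comments, here is a small sketch on a made-up three-row frame (the column names and index labels are hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [10, 20, 30], "b": [1.5, 2.5, 3.5]},
    index=["r0", "r1", "r2"],
)

row_by_label = df.loc["r1"]      # Series, selected by index label
row_by_position = df.iloc[1]     # Series, selected by integer position
one_row_frame = df.iloc[[1]]     # one-row DataFrame, similar to df[1:2]
```

Passing a list to `.iloc` keeps the result a DataFrame, which matters if the code that follows expects columns rather than a Series.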
1
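import numpy as np

def add_new(df1,base,interval,has_interval):
    # collect the five variants in a dict instead of appending to a DataFrame
    dictionary = {}
    if has_interval == 0:
        for i in range(0,5):
            dictionary[i] = (df1.copy())
    elif has_interval == 1:
        np.random.seed(5)
        # four samples around base, plus base itself
        s = np.random.normal(base,interval/3,4)
        s = np.append(s,base)
        for i in range(0,5):
            df_new = df1  # alias: the assignment below also modifies df1
            df_new[4] = s[i]

            dictionary[i] = (df_new.copy())
    return dictionary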

It works. It takes around 10 seconds for the whole data set. Thanks for your answers.
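If a single DataFrame is needed at the end, a dictionary like the one returned by add_new can be combined in one step, since pd.concat accepts a mapping of frames. The frames dictionary below is a hypothetical stand-in with made-up columns, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the dictionary returned by add_new.
frames = {
    i: pd.DataFrame({"x": [1.0, 2.0], "DeltaG": [float(i)] * 2})
    for i in range(5)
}

# concat accepts a mapping; the keys become the outer level of the index.
combined = pd.concat(frames)

# Replace the two-level index with a plain 0..n-1 index if that is preferred.
flat = combined.reset_index(drop=True)
```

The outer index level keeps track of which dictionary key each row came from, so combined.loc[3] returns the rows of the fourth variant.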

kukuro