838

I'm starting from the pandas DataFrame documentation here: Introduction to data structures

I'd like to iteratively fill the DataFrame with values in a time series kind of calculation. I'd like to initialize the DataFrame with columns A and B, rows indexed by timestamps, and all values 0 or all NaN.

I'd then add initial values and go over this data calculating the new row from the row before, say row[A][t] = row[A][t-1]+1 or so.

I'm currently using the code as below, but I feel it's kind of ugly and there must be a way to do this with a DataFrame directly or just a better way in general.

import pandas as pd
import datetime as dt
import numpy as np  # the original used scipy.zeros, which has since been removed from scipy

base = dt.datetime.today().date()
dates = [base - dt.timedelta(days=x) for x in range(9, -1, -1)]  # oldest to newest

valdict = {}
symbols = ['A', 'B', 'C']
for symb in symbols:
    valdict[symb] = pd.Series(np.zeros(len(dates)), dates)

for thedate in dates:
    if thedate > dates[0]:
        for symb in valdict:
            valdict[symb][thedate] = 1 + valdict[symb][thedate - dt.timedelta(days=1)]
Matthias Kauer
  • Never grow a DataFrame! It is always cheaper to append to a python list and then convert it to a DataFrame at the end, both in terms of memory and performance. – cs95 Feb 29 '20 at 12:04
  • @cs95 What is functionally different between `.append` in pandas and appending to a list? I know `.append` in pandas copies the whole dataset to a new object; does Python's append work differently? – Lamma Apr 03 '20 at 09:16
  • @Lamma please find details in my answer below. When appending to a df, a new DataFrame is created each time in memory instead of using the existing one, which is quite frankly a waste. – cs95 Jun 05 '20 at 02:38
  • append is now officially deprecated: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html – rubengavidia0x Mar 08 '22 at 20:10

8 Answers

921

NEVER grow a DataFrame row-wise!

TLDR; (just read the bold text)

Most answers here will tell you how to create an empty DataFrame and fill it out, but no one will tell you that it is a bad thing to do.

Here is my advice: Accumulate data in a list, not a DataFrame.

Use a list to collect your data, then initialise a DataFrame when you are ready. Either a list-of-lists or list-of-dicts format will work, pd.DataFrame accepts both.

data = []
for row in some_function_that_yields_data():
    data.append(row)

df = pd.DataFrame(data)

pd.DataFrame converts the list of rows (where each row is a list or dict of scalar values) into a DataFrame. If your function yields DataFrames instead, call pd.concat.
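
For the DataFrame-yielding case, that looks like this (a minimal sketch; the generator name is a hypothetical stand-in):

chunks = []
for chunk in some_function_that_yields_dataframes():
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)  # one concatenation at the end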

Pros of this approach:

  1. It is always cheaper to append to a list and create a DataFrame in one go than it is to create an empty DataFrame (or one of NaNs) and append to it over and over again.

  2. Lists also take up less memory and are a much lighter data structure to work with, append, and remove (if needed).

  3. dtypes are automatically inferred (rather than assigning object to all of them).

  4. A RangeIndex is automatically created for your data, instead of you having to take care to assign the correct index to the row you are appending at each iteration.
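
To make points 3 and 4 concrete (values here are illustrative, using the list-of-lists route):

data = [[1, 12.3, 'xyz'], [2, 45.6, 'uvw']]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

df.dtypes   # A: int64, B: float64, C: object -- inferred, no fixing needed
df.index    # RangeIndex(start=0, stop=2, step=1) -- created automatically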

If you aren't convinced yet, this is also mentioned in the documentation:

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

pandas >= 2.0 Update: append has been removed!

DataFrame.append was deprecated in version 1.4 and removed from the pandas API entirely in version 2.0.

See the docs on Deprecations as well as this github issue that originally proposed its deprecation.
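
If you do need a genuine one-off row addition on pandas >= 2.0 (i.e. outside a loop), the replacement the deprecation message points to is concat with a single-row DataFrame (column names here are illustrative):

df = pd.concat([df, pd.DataFrame([{'A': 1, 'B': 2}])], ignore_index=True)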



These options are horrible

append or concat inside a loop

Here is the biggest mistake I've seen from beginners:

df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': a, 'B': b, 'C': c}, ignore_index=True) # yuck
    # or similarly,
    # df = pd.concat([df, pd.DataFrame([{'A': a, 'B': b, 'C': c}])], ignore_index=True)

Memory is re-allocated for every append or concat operation you have. Couple this with a loop and you have a quadratic complexity operation: appending the i-th row copies all i existing rows first, so building n rows costs roughly 1 + 2 + ⋯ + n ≈ n²/2 row copies.

The other mistake associated with df.append is that users tend to forget append is not an in-place function, so the result must be assigned back. You also have to worry about the dtypes:

df = pd.DataFrame(columns=['A', 'B', 'C'])
df = df.append({'A': 1, 'B': 12.3, 'C': 'xyz'}, ignore_index=True)

df.dtypes
A     object   # yuck!
B    float64
C     object
dtype: object

Dealing with object columns is never a good thing, because pandas cannot vectorize operations on those columns. You will need to do this to fix it:

df.infer_objects().dtypes
A      int64
B    float64
C     object
dtype: object

loc inside a loop

I have also seen loc used to append to a DataFrame that was created empty:

df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]

As before, you have not pre-allocated the amount of memory you need, so the memory is re-grown each time you create a new row. It's just as bad as append, and even uglier.

Empty DataFrame of NaNs

And then, there's creating a DataFrame of NaNs, and all the caveats associated therewith.

df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
df
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN

It creates a DataFrame of object columns, like the others.

df.dtypes
A    object  # you DON'T want this
B    object
C    object
dtype: object

Appending still has all the issues of the methods above.

for i, (a, b, c) in enumerate(some_function_that_yields_data()):
    df.iloc[i] = [a, b, c]


The Proof is in the Pudding

Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.

[Benchmark plot: runtime of accumulating rows in a list vs. growing a DataFrame with append/loc in a loop; the DataFrame-growing methods fall further and further behind as the number of rows increases.]

Benchmarking code for reference.

cs95
  • This is literally in the documentation. "Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once." https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.append.html – endolith Aug 11 '19 at 00:06
  • Also "Note It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension." https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#concatenating-objects – endolith Aug 11 '19 at 00:07
  • So, what do I do when my data "comes in" as 1d lists one at a time, with each one representing a column in a data frame? How do I append them together before converting into a dataframe? It seems that `list1.append(list2)` inserts a list within another list rather than adding a column. Thanks – Confounded Mar 11 '20 at 19:59
  • @Confounded That's a different problem than the one asked here, but it should be ok to assign one column at a time to an empty DataFrame. The issue arises with successive appending of rows. – cs95 Jan 16 '21 at 07:34
  • You're not wrong, but if I am mapping from one DF to another, I can only take advantage of vectorized columns if I have initialized the new DF with the correct number of rows (that matches the other) – Alex Sep 10 '21 at 10:52
  • @Alex you can just assign one column at a time, but it depends on your use case and "mapping from one DF to another" sounds a bit ambiguous – cs95 Oct 01 '21 at 11:08
  • @cs95 in my case I have a requirement of mapping into a new dataframe from two others, but the rows are of varying size. I needed to initialize the full row count in the new one, to 'fill' it using a specialised function. I did not use append however. – Alex Oct 01 '21 at 14:30
  • @Alex it seems like you can reindex the smaller dataframe with the larger dataframe's index -- but again your comment is still a bit ambiguous. Consider opening a new question if you would like some suggestions on how to improve your existing code (if possible). Cheers! – cs95 Oct 01 '21 at 21:15
  • @cs95 appreciate the offer - I think my code is fine, it's working well, and fast over large data sets. I'm not doing anything outside of your suggestions… I'm not growing a data set, as I'm initializing it with the right number of rows. I'm then using .loc in chunks to fill the column. It actually works really nicely. – Alex Oct 02 '21 at 22:17
  • The problem with re-indexing is that backfill or pad etc. - none of the options matched my use case, so I wrote a simple chunking fill function. – Alex Oct 03 '21 at 10:31
  • Sorry @cs95, I am trying to understand the line `for a, b, c in some_function_that_yields_data()`. Does the yielding function have to return these lists a, b, c? Or are a, b, c in the global environment? How do you preserve type? I am needing to transpose, and then getting a pd of objects, so I am not doing it right. Please can you explain the `for a, b, c in some_function_that_yields_data()` line more. – micstr Nov 18 '21 at 15:35
  • @micstr a, b and c are individual, atomic scalar values rather than lists. If you already have the lists of data available, just call `pd.DataFrame([a_list, b_list, c_list])` – cs95 Nov 19 '21 at 09:45
  • What if you are using pandas inexpensively (i.e., datasets are small, ~10 rows or fewer) and new items are appended/removed based on time order in hours/days rather than seconds, and performance isn't important? Pandas dataframes offer some really unique features for filtering and print nice tables for CLI tools that I can't really ignore, and if I want to use it to manage objects, I think this is a use case that is ignored for dataframes unless there are better data structures out there (outside of reimplementing your own using lists and dicts). – LeanMan Feb 04 '22 at 22:24
  • I've responded to this question ad nauseam "what if the data is small" ... but what if it's not, and you don't know how to process it because you've only been working with small datasets until now that you could get by with suboptimal solutions? – cs95 Feb 06 '22 at 02:56
  • One problem I see with this answer is that if the OP is dealing with time-series data, meaning a use case is to append data that comes available in real time, then appending to the list and then creating a dataframe is also computationally expensive. Whether it's done this way or by growing a dataframe iteratively, you deal with the same space complexity problem. – LeanMan Apr 13 '22 at 05:27
  • @LeanMan space complexity of list append is `O(n)` amortized, for dataframes it is `O(n ** 2)`, so it is not the same. – cs95 Apr 15 '22 at 00:13
  • Interesting, where can I find a resource that goes over complexities for dataframes? I have been looking for one myself. – LeanMan Apr 16 '22 at 05:05
  • @LeanMan the most notable mention I am aware of is the Notes section of the [append docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html#pandas-dataframe-append) – cs95 Apr 17 '22 at 06:21
  • So, if appending row-wise is bad but appending column-wise (e.g. `df['new column'] = column`) is okay, why isn't it best practice to append new rows as columns and then to transpose the df in the end (i.e. before printing/saving)? – fratajcz Oct 05 '22 at 11:03
  • Great answer! For me `df = pd.concat(data)` worked instead of `df = pd.DataFrame(data)` – SoakingHummer Feb 14 '23 at 15:08
  • So then you have to check whether a variable that might contain the data is an empty list or an empty pandas dataframe, and they have different ways to do that. – Steve3p0 May 17 '23 at 20:15
422

Here are a couple of suggestions:

Use date_range for the index:

import datetime
import pandas as pd
import numpy as np

todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')

columns = ['A','B', 'C']

Note: we could create an empty DataFrame (with NaNs) simply by writing:

df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0) # With 0s rather than NaNs
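
Note that a frame built this way starts out with object-dtype columns, and fillna(0) does not change that; if you go this route, an explicit cast avoids slow object columns (a hedged aside, not part of the original answer):

df_ = df_.fillna(0).astype(float)  # without the cast, the columns stay object dtype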

To do this type of calculation for the data, use a NumPy array:

data = np.array([np.arange(10)]*3).T

Hence we can create the DataFrame:

In [10]: df = pd.DataFrame(data, index=index, columns=columns)

In [11]: df
Out[11]:
            A  B  C
2012-11-29  0  0  0
2012-11-30  1  1  1
2012-12-01  2  2  2
2012-12-02  3  3  3
2012-12-03  4  4  4
2012-12-04  5  5  5
2012-12-05  6  6  6
2012-12-06  7  7  7
2012-12-07  8  8  8
2012-12-08  9  9  9
Andy Hayden
  • pd.date_range() does not work for me. I tried with DateRange (from Eclipse's autocompletion), but that works with strings as date format, right? The overall approach works though (I changed index to something else). – Matthias Kauer Dec 15 '12 at 08:42
  • Is date_range a new feature that I may get by upgrading, or is it in datetime? (I changed to datetime, b/c I thought that's what you meant.) – Matthias Kauer Dec 15 '12 at 08:47
  • date_range is a factory function for creating datetime indexes and was [a new feature in 0.8.0](http://pandas.pydata.org/pandas-docs/version/0.8.0/whatsnew.html?highlight=date_range), I would definitely recommend upgrading to the latest stable release (0.9.1); there are many bug fixes and new features. :) – Andy Hayden Dec 15 '12 at 09:52
  • In my experience, creating a data frame of the necessary size filled with NaNs and then filling it up with values is much, much slower than creating a data frame with `index` x `0` dimensions (`columns = []`) and attaching one column in each turn of a loop. I mean `df[col_name] = pandas.Series([...])` in a loop iterating through column names. In the former case, not only does the memory allocation take time, but replacing NaNs with new values seems extremely slow. – deeenes Mar 03 '15 at 16:33
  • @deeenes definitely. This answer should probably make that clearer - you very rarely (if ever) want to create an empty DataFrame (of NaNs). – Andy Hayden Mar 03 '15 at 17:33
  • As per this answer http://stackoverflow.com/a/30267881/2302569 you need to assign the result of fillna, or pass param inplace=True – JayJay Jan 02 '17 at 20:09
226

If you simply want to create an empty data frame and fill it with some incoming data frames later, try this:

newDF = pd.DataFrame()                          # creates a new dataframe that's empty
newDF = newDF.append(oldDF, ignore_index=True)  # ignoring the index is optional
# try printing some data from newDF
print(newDF.head())                             # again optional

In this example I am using this pandas doc to create a new data frame and then using append to write data from oldDF into newDF.

If I have to keep appending new data into newDF from more than one oldDF, I just use a for loop and call pandas.DataFrame.append() for each one.

Note: append() was deprecated in version 1.4.0 and removed in pandas 2.0. Use concat().
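
On pandas >= 2.0 the same step is written with concat:

newDF = pd.concat([newDF, oldDF], ignore_index=True)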

geekidharsh
  • Please note that `append` (and similarly `concat`) copies the full dataset to a new object every time; hence, iterating and appending can and will cause a major performance hit. For more info refer to: http://pandas.pydata.org/pandas-docs/stable/merging.html – MoustafaAAtta Sep 18 '17 at 12:21
  • @MoustafaAAtta What are the alternatives to appending data iteratively to the dataframe? – MysteryGuy Aug 13 '18 at 11:24
  • @MoustafaAAtta Is Fred's answer in this post https://stackoverflow.com/questions/10715965/add-one-row-in-a-pandas-dataframe better from this point of view? – MysteryGuy Aug 13 '18 at 11:29
  • @MoustafaAAtta you can perhaps append just rows to a dataframe; it will still create a new object, but for smaller datasets it might be useful. https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#appending-rows-to-a-dataframe – geekidharsh Jan 28 '20 at 21:28
  • Note that the append method is officially deprecated; check the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html – rubengavidia0x Mar 08 '22 at 20:08
165

Initialize empty frame with column names

import pandas as pd

col_names = ['A', 'B', 'C']
my_df = pd.DataFrame(columns=col_names)
my_df

Add a new record to a frame

my_df.loc[len(my_df)] = [2, 4, 5]

You also might want to pass a dictionary:

my_dic = {'A':2, 'B':4, 'C':5}
my_df.loc[len(my_df)] = my_dic 

Append another frame to your existing frame

col_names = ['A', 'B', 'C']
my_df2 = pd.DataFrame(columns=col_names)
my_df = my_df.append(my_df2)
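
On pandas >= 2.0, where append has been removed, the last line becomes:

my_df = pd.concat([my_df, my_df2])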

Performance considerations

If you are adding rows inside a loop, consider performance issues: for around the first 1000 records, my_df.loc performance is better, but it gradually becomes slower as the number of records in the loop increases.

If you plan to do this inside a big loop (say 10M records or so), you are better off using a mixture of the two: fill a temporary dataframe with iloc until its size gets to around 1000, then append it to the original dataframe and empty the temporary one. This can boost your performance by around 10 times; a sketch follows below.
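
A minimal sketch of that chunked approach (source_of_rows() is a hypothetical stand-in for your data source, and concat stands in for the removed append):

import pandas as pd

chunk = 1000
cols = ['A', 'B', 'C']
my_df = pd.DataFrame(columns=cols)
temp = pd.DataFrame(index=range(chunk), columns=cols)
fill = 0

for row in source_of_rows():
    temp.iloc[fill] = row          # cheap: writes into pre-allocated rows
    fill += 1
    if fill == chunk:              # temp is full: append once, then reuse it
        my_df = pd.concat([my_df, temp], ignore_index=True)
        fill = 0

if fill:                           # flush any partially filled remainder
    my_df = pd.concat([my_df, temp.iloc[:fill]], ignore_index=True)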

Afshin Amiri
2

Simply:

import numpy as np
import pandas as pd

rows, columns = 10, 3  # pick your dimensions
df = pd.DataFrame(np.zeros((rows, columns)))

Then fill it.
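
For example, filling one of the pre-allocated rows in place (values chosen for illustration):

df.iloc[0] = [1.0, 2.0, 3.0]  # no reallocation: the row already exists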

razimbres
  • And for the equivalent NaN-initialized array, use [the device driver I wrote](https://github.com/njwhite/devnan/blob/c472a244b0d8a60daca35ddd2a65b14f0f2d0451/test_driver.py#L21-L22)! – Nicholas White Feb 09 '22 at 21:33
1

Assume a dataframe with 19 rows

import pandas as pd

index = range(0, 19)
columns = ['A']
test = pd.DataFrame(index=index, columns=columns)

Keeping column A as a constant:

test['A'] = 10

Keeping column b as a variable given by a loop:

for x in range(0, 19):
    test.loc[[x], 'b'] = pd.Series([x], index=[x])

You can replace the first x in pd.Series([x], index = [x]) with any value
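
For this particular assignment, a simpler equivalent (an aside, not from the original answer) is a plain scalar write, which also creates column b on the first iteration:

for x in range(0, 19):
    test.loc[x, 'b'] = x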

Ajay Ohri
0

This is my way to build a dataframe dynamically from several lists with a loop.

import pandas as pd

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [22, 12, 34, 22, 65, 24, 12, 11]
z = ['as', 'ss', 'wa', 'ss', 'er', 'fd', 'ga', 'mf']
names = ['Bob', 'Liz', 'chop']

A loop:

def dataF(x, y, z, names):
    res = []

    for t in zip(x, y, z):   # one (x, y, z) tuple per row
        res.append(t)

    return pd.DataFrame(res, columns=names)

Result

dataF(x,y,z,names)

   Bob  Liz chop
0    1   22   as
1    2   12   ss
2    3   34   wa
3    4   22   ss
4    5   65   er
5    6   24   fd
6    7   12   ga
7    8   11   mf

Wojciech Moszczyński
0

Pandas dataframes can be thought of as a dictionary of pandas columns (pandas Series). Just as adding a new key-value pair to a dictionary is inexpensive, adding a new column (or columns) to a dataframe is very efficient (and dataframes are meant to be grown horizontally like that).

df = pd.DataFrame()
df['A'] = range(0, 2000_000, 2)   # add one column
df[['B', 'C']] = ['a', 'b']       # add multiple columns

On the other hand, just like updating every value of a dictionary requires looping over the entire dictionary, enlarging a dataframe vertically by adding new rows is very inefficient. It's especially inefficient if new rows are added one-by-one in a loop (see this post for a benchmark comparing the possible options).

If new row values depend on previous row values as in the OP, then depending on the number of columns, it might be better to loop over a pre-initialized dataframe of zeros or grow a Python dictionary in a loop and construct a dataframe after (if there are more than 500 columns, it's probably better to loop over the dataframe). But it's never optimal to mix the two, in other words, growing a dictionary of pandas Series will be extremely slow.¹

dates = pd.date_range(end=pd.Timestamp('now'), periods=10000, freq='D').date
symbols = [f"col{i}" for i in range(10)]

# initialize a dataframe
df = pd.DataFrame(0, index=dates, columns=symbols)
# update it in a loop
for i, thedate in enumerate(df.index):
    if thedate > df.index[0]:
        df.loc[thedate] = df.loc[df.index[i-1]] + 1


# build a nested dictionary
data = {}
for i, thedate in enumerate(dates):
    for symb in symbols:
        if thedate > dates[0]:
            data[symb][thedate] = 1 + data[symb][dates[i-1]]
        else:
            data[symb] = {thedate: 0}
# construct a dataframe after
df1 = pd.DataFrame(data)

¹ That said, for this specific example, cumsum() or even range() would seemingly work without even looping over the rows. This part of the answer is more about cases where looping is unavoidable, such as financial data manipulation etc.
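
For reference, a sketch of that loop-free route for the example above (df_fast is a hypothetical name; dates and symbols are as defined in the code block earlier):

df_fast = pd.DataFrame(1, index=dates, columns=symbols).cumsum() - 1
# equivalently, since every column is just 0, 1, 2, ...:
df_fast = pd.DataFrame({s: range(len(dates)) for s in symbols}, index=dates)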

cottontail