Python pandas: fill a dataframe row by row

Question

The simple task of adding a row to a pandas.DataFrame object seems to be hard to accomplish. There are 3 stackoverflow questions relating to this, none of which give a working answer.

Here is what I'm trying to do. I have a DataFrame of which I already know the shape as well as the names of the rows and columns.

>>> df = pandas.DataFrame(columns=['a','b','c','d'], index=['x','y','z'])
>>> df
     a    b    c    d
x  NaN  NaN  NaN  NaN
y  NaN  NaN  NaN  NaN
z  NaN  NaN  NaN  NaN

Now, I have a function to compute the values of the rows iteratively. How can I fill in one of the rows with either a dictionary or a pandas.Series? Here are various attempts that have failed:

>>> y = {'a':1, 'b':5, 'c':2, 'd':3} 
>>> df['y'] = y
AssertionError: Length of values does not match length of index

Apparently it tried to add a column instead of a row.

>>> y = {'a':1, 'b':5, 'c':2, 'd':3} 
>>> df.join(y)
AttributeError: 'builtin_function_or_method' object has no attribute 'is_unique'

Very uninformative error message.

>>> y = {'a':1, 'b':5, 'c':2, 'd':3} 
>>> df.set_value(index='y', value=y)
TypeError: set_value() takes exactly 4 arguments (3 given)

Apparently that is only for setting individual values in the dataframe.

>>> y = {'a':1, 'b':5, 'c':2, 'd':3} 
>>> df.append(y)
Exception: Can only append a Series if ignore_index=True

Well, I don't want to ignore the index, otherwise here is the result:

>>> df.append(y, ignore_index=True)
     a    b    c    d
0  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN
3    1    5    2    3

It did align the column names with the values, but lost the row labels.

>>> y = {'a':1, 'b':5, 'c':2, 'd':3} 
>>> df.ix['y'] = y
>>> df
                                  a                                 b  \
x                               NaN                               NaN
y  {'a': 1, 'c': 2, 'b': 5, 'd': 3}  {'a': 1, 'c': 2, 'b': 5, 'd': 3}
z                               NaN                               NaN

                                  c                                 d
x                               NaN                               NaN
y  {'a': 1, 'c': 2, 'b': 5, 'd': 3}  {'a': 1, 'c': 2, 'b': 5, 'd': 3}
z                               NaN                               NaN

That also failed miserably.

So how do you do it ?

Note that it's quite inefficient to add data row by row, especially for large data sets. It would be much faster to first load the data into a list of lists and then construct the DataFrame in one line using `df = pd.DataFrame(data, columns=header)` — Timothy C. Quinn, Dec 04 '20 at 20:35
Why is it more efficient to create the dataset in lists, and then seemingly duplicate the entire dataset in memory as a DataFrame? That sounds very inefficient in terms of memory usage, and would presumably be a problem for very large datasets. — Demis, Mar 19 '21 at 05:07
@xApple, I think you ran into the same problem I had (for days), where I didn't understand the difference between Columns and Index - I was thinking in terms of arrays, where these could basically be row/col or vice versa, no difference. I totally agree with you that this basic theory of how the dataframe is expected to be used, and how to generate a DF line by line (typical when reading data from another source) is remarkably unclear! — Demis, Mar 19 '21 at 05:10

score 129 · Accepted Answer · edited Jun 24 '21 at 16:13

df['y'] sets a column.

Since you want to set a row, use .loc.

Note that .ix was equivalent here in older pandas versions (it was removed in pandas 1.0). Your attempt failed because you assigned the dictionary itself to each element of row y, which is probably not what you want. Converting to a Series tells pandas to align the input with the columns (so, for example, you don't have to specify all of the elements).

In [6]: import pandas as pd

In [7]: df = pd.DataFrame(columns=['a','b','c','d'], index=['x','y','z'])

In [8]: df.loc['y'] = pd.Series({'a':1, 'b':5, 'c':2, 'd':3})

In [9]: df
Out[9]: 
     a    b    c    d
x  NaN  NaN  NaN  NaN
y    1    5    2    3
z  NaN  NaN  NaN  NaN
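
As a minimal sketch of the alignment mentioned above (assuming a reasonably recent pandas; the values are illustrative only), a Series that names just some of the columns can be assigned to a row, and the columns it omits stay NaN:

```python
import pandas as pd

df = pd.DataFrame(columns=['a', 'b', 'c', 'd'], index=['x', 'y', 'z'])

# The Series index is matched against the column names,
# so only 'a' and 'c' receive values; 'b' and 'd' stay NaN.
df.loc['y'] = pd.Series({'a': 1, 'c': 2})

print(df)
```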

edited Jun 24 '21 at 16:13

Daniel

answered Jun 13 '13 at 16:19

Jeff

I see. So the `loc` attribute of the data frame defines a special `__setitem__` that does the magic I suppose. – xApple Jun 13 '13 at 16:24
Can you construct this in one pass (i.e. with columns, index and y)? – Andy Hayden Jun 13 '13 at 16:24
So if I can generate one row at a time, how would I construct the data frame optimally ? – xApple Jun 13 '13 at 16:25
Was expecting some variant of `df = pd.DataFrame({'y': pd.Series(y)}, columns=['a','b','c','d'], index=['x','y','z'])` to work? – Andy Hayden Jun 13 '13 at 16:27
@xApple prob best for you to construct a list of dicts (or list), then just pass to the constructor, will be much more efficient – Jeff Jun 13 '13 at 16:58
@Jeff Also, lol to the TOTD comment if you didn't see my reply. :) – Andy Hayden Jun 13 '13 at 17:07
@AndyHayden you deleted your answer so didn't see it, but lol anyhow about the whole TOTD! – Jeff Jun 13 '13 at 17:16
@Jeff Isn't it inefficient to construct a new `pandas.Series` for every row? Wouldn't it be better to fill a pre-created series object? – KeithWM Sep 20 '16 at 13:53
What happens if the argument of `pandas.Series` must be a list of pre-computed values, instead of a dictionary whose items are specified one by one? I am trying `df.loc['y'] = pd.Series(mylist,index=df.index)` but it does not work. – FaCoffee Nov 23 '16 at 11:26
what if you do not know the number of indices? Can you add as you go and not initialize index? – amc May 15 '17 at 20:39
@amc yes, you can also do `df = pandas.DataFrame(columns=['a', 'b', 'c', 'd']); df.loc['y'] = [1, 5, 2, 3]` – Max Ghenis Dec 11 '20 at 21:29
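
The suggestion in the comments above (collect the rows first, then call the constructor once) could look roughly like the sketch below. `compute_row` is a hypothetical stand-in for whatever function produces each row; it is not part of pandas or of the original question.

```python
import pandas as pd

def compute_row(label):
    # Hypothetical row generator: returns a dict keyed by column name.
    base = ord(label) - ord('x')
    return {'a': base, 'b': base + 4, 'c': base + 1, 'd': base + 2}

labels = ['x', 'y', 'z']

# Accumulate plain dicts; no DataFrame is touched inside the loop.
rows = [compute_row(label) for label in labels]

# Build the DataFrame once, attaching the row labels as the index.
df = pd.DataFrame(rows, index=labels, columns=['a', 'b', 'c', 'd'])

print(df)
```

The row labels are kept because they are passed as `index`, which needs one label per collected row.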

score 104 · Answer 2 · edited Apr 15 '22 at 12:13

Update: `append` has been deprecated (as of pandas 1.4), so use `pd.concat` instead:

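import pandas as pd

df = pd.DataFrame(columns=["firstname", "lastname"])

# One-row DataFrame holding the new entry
entry = pd.DataFrame.from_dict({
    "firstname": ["John"],
    "lastname":  ["Johny"]
})

# Add the entry as a new row, renumbering the index from 0
df = pd.concat([df, entry], ignore_index=True)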

edited Apr 15 '22 at 12:13

answered Mar 16 '17 at 15:00

fses91

This worked brilliantly for me and I like the fact that you explicitly `append` the data to the dataframe. – Jonny Brooks Apr 21 '17 at 07:49
Note that this answer needs each row to have the column name appended. Same for the accepted answer. – pashute Nov 14 '17 at 06:53
This works too if you don't know the number of rows in advance. – irene May 26 '18 at 11:36
This is the best you can do if building line by line, but with large data sets, even with `ignore_index=True`, it's definitely way faster to load the data into a list of lists and then construct the DataFrame in one line using `df = pd.DataFrame(data, columns=header)`. It seems that pandas does some pretty heavy lifting when appending rows regardless of index processing. – Timothy C. Quinn Dec 04 '20 at 20:38
@TimothyC.Quinn What about appending a DataFrame to another DataFrame? Would it be more efficient to do `df_to_append = pd.DataFrame([{'firstname': 'John', 'lastname': 'Smith'}, {...}])` followed by `df = df.append(df_to_append, ignore_index=True)`? – Ben Jan 09 '21 at 15:36
@Ben - I have not tested it, but it should be much faster to concatenate two DataFrames, as you show, than to add rows one at a time. However, for small datasets the time difference may not be noticeable. – Timothy C. Quinn Jan 09 '21 at 17:27
Note that [append is deprecated in favor of concat](https://pandas.pydata.org/docs/whatsnew/v1.4.0.html#whatsnew-140-deprecations-frame-series-append) – galath Mar 14 '22 at 13:58
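
Putting the last few comments together, a rough sketch of adding several rows in one step with `pd.concat` (rather than the deprecated `append`) might look like this. The names and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"firstname": ["John"], "lastname": ["Johny"]})

# Build one DataFrame from all the new rows at once ...
new_rows = pd.DataFrame([
    {"firstname": "Jane", "lastname": "Doe"},
    {"firstname": "Ann", "lastname": "Smith"},
])

# ... and concatenate a single time instead of once per row.
df = pd.concat([df, new_rows], ignore_index=True)

print(df)
```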

score 49 · Answer 3 · edited Nov 20 '19 at 07:28

This is a simpler version

import pandas as pd
df = pd.DataFrame(columns=('col1', 'col2', 'col3'))
for i in range(5):
    # Assigning to a label that does not exist yet adds a new row
    df.loc[i] = ['<some value for first>', '<some value for second>', '<some value for third>']

edited Nov 20 '19 at 07:28

Rajitha Fernando

answered Nov 09 '16 at 07:25

Satheesh

just want to ask, is this CPU and memory efficient? – czxttkl Jun 29 '17 at 18:52
how do i know df's last row so I append to the last row each time? – pashute Nov 14 '17 at 06:52
Compared to the other two options of `append()` (which possibly duplicates the whole database (as you reassign to itself) on every loop iteration), and the other common option of creating two identical datastructures (a `List` and then a `DataFrame`) of the same data, this seems much more "efficient" in terms of memory usage, but speed might be another issue entirely. – Demis Mar 19 '21 at 05:19
Maybe you can do `df.loc[-1]`? – Demis Mar 19 '21 at 05:20
You can add data to the end of the DataFrame with: `df.loc[ len(df) ] = ["My", "new", "Data"]` – Demis Mar 20 '21 at 08:46
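
A small sketch of the `df.loc[len(df)]` idea from the comment above. It assumes the DataFrame uses the default integer labels 0, 1, 2, ...; with any other index, `len(df)` may not be a fresh label and the assignment could overwrite an existing row.

```python
import pandas as pd

df = pd.DataFrame(columns=['col1', 'col2', 'col3'])

# len(df) is the next unused integer label while the index is 0..n-1,
# so each assignment adds a row at the end.
df.loc[len(df)] = ["My", "new", "Data"]
df.loc[len(df)] = ["More", "new", "Data"]

print(df)
```

Note that `.loc` is label-based, so `df.loc[-1]` refers to a row labelled -1 rather than to the last row.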

score 39 · Answer 4 · answered Aug 03 '17 at 21:46

If your input rows are lists rather than dictionaries, then the following is a simple solution:

import pandas as pd
list_of_lists = []
list_of_lists.append([1,2,3])
list_of_lists.append([4,5,6])

pd.DataFrame(list_of_lists, columns=['A', 'B', 'C'])
#    A  B  C
# 0  1  2  3
# 1  4  5  6

answered Aug 03 '17 at 21:46

stackoverflowuser2010

But what do I do if I have a multi index? `df1 = pd.DataFrame(list_of_lists, columns=['A', 'B', 'C'], index=['A', 'B'])` does not work. Wrong shape. So how? – pashute Nov 14 '17 at 06:56
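
One possible approach to the question in the comment above, sketched under the assumption that every row has a known pair of labels: pass a `pd.MultiIndex` with exactly one entry per row as `index`. The level names and labels below are invented for illustration.

```python
import pandas as pd

list_of_lists = [[1, 2, 3], [4, 5, 6]]

# One (group, item) tuple per row; the index length must match the row count.
row_labels = pd.MultiIndex.from_tuples(
    [('g1', 'x'), ('g1', 'y')], names=['group', 'item']
)

df1 = pd.DataFrame(list_of_lists, columns=['A', 'B', 'C'], index=row_labels)

print(df1)
```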

score 2 · Answer 5 · answered Jun 08 '21 at 17:40

The logic behind the code is simple and straightforward:

Make a df with 1 row using the dictionary.

Then create a df of shape (1, 4) that contains only NaN and has the dictionary keys as its columns.

Then concatenate a NaN df, the dict df, and another NaN df.

import pandas as pd
import numpy as np

raw_datav = {'a':1, 'b':5, 'c':2, 'd':3} 

datav_df = pd.DataFrame(raw_datav, index=[0])

nan_df = pd.DataFrame([[np.nan]*4], columns=raw_datav.keys())

df = pd.concat([nan_df, datav_df, nan_df], ignore_index=True)

df.index = ["x", "y", "z"]

print(df)

gives

     a    b    c    d
x  NaN  NaN  NaN  NaN
y  1.0  5.0  2.0  3.0
z  NaN  NaN  NaN  NaN
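
As an alternative sketch (a different technique from the concatenation above, not the answerer's method), the same result can probably be reached by building the single labelled row and then calling `reindex`, which fills the missing labels with NaN:

```python
import pandas as pd

raw_datav = {'a': 1, 'b': 5, 'c': 2, 'd': 3}

# One-row DataFrame labelled 'y', then expand to the full set of row labels.
df = pd.DataFrame(raw_datav, index=['y']).reindex(['x', 'y', 'z'])

print(df)
```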
