16

I have the following Dataframes:

Dataframe 1:

|---------------------|------------------|
|      property_id    |        beds      |
|---------------------|------------------|
|          1          |         1        |
|---------------------|------------------|
|          2          |         2        | 
|---------------------|------------------|

Dataframe 2:

|---------------------| 
|      property_id    |
|---------------------|
|          3          |
|---------------------|
|          4          |
|---------------------|

What I want to produce is the following Dataframe:

|---------------------|------------------|
|      property_id    |        beds      |
|---------------------|------------------|
|          1          |         1        |
|---------------------|------------------|
|          2          |         2        | 
|---------------------|------------------|
|          3          |         0        |
|---------------------|------------------|
|          4          |         0        | 
|---------------------|------------------|

What I want is to concatenate two Dataframes, and the former has more columns than the latter, but all the columns of the latter are in the former. When the column is not present in the latter dataframe I want to set a default value of 0. How can I achieve this?

df1 = pd.DataFrame({'property_id': [1, 2], 'beds': [1, 2]})
df2 = pd.DataFrame({'property_id': [3, 4]})

I have almost no experience with pandas, so what could I do?

lmiguelvargasf
  • 63,191
  • 45
  • 217
  • 228

2 Answers2

18

You can use pandas.concat or append method for this, both methods will generate NA for columns that don't exist in the sub data frame, to fill them with zero, you can use fillna(0):

df1.append(df2).fillna(0)

#  beds     property_id
#0  1.0          1
#1  2.0          2
#0  0.0          3
#1  0.0          4


pd.concat([df1, df2]).fillna(0)

#  beds     property_id
#0  1.0         1
#1  2.0         2
#0  0.0         3
#1  0.0         4
Psidom
  • 209,562
  • 33
  • 339
  • 356
  • thanks, it worked? which one is more efficient when the time is a concern? – lmiguelvargasf Apr 24 '17 at 02:14
  • 1
    Some naive timing shows they are about similarly fast, but if you have a list of data frames more than two, `pd.concat` has an advantage since it can be done in one single command as `pd.concat(list_of_dataframes)` while `append` can't. But strictly speaking, I don't have a lot of knowledge of the time comparison of the two methods. I use `pd.concat` more often. – Psidom Apr 24 '17 at 02:20
  • Why order of columns is changed after concatenation? How can we get the orginal order of columns as in df1? – Math122 Dec 08 '21 at 22:53
  • 1
    From version 1.4.0 on, `pd.append` is [deprecated](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html). – Philipp Jul 25 '22 at 14:57
5
df1.append(df2.reindex_axis(df1.columns, 1, fill_value=0))

The advantage is that integer types should be preserved

piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • 1
    what do you mean by "integer types should be preserved"? – lmiguelvargasf Apr 24 '17 at 02:18
  • 2
    @lmiguelvargasf what I mean is that when you concat, append, reindex, or otherwise introduce a potential nan in a column of integers, you automatically convert that column to floats. By reindexing with the other dataframes columns and using the fill_value option, we tell pandas to not convert but keep as ints and fill the missing with zero. Also, this would work equally well with concat. – piRSquared Apr 24 '17 at 02:24
  • Sadly, `append` does not work as suggested: resulting table has columns number of the narrower (not wider) of two tables. And `reindex_axis` is not just deprecated, it no longer exists... this needs refactoring IMO. – mirekphd Aug 22 '21 at 09:11