You are on the right track, except that DataFrame.align
doesn't combine two dataframes, rather it aligns them so that the two dataframes have the same row and/or column configuration. Let's try an example:
Initialising two dataframes with some descriptive column names and toy data:
df1 = pd.DataFrame([[1,2,3,4], [6,7,8,9]], columns=['D', 'B', 'E', 'A'], index=[1,2])
df2 = pd.DataFrame([[10,20,30,40], [60,70,80,90], [600,700,800,900]], columns=['A', 'B', 'C', 'D'], index=[2,3,4])
Now, let's view these data frames by themselves:
print(df1)
D B E A
1 1 2 3 4
2 6 7 8 9
print(df2)
A B C D
2 10 20 30 40
3 60 70 80 90
4 600 700 800 900
Let's align these two dataframes, aligning by columns (axis=1
), and performing an outer join on column labels (join='outer'
):
a1, a2 = df1.align(df2, join='outer', axis=1)
print(a1)
print(a2)
A B C D E
1 4 2 NaN 1 3
2 9 7 NaN 6 8
A B C D E
2 10 20 30 40 NaN
3 60 70 80 90 NaN
4 600 700 800 900 NaN
A few things to notice here:
- The columns in
df1
have been rearranged so they align with the columns in df2
.
- There is a column labelled
'C'
that has been added to df1
, and a column labelled 'E'
that has been added to df2
. These columns have been filled with NaN
. This is because we performed an outer join on the column labels.
- None of the values inside the DataFrames have been altered.
- Note that the row labels are not aligned;
df2
has rows 3
and 4
, whereas df1
does not. This is because we requested alignment on columns (axis=1
).
What happens if we align on both rows and columns, but change the join
parameter to 'right'
?
a1, a2 = df1.align(df2, join='right', axis=None)
print(a1)
print(a2)
A B C D
2 9.0 7.0 NaN 6.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
A B C D
2 10 20 30 40
3 60 70 80 90
4 600 700 800 900
Note that:
- Only the columns and rows that are found in the "right" dataframe (
df2
) are retained. Column 'E'
is no longer present. This is because we made a right join on both the column and row labels.
- Rows with labels
3
and 4
have been added to df1
, filled with Nan
. This is because we requested alignment on both rows and columns (axis=None
).
- Row labels are now aligned as well as column labels.
- Again, note that none of the actual values within the dataframes have been altered.
Finally, let's have a look at the code in the question, with join='inner'
and axis=1
:
a1, a2 = df1.align(df2, join='inner', axis=1)
print(a1)
print(a2)
D B A
1 1 2 4
2 6 7 9
D B A
2 40 20 10
3 90 70 60
4 900 700 600
- Only column labels are aligned (
axis=1
).
- Only column labels that are present in both
df1
and df2
are retained (join='inner'
).
In summary, use DataFrame.align()
when you want to make sure the arrangement of rows and/or columns is the same between two dataframes, without altering any of the data contained within the two dataframes.