21

Taking the following example:

>>> df1 = pd.DataFrame({"x":[1, 2, 3, 4, 5], 
                        "y":[3, 4, 5, 6, 7]}, 
                      index=['a', 'b', 'c', 'd', 'e'])

>>> df2 = pd.DataFrame({"y":[1, 3, 5, 7, 9], 
                        "z":[9, 8, 7, 6, 5]}, 
                      index=['b', 'c', 'd', 'e', 'f'])

>>> pd.concat([df1, df2], join='inner')

The output is:

   y
a  3
b  4
c  5
d  6
e  7
b  1
c  3
d  5
e  7
f  9

Since axis=0 is the columns, I think that concat() only considers columns that are found in both dataframes. But the actual output considers rows that are found in both dataframes.

What exactly is the meaning of the axis parameter?

ivanleoncz
  • It's not about the axis argument. It's about ```join='inner'```. Look up the docs! ```join: {‘inner’, ‘outer’}, default ‘outer’. How to handle indexes on other axis(es). Outer for union and inner for intersection``` – sascha Sep 02 '16 at 02:15
  • Think Roman Catholic, or "R" - "C", or Row - Columns -> Zero or One. – Merlin Sep 02 '16 at 04:01

5 Answers

36

If someone needs visual description, here is the image:

[Image: Axis 0 or 1 in Pandas Python]

Debashis Sahoo
13

Data:

In [55]: df1
Out[55]:
   x  y
a  1  3
b  2  4
c  3  5
d  4  6
e  5  7

In [56]: df2
Out[56]:
   y  z
b  1  9
c  3  8
d  5  7
e  7  6
f  9  5

Concatenated horizontally (axis=1), using index elements found in both DFs (aligned by indexes for joining):

In [57]: pd.concat([df1, df2], join='inner', axis=1)
Out[57]:
   x  y  y  z
b  2  4  1  9
c  3  5  3  8
d  4  6  5  7
e  5  7  7  6

Concatenated vertically (DEFAULT: axis=0), using columns found in both DFs:

In [58]: pd.concat([df1, df2], join='inner')
Out[58]:
   y
a  3
b  4
c  5
d  6
e  7
b  1
c  3
d  5
e  7
f  9

If you don't use the inner join method - you will have it this way:

In [62]: pd.concat([df1, df2])
Out[62]:
     x  y    z
a  1.0  3  NaN
b  2.0  4  NaN
c  3.0  5  NaN
d  4.0  6  NaN
e  5.0  7  NaN
b  NaN  1  9.0
c  NaN  3  8.0
d  NaN  5  7.0
e  NaN  7  6.0
f  NaN  9  5.0

In [63]: pd.concat([df1, df2], axis=1)
Out[63]:
     x    y    y    z
a  1.0  3.0  NaN  NaN
b  2.0  4.0  1.0  9.0
c  3.0  5.0  3.0  8.0
d  4.0  6.0  5.0  7.0
e  5.0  7.0  7.0  6.0
f  NaN  NaN  9.0  5.0
MaxU - stand with Ukraine
11

This is my trick with axis: just add the operation in your mind to make it sound clear:

  • axis 0 = rows
  • axis 1 = columns

If you “sum” through axis=0, you are summing all rows, and the output will be a single row with the same number of columns. If you “sum” through axis=1, you are summing all columns, and the output will be a single column with the same number of rows.
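
As a small sketch of this rule (the frame and its labels are made up for illustration), note which labels survive in each result: summing through axis=0 leaves one value per column, while summing through axis=1 leaves one value per row.

```python
import pandas as pd

# Hypothetical example data, not taken from the question.
df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]}, index=["a", "b", "c"])

col_totals = df.sum(axis=0)  # rows are collapsed: result is labelled by column name
row_totals = df.sum(axis=1)  # columns are collapsed: result is labelled by row label

print(col_totals)
print(row_totals)
```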

Rod292
4

First, the OP misunderstood the rows and columns in his/her dataframe.

"But the actual output considers rows that are found in both dataframes." (the only common "row" element being 'y')

The OP thought the label y refers to a row. However, y is a column name.

df1 = pd.DataFrame(
         {"x":[1, 2, 3, 4, 5],  # <-- looks like row x but actually col x
          "y":[3, 4, 5, 6, 7]}, # <-- looks like row y but actually col y
          index=['a', 'b', 'c', 'd', 'e'])
print(df1)

            \col   x    y
 index or row\
          a       1     3   |   a
          b       2     4   v   x
          c       3     5   r   i
          d       4     6   o   s
          e       5     7   w   0

               -> column
                 a x i s 1

It is very easy to be misled since in the dictionary, it looks like y and x are two rows.

If you generate df1 from a list of lists, it should be more intuitive:

df1 = pd.DataFrame([[1,3], 
                    [2,4],
                    [3,5],
                    [4,6],
                    [5,7]],
                    index=['a', 'b', 'c', 'd', 'e'], columns=["x", "y"])

So, back to the problem: concat is shorthand for concatenate (meaning to link together in a series or chain [source]). Performing concat along axis 0 means linking two objects along axis 0.

   1
   1   <-- series 1
   1
^  ^  ^
|  |  |               1
c  a  a               1
o  l  x               1
n  o  i   gives you   2
c  n  s               2
a  g  0               2
t  |  |
|  V  V
v 
   2
   2   <--- series 2
   2
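
The diagram above can be reproduced with a minimal sketch like the following (the names s1, s2 and stacked are just illustrative). Note that concat keeps each object's original index labels unless you pass ignore_index=True.

```python
import pandas as pd

s1 = pd.Series([1, 1, 1])  # "series 1" from the diagram
s2 = pd.Series([2, 2, 2])  # "series 2" from the diagram

# axis=0 is the default: the second series is appended below the first.
stacked = pd.concat([s1, s2], axis=0)

print(stacked.tolist())
print(stacked.index.tolist())  # original labels are kept, so they repeat
```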

So... I think you have the feeling now. What about the sum function in pandas? What does sum(axis=0) mean?

Suppose data looks like

   1 2
   1 2
   1 2

Maybe...summing along axis 0, you may guess. Yes!!

^  ^  ^
|  |  |               
s  a  a               
u  l  x                
m  o  i   gives you two values 3 6 !
|  n  s               
v  g  0               
   |  |
   V  V
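
One way to check this is to look at the shapes, as in the sketch below (variable names are illustrative). The axis you sum along is the one that disappears from the result.

```python
import pandas as pd

df = pd.DataFrame([[1, 2],
                   [1, 2],
                   [1, 2]])

down = df.sum(axis=0)    # sum along axis 0: the 3 rows collapse, 2 values remain
across = df.sum(axis=1)  # sum along axis 1: the 2 columns collapse, 3 values remain

print(df.shape, down.shape, across.shape)
print(down.tolist(), across.tolist())
```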

What about dropna? Suppose you have data

   1  2  NaN
  NaN 3   5
   2  4   6

and you only want to keep

2
3
4

The documentation says: Return object with labels on given axis omitted where alternately any or all of the data are missing.

Should you put dropna(axis=0) or dropna(axis=1)? Think about it and try it out with

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan],
                   [np.nan, 3, 5],
                   [2, 4, 6]])

# df.dropna(axis=0) or df.dropna(axis=1) ?

Hint: think about the word along.
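
If you would rather check your guess than reason it out, a sketch like this runs both variants (kept_cols and kept_rows are illustrative names). For dropna, the axis argument names the axis whose labels get dropped, which matches the documentation sentence quoted above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan],
                   [np.nan, 3, 5],
                   [2, 4, 6]])

kept_cols = df.dropna(axis=1)  # drops every column that contains a NaN
kept_rows = df.dropna(axis=0)  # drops every row that contains a NaN

print(kept_cols)  # only the middle column (2, 3, 4) is left
print(kept_rows)  # only the last row (2, 4, 6) is left
```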

Tai
3

Interpret axis=0 as applying the algorithm down each column, or to the row labels (the index). A more detailed schema here.

If you apply that general interpretation to your case, the algorithm here is concat. Thus for axis=0, it means:

for each column, take all the rows down (across all the dataframes for concat), and concatenate them when they are in common (because you selected join='inner').

So the meaning would be to take all the x columns and concatenate them down the rows, which would stack each chunk of rows one after another. However, here x is not present everywhere, so it is not kept in the final result. The same applies to z. For y the result is kept, as y is in all dataframes. This is the result you have.
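
To make this concrete, here is a short sketch using the two frames from the question (shared_cols, stacked and side_by_side are illustrative names). With join='inner', the intersection is taken on the axis you are not concatenating along: the columns for axis=0, the index for axis=1.

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [3, 4, 5, 6, 7]},
                   index=["a", "b", "c", "d", "e"])
df2 = pd.DataFrame({"y": [1, 3, 5, 7, 9], "z": [9, 8, 7, 6, 5]},
                   index=["b", "c", "d", "e", "f"])

# Columns present in both frames: only these survive an inner join on axis=0.
shared_cols = df1.columns.intersection(df2.columns)

stacked = pd.concat([df1, df2], axis=0, join="inner")       # rows stacked, common columns kept
side_by_side = pd.concat([df1, df2], axis=1, join="inner")  # columns appended, common index kept

print(shared_cols.tolist())
print(stacked.shape, side_by_side.shape)
```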

Community
Zeugma