
I have multiple datasets that share some column names, as in the example below. I want to collect the column names that are repeated across all of the datasets into a list, using Python and pandas.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                    'B': 'one one two three two two one three'.split(),
                    'C': np.arange(8),
                    'D': np.arange(8) * 2})
df2 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                    'B': 'one one two three two two one three'.split(),
                    'C': np.arange(8)})
df3 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                    'B': 'one one two three two two one three'.split(),
                    'D': np.arange(8) * 2})

As shown above, the three datasets df1, df2 and df3 all contain the columns 'A' and 'B', so the expected output is ['A', 'B']. Could someone please suggest a solution to this problem? Thanks in advance.

Kedar17

3 Answers


The columns of a pandas DataFrame are of type pandas.core.indexes.base.Index, so you can use its intersection method to find the overlapping elements. Here is an example:

import pandas as pd
import numpy as np

a = np.arange(1,4)
b = np.arange(5,8)
c = np.random.randint(0,10,size=3)
d = np.random.randint(0,10,size=3)
df_1 = pd.DataFrame({'a': a, 'b': b, 'c': c, 'd': d})
df_1

out:

    a   b   c   d
0   1   5   5   1
1   2   6   7   5
2   3   7   6   9

a = np.arange(4,7)
b = np.arange(7,10)
e = np.random.randint(0,10,size=3)
f = np.random.randint(0,10,size=3)
df_2 = pd.DataFrame({'a': a, 'b': b, 'e': e, 'f': f})
df_2

out:

    a   b   e   f
0   4   7   9   9
1   5   8   9   3
2   6   9   2   1

df_1.columns.intersection(df_2.columns)

out:

Index(['a', 'b'], dtype='object')

type(df_1.columns)

out:

pandas.core.indexes.base.Index
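
Applied to the question, the same method can be chained across the three DataFrames. This is only a minimal sketch: the small frames below are stand-ins that reuse the column layout from the question (the row values are made up), and the variable names common and common_list are chosen for illustration.

```python
import pandas as pd

# Stand-in frames with the same column layout as df1, df2, df3 in the question
df1 = pd.DataFrame({'A': ['foo', 'bar'], 'B': ['one', 'two'], 'C': [0, 1], 'D': [0, 2]})
df2 = pd.DataFrame({'A': ['foo', 'bar'], 'B': ['one', 'two'], 'C': [0, 1]})
df3 = pd.DataFrame({'A': ['foo', 'bar'], 'B': ['one', 'two'], 'D': [0, 2]})

# Each intersection returns a new Index, so the calls can be chained
common = df1.columns.intersection(df2.columns).intersection(df3.columns)

# Convert the resulting Index to a plain Python list
common_list = common.tolist()
print(common_list)
```

With these frames the printed list should contain only 'A' and 'B', since 'C' is missing from df3 and 'D' is missing from df2.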
vumaasha

pandas can give you the column names of a DataFrame. For example, df1.columns returns an Index holding 'A', 'B', 'C' and 'D', and list(df1.columns) turns it into the plain list ['A', 'B', 'C', 'D']. You can get the column names of each DataFrame in the same way.

Then you just need to find the intersection of all these lists.
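
One possible way to do this with plain Python sets is sketched below. The frames are small stand-ins with the same column layout as in the question, and the names frames, column_lists, shared and repeated are made up for this example. Because sets are unordered, the sketch filters the first frame's column list to keep the original column order.

```python
import pandas as pd

# Stand-in frames with the same column layout as df1, df2, df3 in the question
df1 = pd.DataFrame({'A': ['foo', 'bar'], 'B': ['one', 'two'], 'C': [0, 1], 'D': [0, 2]})
df2 = pd.DataFrame({'A': ['foo', 'bar'], 'B': ['one', 'two'], 'C': [0, 1]})
df3 = pd.DataFrame({'A': ['foo', 'bar'], 'B': ['one', 'two'], 'D': [0, 2]})

frames = [df1, df2, df3]

# One list of column names per DataFrame
column_lists = [list(df.columns) for df in frames]

# Names present in every list; set.intersection accepts several iterables
shared = set(column_lists[0]).intersection(*column_lists[1:])

# Sets are unordered, so keep the order in which the first frame lists its columns
repeated = [name for name in column_lists[0] if name in shared]
print(repeated)
```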

Shridhar R Kulkarni

I think the simplest approach is to use & for the intersection of all column names:

a = df1.columns & df2.columns & df3.columns
print (a)
Index(['A', 'B'], dtype='object')

If you need a list:

a = (df1.columns & df2.columns & df3.columns).tolist()
print (a)
['A', 'B']
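
A caveat worth checking against your pandas version: as far as I know, using & on Index objects as a set operation was deprecated in later 1.x releases and no longer behaves as an intersection in pandas 2.x. A version-independent sketch of the same idea uses Index.intersection with functools.reduce, which also handles any number of DataFrames. The frames and the names frames, common and result below are stand-ins for illustration.

```python
from functools import reduce

import pandas as pd

# Stand-in frames with the same column layout as df1, df2, df3 in the question
df1 = pd.DataFrame({'A': ['foo', 'bar'], 'B': ['one', 'two'], 'C': [0, 1], 'D': [0, 2]})
df2 = pd.DataFrame({'A': ['foo', 'bar'], 'B': ['one', 'two'], 'C': [0, 1]})
df3 = pd.DataFrame({'A': ['foo', 'bar'], 'B': ['one', 'two'], 'D': [0, 2]})

frames = [df1, df2, df3]

# Fold Index.intersection over the column Index of every frame
common = reduce(lambda left, right: left.intersection(right),
                (df.columns for df in frames))

# Plain list of the column names shared by all frames
result = common.tolist()
print(result)
```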
jezrael