
I have a dataframe df with a column "Id":

     Id
0    -KkJz3CoJNM
1    08QMXEQbEWw
2    0ANuuVrIWJw
3    0pPU8CtwXTo
4    1-wYH2LEcmk

I need to convert the column "Id" into a set(), but

set_id = set(df["Id"])
print(set_id)

returns

{'Id'}

instead of a set() containing the strings from column "Id". Why?

Vega

1 Answer


For me it works correctly if there is only one "Id" column:

set_id = set(df["Id"])
print(set_id)
{'1-wYH2LEcmk', '08QMXEQbEWw', '0pPU8CtwXTo', '0ANuuVrIWJw', '-KkJz3CoJNM'}

But if more than one column is named "Id", then df["Id"] returns a DataFrame instead of a Series. Iterating over a DataFrame yields its column names, so set(df["Id"]) returns the unique column names:

import pandas as pd

# test with two "Id" columns, built from the sample data
df = pd.concat([df, df], axis=1)
print(df["Id"])
            Id           Id
0  -KkJz3CoJNM  -KkJz3CoJNM
1  08QMXEQbEWw  08QMXEQbEWw
2  0ANuuVrIWJw  0ANuuVrIWJw
3  0pPU8CtwXTo  0pPU8CtwXTo
4  1-wYH2LEcmk  1-wYH2LEcmk

set_id = set(df["Id"])
print(set_id)
{'Id'}

Because:

L = list(df["Id"])
print(L)
['Id', 'Id']

works the same as

L = list(df["Id"].columns)
print(L)
['Id', 'Id']

and similarly for sets:

set_id = set(df["Id"].columns)
print(set_id)
{'Id'}
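
The behaviour above can be reproduced end to end with a minimal, self-contained sketch. The names `single`, `doubled` and `has_duplicates` are only illustrative, and reading the values by position with `iloc` is one possible workaround, assuming the first "Id" column is the one you want:

```python
import pandas as pd

ids = ["-KkJz3CoJNM", "08QMXEQbEWw", "0ANuuVrIWJw", "0pPU8CtwXTo", "1-wYH2LEcmk"]
single = pd.DataFrame({"Id": ids})
# Recreate the duplicated "Id" column from the example above.
doubled = pd.concat([single, single], axis=1)

# A unique label selects a Series; a duplicated label selects a DataFrame.
print(type(single["Id"]).__name__)   # Series
print(type(doubled["Id"]).__name__)  # DataFrame

# Iterating over a DataFrame yields its column labels, not its values.
print(list(doubled["Id"]))           # ['Id', 'Id']

# Check for duplicated column names before building the set.
has_duplicates = bool(doubled.columns.duplicated().any())
print(has_duplicates)                # True

# Selecting by position avoids the ambiguous label.
set_id = set(doubled.iloc[:, 0])
print(set_id == set(ids))            # True
```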

A possible solution is to deduplicate the column names:

c = df.columns.to_series()

df.columns += c.groupby(c).cumcount().astype(str).radd('.').replace('.0','')
print (df)
            Id         Id.1
0  -KkJz3CoJNM  -KkJz3CoJNM
1  08QMXEQbEWw  08QMXEQbEWw
2  0ANuuVrIWJw  0ANuuVrIWJw
3  0pPU8CtwXTo  0pPU8CtwXTo
4  1-wYH2LEcmk  1-wYH2LEcmk
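
If the one-liner above is hard to follow, the same renaming idea can be sketched with a plain loop. The helper `dedupe_labels` is a hypothetical name, not a pandas function, and the sketch assumes no existing column is already called "Id.1":

```python
import pandas as pd

def dedupe_labels(labels):
    """Append .1, .2, ... to repeated labels; the first occurrence is unchanged."""
    seen = {}
    result = []
    for label in labels:
        count = seen.get(label, 0)
        result.append(label if count == 0 else f"{label}.{count}")
        seen[label] = count + 1
    return result

ids = ["-KkJz3CoJNM", "08QMXEQbEWw", "0ANuuVrIWJw", "0pPU8CtwXTo", "1-wYH2LEcmk"]
df = pd.concat([pd.DataFrame({"Id": ids})] * 2, axis=1)

# Replace the duplicated labels with unique ones.
df.columns = dedupe_labels(df.columns)
print(list(df.columns))  # ['Id', 'Id.1']

# "Id" now selects a single Series, so set() sees the values.
set_id = set(df["Id"])
print(set_id == set(ids))  # True
```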

Or, if the duplicated columns always hold the same values, remove them:

df = df.loc[:, ~df.columns.duplicated()]
print (df)
            Id
0  -KkJz3CoJNM
1  08QMXEQbEWw
2  0ANuuVrIWJw
3  0pPU8CtwXTo
4  1-wYH2LEcmk
jezrael
  • I do have "Id" twice for some odd reason. But df = df.drop_duplicates() does not work somehow? Still twice "Id"? – Vega Mar 25 '20 at 12:07
  • @Vega - Then `df = df.loc[:, ~df.columns.duplicated()]` is necessary. – jezrael Mar 25 '20 at 12:08
  • Your solution seems to work but how can .drop_duplicates() not work? Isn't that the exact usecase for this? – Vega Mar 25 '20 at 12:10
  • @Vega - I think you can check [this](https://stackoverflow.com/questions/14984119/python-pandas-remove-duplicate-columns) – jezrael Mar 25 '20 at 12:12
  • That thread does not explain why "drop_duplicates()" does not work? – Vega Mar 25 '20 at 12:37
  • @Vega - OK, it works if you transpose, like `df = df.T.drop_duplicates().T`, because by default pandas removes duplicates by rows, not by columns. – jezrael Mar 25 '20 at 12:39
  • @Vega - Because something like `df = df.drop_duplicates(axis=1)` does not exist. – jezrael Mar 25 '20 at 12:40
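
The point discussed in the comments can be checked with a short sketch. The variable names are only illustrative. Note the two column-wise approaches differ: the transpose compares column values, while `columns.duplicated()` compares column names only.

```python
import pandas as pd

ids = ["-KkJz3CoJNM", "08QMXEQbEWw", "0ANuuVrIWJw", "0pPU8CtwXTo", "1-wYH2LEcmk"]
df = pd.concat([pd.DataFrame({"Id": ids})] * 2, axis=1)

# drop_duplicates() compares rows. All five rows differ, so nothing is
# dropped and both "Id" columns remain.
rows_deduped = df.drop_duplicates()
print(rows_deduped.shape)  # (5, 2)

# Transposing turns columns into rows, so drop_duplicates() can compare them.
by_values = df.T.drop_duplicates().T
print(by_values.shape)     # (5, 1)

# Keep only the first column for each repeated name.
by_name = df.loc[:, ~df.columns.duplicated()]
print(by_name.shape)       # (5, 1)
```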