Pandas create new dataframe based on unique value in a column of existing dataframe efficiently

Question

I have a Dataframe (df) that looks like this.

  Main  Col_1  Col_2  Col_3
0   v1      1      0      0
1   v2      0      1      1
2   v1      1      1      0
3   v2      1      0      1
4   v5      1      0      0
5   v2      1      0      0

I'm creating a new DataFrame based on the unique values in the Main column, i.e. iterating through every row and, whenever a new value is encountered in the Main column, adding that row to the new DataFrame.

New DataFrame (new_df) should look like this.

  Main  Col_1  Col_2  Col_3
0   v1      1      0      0
1   v2      0      1      1
2   v5      1      0      0

My current approach is iterating through every row and ...

unique_message_list = []
new_df_list = []

for index, row in df.iterrows():
    if row['Main'] not in unique_message_list:
        unique_message_list.append(row['Main'])
        new_df_list.append(row.tolist())

new_df = pd.DataFrame(new_df_list, columns=['Main', 'Col_1', 'Col_2', 'Col_3'])

But df has 1 million rows, so iterating over it takes a long time. How can I do this efficiently?

`df=df.drop_duplicates(subset='Main')` – Anurag Dabas, Jun 30 '21 at 05:54
`df.drop_duplicates(subset='Main')` – Ch3steR, Jun 30 '21 at 05:55

score 2 · Accepted Answer · answered Jun 30 '21 at 05:58

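The easiest way is to group by the Main column and take the first occurrence of each column's value within every group. Alternatively, drop_duplicates keeps the first row for each Main value and preserves the original row index.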

Group By

>>> import pandas as pd
>>> 
>>> d = {
...   'Main':['v1','v2','v1','v2','v5','v2']
...   ,'Col1':[1,0,1,1,1,1]
...   ,'Col2':[0,1,1,0,0,0]
...   ,'Col3':[0,1,0,1,0,0]
... }
>>> 
>>> df = pd.DataFrame(d)
>>> 
>>> df.groupby('Main').agg('first')
      Col1  Col2  Col3
Main                  
v1       1     0     0
v2       0     1     1
v5       1     0     0
>>> df.groupby('Main').agg('first').reset_index()
  Main  Col1  Col2  Col3
0   v1     1     0     0
1   v2     0     1     1
2   v5     1     0     0
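
One caveat worth illustrating: groupby sorts the group keys by default, and the 'first' aggregation takes the first non-null value per column rather than the first row. The sketch below uses a small made-up frame (not the data from the question) to show both effects next to drop_duplicates; sort=False is the option that keeps groups in order of first appearance.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'v5' appears before 'v1', and its first row has a NaN.
df = pd.DataFrame({
    'Main': ['v5', 'v1', 'v5', 'v1'],
    'Col1': [np.nan, 1.0, 2.0, 3.0],
})

# sort=False keeps the groups in order of first appearance (v5, then v1).
# 'first' skips NaN, so v5 gets 2.0, taken from its second row.
by_group = df.groupby('Main', sort=False).agg('first').reset_index()

# drop_duplicates keeps the whole first row, NaN included.
by_row = df.drop_duplicates(subset='Main').reset_index(drop=True)

print(by_group)
print(by_row)
```

If the data has no missing values and the key order does not matter, the two approaches return the same rows.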

Drop Duplicates

>>> df.drop_duplicates(subset='Main')
  Main  Col1  Col2  Col3
0   v1     1     0     0
1   v2     0     1     1
4   v5     1     0     0
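
Note that the result above keeps the original row labels (0, 1, 4), while the expected new_df in the question is numbered 0, 1, 2. A minimal sketch of one way to get that, assuming the column names from the question (Col_1, Col_2, Col_3), is to chain reset_index(drop=True):

```python
import pandas as pd

# Sample data using the column names from the question.
df = pd.DataFrame({
    'Main':  ['v1', 'v2', 'v1', 'v2', 'v5', 'v2'],
    'Col_1': [1, 0, 1, 1, 1, 1],
    'Col_2': [0, 1, 1, 0, 0, 0],
    'Col_3': [0, 1, 0, 1, 0, 0],
})

# keep='first' is the default; it is spelled out here for clarity.
# reset_index(drop=True) renumbers the rows and discards the old labels.
new_df = df.drop_duplicates(subset='Main', keep='first').reset_index(drop=True)

print(new_df)
```

This avoids the Python-level loop of iterrows entirely, which is where the time goes on a frame with a million rows.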
