
I have a DataFrame (df) that looks like this:

   Main     Col_1    Col_2     Col_3
0     v1     1        0         0
1     v2     0        1         1
2     v1     1        1         0
3     v2     1        0         1
4     v5     1        0         0
5     v2     1        0         0

I'm creating a new DataFrame based on the unique values in the Main column, i.e. I iterate through every row and, whenever I encounter a value in Main that I haven't seen before, I add that row to the new DataFrame.

New DataFrame (new_df) should look like this.

   Main     Col_1    Col_2     Col_3
0     v1     1        0         0
1     v2     0        1         1
2     v5     1        0         0

My current approach is iterating through every row and ...

unique_message_list = []
new_df_list = []

for index, row in df.iterrows():
    if row['Main'] not in unique_message_list:
        unique_message_list.append(row['Main'])
        new_df_list.append(row.tolist())

new_df = pd.DataFrame(new_df_list, columns=['Main', 'Col_1', 'Col_2', 'Col_3'])

But df has 1 million rows, so iterating over it row by row is slow. How can I do this efficiently?

Shaida Muhammad

1 Answer


The easiest way would be to use groupby on the Main column and take the first occurrence of each column's value within each group.

Group By

>>> import pandas as pd
>>> 
>>> d = {
...   'Main':['v1','v2','v1','v2','v5','v2']
...   ,'Col1':[1,0,1,1,1,1]
...   ,'Col2':[0,1,1,0,0,0]
...   ,'Col3':[0,1,0,1,0,0]
... }
>>> 
>>> df = pd.DataFrame(d)
>>> 
>>> df.groupby('Main').agg('first')
      Col1  Col2  Col3
Main                  
v1       1     0     0
v2       0     1     1
v5       1     0     0
>>> df.groupby('Main').agg('first').reset_index()
  Main  Col1  Col2  Col3
0   v1     1     0     0
1   v2     0     1     1
2   v5     1     0     0
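
One thing worth checking before relying on the groupby version: by default groupby returns the keys sorted rather than in order of first appearance, and 'first' takes the first non-null entry of each column, which is not always the first row. The sketch below uses a small made-up frame (not the asker's data, and with a NaN added on purpose) to show the difference; passing sort=False and using head(1) are suggestions here, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'v2' appears before 'v1', and its first row has a NaN.
df = pd.DataFrame({
    'Main': ['v2', 'v1', 'v2', 'v5'],
    'Col1': [np.nan, 1, 7, 1],
    'Col2': [0, 1, 1, 0],
})

# Default behaviour: group keys are sorted, and 'first' skips NaN,
# so v2 reports Col1 = 7.0 (from row 2) instead of the NaN in row 0.
by_first = df.groupby('Main').agg('first').reset_index()

# sort=False keeps the groups in order of first appearance (v2, v1, v5),
# but 'first' still skips NaN.
by_unsorted = df.groupby('Main', sort=False).agg('first').reset_index()

# head(1) returns the actual first row of every group, NaN included,
# in the original row order and with the original index labels.
by_head = df.groupby('Main').head(1)

print(by_first)
print(by_unsorted)
print(by_head)
```

If the data has no missing values and the order of the result does not matter, the three variants select the same values.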

Drop Duplicates

>>> df.drop_duplicates(subset='Main')
  Main  Col1  Col2  Col3
0   v1     1     0     0
1   v2     0     1     1
4   v5     1     0     0
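
Note that drop_duplicates keeps the original index labels (0, 1, 4 above), while the new_df in the question is numbered 0, 1, 2. A minimal sketch of getting that exact result, assuming the column names from the question (Col_1, Col_2, Col_3) and spelling out the default keep='first' for clarity:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Main': ['v1', 'v2', 'v1', 'v2', 'v5', 'v2'],
    'Col_1': [1, 0, 1, 1, 1, 1],
    'Col_2': [0, 1, 1, 0, 0, 0],
    'Col_3': [0, 1, 0, 1, 0, 0],
})

# Keep the first row seen for each value of Main, in original row order,
# then renumber the index from 0 and discard the old labels.
new_df = df.drop_duplicates(subset='Main', keep='first').reset_index(drop=True)

print(new_df)
```

This selects the same rows as the loop in the question (first occurrence of each Main value, in order of appearance) without iterating in Python.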
Vaebhav