
I am trying to remove duplicate data from my dataframe (loaded from a CSV) and get a separate CSV that shows the unique answers of each column. The problem is that my code has been running for a day (22 hours, to be exact). I'm open to other suggestions.

My data has about 20,000 rows with headers (example). I have tried checking the unique values of a single column before, like df[col].unique(), and that does not take anywhere near as long.

import pandas as pd

df = pd.read_csv('Surveydata.csv')
# Deduplicate each column independently, then renumber from 0
df_uni = df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
df_uni.to_csv('Surveydata_unique.csv', index=False)

What I expect is a dataframe that has the same set of columns but without any duplication in each field (example). E.g. if df['Rmoisture'] has a combination of Yes, No, and NaN, the corresponding column of the other dataframe df_uni should contain only those three values.
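For concreteness, a minimal sketch of that expectation; the column name Rmoisture is the one from above, the sample values are made up:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Rmoisture': ['Yes', 'No', np.nan, 'Yes', np.nan, 'No']})
df_uni = df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
print(df_uni)
#   Rmoisture
# 0       Yes
# 1        No
# 2       NaN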

– AOJ keygen
  • @AOJkeygen - Is order of values important? – jezrael Jan 15 '19 at 11:52
  • Slightly related question: what would be the best way to refactor this solution so that it mimics calling df.drop_duplicates()? I am trying to optimize a script that contains a lot of drop_duplicates() calls. – Paul Russell Oct 06 '21 at 22:57

2 Answers


Another method:

# Collect the unique values of each column into a one-column frame,
# then align the frames side by side (shorter columns are padded with NaN)
new_df = pd.concat([pd.DataFrame(df[col].unique(), columns=[col]) for col in df.columns], axis=1)
print(new_df)


   Mass     Length  Material  Special Mark  Special Num  Breaking  \
0    4.0   5.500000     Wood            A         20.0      Yes   
1   12.0   2.600000    Steel          NaN          NaN       No   
2    1.0   3.500000   Rubber            B          5.5      NaN   
3   15.0   6.500000  Plastic            X          6.6      NaN   
4    6.0  12.000000      NaN          NaN          5.6      NaN   
5   14.0   2.500000      NaN          NaN          6.3      NaN   
6    2.0  15.000000      NaN          NaN          NaN      NaN   
7    8.0   2.000000      NaN          NaN          NaN      NaN   
8    7.0  10.000000      NaN          NaN          NaN      NaN   
9    9.0   2.200000      NaN          NaN          NaN      NaN   
10  11.0   4.333333      NaN          NaN          NaN      NaN   
11  13.0   4.666667      NaN          NaN          NaN      NaN   
12   NaN   3.750000      NaN          NaN          NaN      NaN   
13   NaN   1.666667      NaN          NaN          NaN      NaN   

                  Comment  
0        There is no heat  
1                     NaN  
2       Contains moisture  
3   Hit the table instead  
4          A sign of wind  
5                     NaN  
6                     NaN  
7                     NaN  
8                     NaN  
9                     NaN  
10                    NaN  
11                    NaN  
12                    NaN  
13                    NaN  
– anky

If the order of values within each column is not important, convert each column to a set to remove duplicates, then back to a Series, and join them together with concat:

# to_dict('l') is shorthand for orient='list' (the one-letter form is gone in pandas 2.x)
df1 = pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('l').items()}, axis=1)

If order is important:

df1 = pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)
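Either way, the result can then be written out just as in the question (df1 comes from the snippets above, the file name from the question):

df1.to_csv('Surveydata_unique.csv', index=False)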

Performance with 20 rows x 2,000 columns, values drawn from 1,000 possibilities:

import numpy as np

np.random.seed(2019)

# 20 rows x 2,000 string-typed columns
df = pd.DataFrame(np.random.randint(1000, size=(20, 2000))).astype(str)


In [151]: %timeit df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
1.07 s ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [152]: %timeit pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('l').items()}, axis=1)
323 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [153]: %timeit pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)
430 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Performance with 20 rows x 2,000 columns, values drawn from 100 possibilities:

df = pd.DataFrame(np.random.randint(100, size=(20, 2000))).astype(str)

In [155]: %timeit df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
1.3 s ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [156]: %timeit pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('l').items()}, axis=1)
544 ms ± 3.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [157]: %timeit pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)
654 ms ± 3.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
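The set-based variant is likely faster because df.to_dict('l') hands each column to set() as a plain Python list, skipping the per-column drop_duplicates/reset_index machinery that the apply version pays for on every one of the 2,000 columns; hashing a short Python list is comparatively cheap.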
– jezrael