I'm trying to drop duplicates based on one of the columns, but only some of the duplicate rows get dropped.
The weird thing is: if I read the 2 files directly instead of through my func1 and func2 and then apply the drop function, everything is fine!
Update 1: it is most likely a Unicode problem (thanks to furkanayd). How can I solve it?
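A minimal sketch of how the Unicode suspicion could be checked, using made-up data rather than the real files. It assumes the leftover duplicates differ only by compatibility characters (full-width letters, no-break or ideographic spaces), which replacing ' ' with '' does not touch. The helper `normalize_key` and the temporary `key` column are hypothetical names introduced for illustration.

```python
import unicodedata

import pandas as pd


def normalize_key(value):
    """Hypothetical helper: NFKC-normalize a string and remove all whitespace."""
    if not isinstance(value, str):
        return value  # leave NaN and other non-string cells untouched
    text = unicodedata.normalize("NFKC", value)
    # str.split() with no arguments splits on any Unicode whitespace
    return "".join(text.split())


# Made-up rows that look alike on screen but are different strings
df = pd.DataFrame({
    "col_name": ["ABC", "ＡＢＣ", "ABC\u00a0", "ABC\u3000", "xyz"],
    "other": [1, 2, 3, 4, 5],
})

# ascii() exposes the hidden characters as escape sequences
print(df["col_name"].map(ascii).tolist())

# Deduplicating on the raw strings keeps every row
raw_deduped = df.drop_duplicates(subset="col_name")
print(len(raw_deduped))  # 5

# Deduplicating on the normalized key collapses the look-alikes
df["key"] = df["col_name"].map(normalize_key)
deduped = df.drop_duplicates(subset="key").drop(columns="key")
print(len(deduped))  # 2
```

If the real data shows escape sequences in the `ascii()` output, that would support the Unicode explanation. If it does not, the cause is probably something else.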
Can anyone help? Thanks! Here is my code:
import pandas as pd


def func1(file):
    try:
        df1 = pd.read_csv('balba', encoding='utf8', low_memory=False)
    except UnicodeDecodeError:
        # fall back to GB18030 when the file is not valid UTF-8
        df1 = pd.read_csv('balba', encoding='gb18030', low_memory=False)
    # select the col_name, then replace ' ' with ''
    return df1


def func2(file):
    df2 = pd.read_csv('balba')
    # select the col_name, then replace ' ' with '', then rename the column name
    return df2


df2 = func2(file_df2)
DF1 = []
for i in ['one_file_this_time']:
    d = func1(i)
    DF1.append(d)
df1 = pd.concat(DF1, sort=False)  # pass the list itself, not [DF1]
df1.drop_duplicates(inplace=True)
df = pd.concat([df1, df2], sort=False)
print(df.shape)
# (7749, 2)
df.drop_duplicates(subset='col_name', inplace=True)
print(df.shape)
print(df.duplicated().any())
# (5082, 2)
# False
# obviously drop_duplicates() works, but not fully
Before the drop function, the concatenated data is (I stored it in CSV format):