python: drop_duplicates(subset='col_name', inplace=True), why can't some of the rows be dropped?

Question

I'm trying to drop duplicates based on one of the columns, but some of the duplicated rows are not dropped.

The weird thing is: if I read the 2 files directly instead of through my func1 and func2, and then apply the drop function, everything is fine!

Update 1: it is most likely a Unicode problem (thanks to furkanayd). How can I solve it?

Can anyone help? Thanks! Here is my code:

import pandas as pd

def func1(file):
    # Try UTF-8 first and fall back to GB18030 if decoding fails.
    try:
        df1 = pd.read_csv('balba', encoding='utf8', low_memory=False)
    except UnicodeDecodeError:
        df1 = pd.read_csv('balba', encoding='gb18030', low_memory=False)
    # select the col_name, then replace ' ' with ''
    return df1

def func2(file):
    df2 = pd.read_csv('balba')
    # select the col_name, then replace ' ' with '', then rename the column name
    return df2


df2 = func2(file_df2)

DF1 = []
for i in ['one_file_this_time']:
    d = func1(i)
    DF1.append(d)
df1 = pd.concat(DF1, sort=False)  # pass the list itself, not a list wrapped in a list
df1.drop_duplicates(inplace=True)

df = pd.concat([df1, df2], sort=False)
print(df.shape)
# (7749, 2)

df.drop_duplicates(subset='col_name', inplace=True)
print(df.shape)
print(df.duplicated().any())
# (5082, 2)
# False
# obviously drop_duplicates() works, but not fully

Before the drop, the concatenated data looks like this (I saved it in CSV format):

After the drop:

It's difficult to portray your problem. Show us some sample of your data. — Yash Ghorpade, Dec 02 '19 at 08:59

score 0 · Answer 1 · answered Dec 02 '19 at 08:58


You should consider adding the keep argument to your drop_duplicates call, as mentioned here.

Right now your code uses the default behaviour:

first : Drop duplicates except for the first occurrence.

Merging without duplicates may help you. This question is very similar to "Pandas merge creates unwanted duplicate entries".
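
To make the effect of keep concrete, here is a minimal sketch on made-up data (the frame and its values are assumptions for illustration, not the asker's files). It compares the three values that keep accepts in pandas: 'first', 'last' and False.

```python
import pandas as pd

# Hypothetical toy data; "col_name" mirrors the column name used in the question.
df = pd.DataFrame({
    "col_name": ["a", "b", "a", "c", "b"],
    "value": [1, 2, 3, 4, 5],
})

# keep='first' (the default): keep the first row of each duplicated key.
kept_first = df.drop_duplicates(subset="col_name", keep="first")

# keep='last': keep the last row of each duplicated key.
kept_last = df.drop_duplicates(subset="col_name", keep="last")

# keep=False: drop every row whose key occurs more than once.
kept_none = df.drop_duplicates(subset="col_name", keep=False)

print(kept_first["value"].tolist())
print(kept_last["value"].tolist())
print(kept_none["value"].tolist())
```

Note that none of these settings helps when two keys only look identical; keep decides which of the rows pandas already considers equal survives.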

answered Dec 02 '19 at 08:58

furkanayd


I still get the duplicated rows after `keep='first'` was added~~ – Sean.H Dec 02 '19 at 09:28
@Sean.H Are you sure the values you consider duplicates are not actually different? You can check by comparing them with logical operators. – furkanayd Dec 02 '19 at 10:07
Yes, I'm sure. I made 2 test data files. 1) If I get df1 & df2 through my functions (please find them above), then `concat`, and finally `drop_duplicates()`, I get the problematic result. However, 2) if I read the 2 test files directly instead of through my functions, there is no problem~~ Alas, I can't figure out why. – Sean.H Dec 02 '19 at 10:13
As far as I understand from the code and your posts, the string type may be causing this. Handling the unicode decoding in read_csv may help, as explained here: https://stackoverflow.com/questions/904041/reading-a-utf8-csv-file-with-python – furkanayd Dec 02 '19 at 10:20
You may also want to call `df.duplicated().any()` after drop_duplicates to check whether duplicates remain: if the result is True, the problem is related to how pandas is used; otherwise it is most probably related to the unicode content of the strings. – furkanayd Dec 02 '19 at 10:22
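
One way to test the unicode hypothesis is to print the surviving values in an escaped form, so that characters which render identically become visible. This is only a sketch on invented strings (a trailing no-break space and a fullwidth letter are assumptions about what could be in the files, not something the thread confirms).

```python
import pandas as pd

# Hypothetical values that look alike on screen but are different strings:
# "abc\u00a0" ends with a no-break space, "\uff41bc" starts with a fullwidth 'a'.
df = pd.DataFrame({"col_name": ["abc", "abc\u00a0", "\uff41bc", "abc"]})

remaining = df.drop_duplicates(subset="col_name")

# ascii() escapes every non-ASCII character, exposing hidden differences.
for value in remaining["col_name"]:
    print(ascii(value), len(value))

# pandas itself sees no duplicates left in the key column.
still_duplicated = remaining.duplicated(subset="col_name").any()
print(still_duplicated)
```

If the escaped output shows sequences such as \xa0 or \uff41 in rows that looked identical, the rows were never equal as far as pandas is concerned.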

thx. `df.duplicated().any()` returns `False`. So the question is, how to solve this highly likely unicode problem? e.g. one file with unicode_A, another unicode_B? – Sean.H Dec 02 '19 at 10:55
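
If the cause really is unicode variants of the same text, one possible approach is to normalize the key column before dropping duplicates. The sketch below is an assumption-laden illustration: normalize_key is a hypothetical helper, the sample values are invented, and NFKC normalization plus whitespace removal is just one reasonable choice (it folds fullwidth letters and no-break spaces into their plain forms, and matches the "replace ' ' with ''" step described in the question).

```python
import unicodedata

import pandas as pd


def normalize_key(value):
    """Hypothetical helper: fold compatibility characters and remove whitespace."""
    if not isinstance(value, str):
        return value  # leave NaN and non-string values untouched
    text = unicodedata.normalize("NFKC", value)
    # str.split() with no arguments splits on any Unicode whitespace.
    return "".join(text.split())


# Invented sample: four spellings that look like "abc", plus one distinct key.
df = pd.DataFrame({"col_name": ["abc", "abc\u00a0", "\uff41bc", "a bc", "xyz"]})

df["col_name"] = df["col_name"].map(normalize_key)
deduped = df.drop_duplicates(subset="col_name")

print(deduped["col_name"].tolist())
```

Applying the same normalization inside both loading functions, after read_csv, would make the keys comparable regardless of which file they came from. Whether NFKC is appropriate depends on the data, since it also changes characters such as superscripts and ligatures.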
