I'm trying to drop duplicates based on one of the columns, but some of the duplicate rows are not dropped.

The weird thing is: if I read the 2 files directly instead of through my func1 and func2, and then apply the drop function, everything is fine!

Update 1: it is highly likely a Unicode problem (thanks to furkanayd). How do I solve it?

Can anyone help? Thanks! Here is my code:

import pandas as pd

def func1(file):
    try:
        df1 = pd.read_csv('balba', encoding='utf8', low_memory=False)
    except UnicodeDecodeError:
        df1 = pd.read_csv('balba', encoding='gb18030', low_memory=False)
    """select the col_name, then replace ' ' with ''
    """
    return df1

def func2(file):
    df2 = pd.read_csv('balba')
    """select the col_name, then replace ' ' with '', then rename the column name
    """
    return df2


df2 = func2(file_df2)

DF1 = []
for i in ['one_file_this_time']:
    d = func1(i)
    DF1.append(d)
df1 = pd.concat(DF1, sort=False)  # DF1 is already a list of DataFrames
df1.drop_duplicates(inplace=True)

df = pd.concat([df1, df2], sort=False)
print(df.shape)
# (7749, 2)

df.drop_duplicates(subset='col_name', inplace=True)
print(df.shape)
print(df.duplicated().any())
# (5082, 2)
# False
# drop_duplicates() obviously works, but not fully: some duplicates remain

Before the drop function, the concatenated data (I saved it in CSV format) looks like this: [screenshot not preserved]

After the drop function: [screenshot not preserved]

Sean.H

1 Answer

You should consider passing the keep argument to your drop_duplicates call, as described in the drop_duplicates documentation.

Right now your code relies on the default behaviour:

keep='first': Drop duplicates except for the first occurrence.
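To make the effect of keep concrete, here is a minimal sketch on made-up data. The toy frame and its "other" column are hypothetical; only "col_name" mirrors the question.

```python
import pandas as pd

# Hypothetical toy data; "col_name" mirrors the column used in the question.
toy = pd.DataFrame(
    {"col_name": ["a", "b", "a", "c", "b"], "other": [1, 2, 3, 4, 5]}
)

# keep="first" (the default): keep the first row of each duplicate group.
first = toy.drop_duplicates(subset="col_name", keep="first")
# keep="last": keep the last row of each duplicate group.
last = toy.drop_duplicates(subset="col_name", keep="last")
# keep=False: drop every row that has a duplicate anywhere.
none = toy.drop_duplicates(subset="col_name", keep=False)

print(first["other"].tolist())  # [1, 2, 4]
print(last["other"].tolist())   # [3, 4, 5]
print(none["other"].tolist())   # [4]
```

Note that none of these settings helps if the values only look identical but actually differ at the character level.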

Merging without duplicates may also help you. This question is very similar to "Pandas merge creates unwanted duplicate entries".

furkanayd
  • I still get the duplicated rows after `keep='first'` was added. – Sean.H Dec 02 '19 at 09:28
  • @Sean.H Are you sure that the values you consider duplicates are really identical? Have you checked them with comparison operators? – furkanayd Dec 02 '19 at 10:07
  • Yes, I'm sure. I made 2 test data files. 1) If I get df1 and df2 through my functions (please see them above), then `concat`, and finally `drop_duplicates()`, I get the problematic result. However, 2) if I read the 2 test files directly instead of through my functions, there is no problem. Alas, I can't figure out why. – Sean.H Dec 02 '19 at 10:13
  • As far as I understand from the code and your posts, the string encoding may be causing this. Handling Unicode explicitly in read_csv may help, as explained here: https://stackoverflow.com/questions/904041/reading-a-utf8-csv-file-with-python – furkanayd Dec 02 '19 at 10:20
  • You may also want to check for remaining duplicates after calling drop_duplicates by using ```df.duplicated().any()```: if the result is True, the problem is related to pandas usage; otherwise it is most probably a string Unicode issue. – furkanayd Dec 02 '19 at 10:22
  • Thanks. `df.duplicated().any()` returns `False`. So the question is: how do I solve this highly likely Unicode problem, e.g. one file with unicode_A and another with unicode_B? – Sean.H Dec 02 '19 at 10:55
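One possible way to approach the Unicode question raised in the last comment, offered as a sketch rather than a confirmed fix: if the two files contain visually identical strings that differ in code points (for example full-width characters or an ideographic space, which can appear in GB18030-encoded data), pandas treats them as distinct values. The helper normalize_key and the sample values below are hypothetical. The sketch assumes the intent is the same as in the question's functions, namely removing all spaces from col_name. Printing repr of the values is a quick way to see hidden differences.

```python
import unicodedata

import pandas as pd


def normalize_key(value):
    """Return a canonical form of a cell value for duplicate detection.

    Hypothetical helper: applies NFKC normalization (folds full-width
    characters and special spaces to their plain equivalents), then
    removes all whitespace. Non-string values are returned unchanged.
    """
    if not isinstance(value, str):
        return value
    text = unicodedata.normalize("NFKC", value)
    return "".join(text.split())


# Made-up values that look alike but are different strings.
df = pd.DataFrame(
    {
        "col_name": [
            "abc",
            "abc\u3000",           # trailing ideographic space
            "\uff41\uff42\uff43",  # full-width "abc"
            "a bc",                # inner ASCII space
            "xyz",
        ]
    }
)

# Check duplicates on the same column that drop_duplicates uses.
before_any = df.duplicated(subset="col_name").any()
print(before_any)  # False: pandas sees five distinct strings

# repr() exposes the hidden characters.
print(df["col_name"].map(repr).tolist())

df["col_name"] = df["col_name"].map(normalize_key)
deduped = df.drop_duplicates(subset="col_name")
print(deduped["col_name"].tolist())  # ['abc', 'xyz']
```

Also note that `df.duplicated().any()` without subset compares whole rows, so it can return False even while col_name still contains repeated values; passing subset='col_name' makes the check match the drop.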