
For starters, I think these two questions are on the right track, but neither exactly hits what I want:

Pandas : remove SOME duplicate values based on conditions

How to conditionally remove duplicates from a pandas dataframe

I have a very large DataFrame made up of tickets. Each ticket has a few types of text fields. In some tickets, two different types of text field will contain the same text; in that case, I want to keep only the DESCRIPTION type. A sample DataFrame is as follows:

TICKETID    TYPE    TEXT
123 PROBLEMCODE I want to use description for this item because it is a duplicate
123 DESCRIPTION I want to use description for this item because it is a duplicate
123 CODE1       Other field
124 PROBLEMCODE I need both here
124 DESCRIPTION Because there are not duplicated
124 CODE1       Other field
125 PROBLEMCODE I need both here
125 DESCRIPTION I do not want to delete the above problem code because TICKETID is different
125 CODE1       This field is not super important but matches data and never know where problems arise

Basically, I want to examine each TICKETID as its own entity, compare the PROBLEMCODE and DESCRIPTION text, and, if they are equal, filter out the PROBLEMCODE row and keep the DESCRIPTION row.

In my mind the pseudocode is:

For a given ticketID:
    if Type = PROBLEMCODE or DESCRIPTION
        if TEXT = TEXT
            DROP PROBLEMCODE
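
As a sketch only, that pseudocode translates almost literally into a (slow) groupby loop; the column names and abbreviated sample data below are taken from the table above:

```python
import pandas as pd

# Abbreviated version of the sample frame from the question
data = pd.DataFrame({
    'TICKETID': [123, 123, 123, 124, 124, 124],
    'TYPE': ['PROBLEMCODE', 'DESCRIPTION', 'CODE1',
             'PROBLEMCODE', 'DESCRIPTION', 'CODE1'],
    'TEXT': ['same text', 'same text', 'other',
             'need both here', 'because not duplicated', 'other'],
})

drop_idx = []
for ticket_id, group in data.groupby('TICKETID'):
    # Pull the two text fields for this ticket
    prob = group.loc[group['TYPE'] == 'PROBLEMCODE', 'TEXT']
    desc = group.loc[group['TYPE'] == 'DESCRIPTION', 'TEXT']
    # If both exist and the text matches, mark the PROBLEMCODE row for removal
    if not prob.empty and not desc.empty and prob.iloc[0] == desc.iloc[0]:
        drop_idx.extend(prob.index)

result = data.drop(drop_idx)
print(result)
```

This is only to pin down the intent; the vectorized approaches below avoid the Python-level loop.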

Obviously, looping through a DataFrame this way is not efficient. Pandas has plenty of methods for this, referenced in the questions posted earlier; I am just having trouble figuring out which combination of methods and assignments will accomplish it. I have tried:

# to create a dup row
data['Dup'] = data.duplicated(subset=['TEXT'])
# Then groupby ticket?
data.groupby(['TICKETID'])

# Somehow compare True and False, but I can only do that in index order (down the frame).

# Looking at the other questions, I am 99% sure there should be a one- or two-liner,
# something like this, that can accomplish it:

dataTest = data.loc[data.groupby(['TICKETID']) & (data['TYPE'] =='PROBLEMCODE' | 'DESCRIPTION')].duplicated(subset=['TEXT'])

# Then filter based on true false
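
For what it's worth, a vectorized two-step sketch along those lines (column names assumed from the sample above; `duplicated()` is restricted to the two relevant types first so a matching CODE1 row cannot trigger a drop):

```python
import pandas as pd

# Abbreviated version of the sample frame from the question
data = pd.DataFrame({
    'TICKETID': [123, 123, 123, 124, 124, 124],
    'TYPE': ['PROBLEMCODE', 'DESCRIPTION', 'CODE1',
             'PROBLEMCODE', 'DESCRIPTION', 'CODE1'],
    'TEXT': ['same text', 'same text', 'other',
             'need both here', 'because not duplicated', 'other'],
})

# Only compare PROBLEMCODE/DESCRIPTION rows against each other
pair = data[data['TYPE'].isin(['PROBLEMCODE', 'DESCRIPTION'])]
# keep=False flags both members of each (TICKETID, TEXT) duplicate pair
dup = pair.duplicated(subset=['TICKETID', 'TEXT'], keep=False)
# Drop only the PROBLEMCODE half of each duplicate pair
to_drop = pair.index[dup & (pair['TYPE'] == 'PROBLEMCODE')]
dataTest = data.drop(to_drop)
print(dataTest)
```

Grouping by TICKETID is not needed here because TICKETID is part of the `duplicated()` subset, so text is only compared within the same ticket.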

My expected output for the sample case would delete only the TICKETID=123 PROBLEMCODE row, as follows:

TICKETID    TYPE    TEXT
123 DESCRIPTION I want to use description for this item because it is a duplicate
123 CODE1       Other field
124 PROBLEMCODE I need both here
124 DESCRIPTION Because there are not duplicated
124 CODE1       Other field
125 PROBLEMCODE I need both here
125 DESCRIPTION I do not want to delete the above problem code because TICKETID is different
125 CODE1       This field is not super important but matches data and never know where problems arise

Let me know if you need more info

Sam Dean

1 Answer

    df = pd.DataFrame(
        {
            'ticket':[123,123,123,124,124,124],
            'type':['PROBLEMCODE','DESCRIPTION','code1','PROBLEMCODE','DESCRIPTION','code1'],
            'text':[' I want to use description fo',' I want to use description fo','other',
             'another str','second one','other'],
    
        }
    )
    print(df)
       ticket         type                           text
    0     123  PROBLEMCODE   I want to use description fo
    1     123  DESCRIPTION   I want to use description fo
    2     123        code1                          other
    3     124  PROBLEMCODE                    another str
    4     124  DESCRIPTION                     second one
    5     124        code1                          other
    
    # duplicates holds every duplicated (ticket, text) row whose type is DESCRIPTION or PROBLEMCODE
    duplicates = df[
        (df.type.isin(['DESCRIPTION','PROBLEMCODE'])) &
        (df.duplicated(subset=['ticket','text'],keep=False))
    ]
    
    print(duplicates)
       ticket         type                           text
    0     123  PROBLEMCODE   I want to use description fo
    1     123  DESCRIPTION   I want to use description fo
    
    # remove the duplicates from the main df (dropping by index is fast)
    df = df.drop(duplicates.index.tolist())
    print(df)

    # now concat the DESCRIPTION half of the duplicates back onto df
    # (the PROBLEMCODE half stays dropped)
    result = pd.concat([
        duplicates[duplicates.type == 'DESCRIPTION'], df
    ]).sort_values(by='ticket').reset_index(drop=True)
    print(result)
       ticket         type                           text
    0     123  DESCRIPTION   I want to use description fo
    1     123        code1                          other
    2     124  PROBLEMCODE                    another str
    3     124  DESCRIPTION                     second one
    4     124        code1                          other

With the solution above, you will receive output with no duplicates across DESCRIPTION and PROBLEMCODE when ticket and text are the same.
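
An alternative one-chain sketch over the same sample df: since `'DESCRIPTION'` happens to sort before `'PROBLEMCODE'`, a stable sort by type followed by `drop_duplicates(keep='first')` keeps the DESCRIPTION row of each duplicate pair. Note this relies on the alphabetical ordering of the type labels and would also collapse any other pair of types sharing the same ticket and text:

```python
import pandas as pd

df = pd.DataFrame({
    'ticket': [123, 123, 123, 124, 124, 124],
    'type': ['PROBLEMCODE', 'DESCRIPTION', 'code1',
             'PROBLEMCODE', 'DESCRIPTION', 'code1'],
    'text': [' I want to use description fo', ' I want to use description fo',
             'other', 'another str', 'second one', 'other'],
})

# 'DESCRIPTION' < 'PROBLEMCODE' alphabetically, so after a stable sort
# keep='first' retains the DESCRIPTION row of each (ticket, text) duplicate
result = (df.sort_values('type', kind='mergesort')
            .drop_duplicates(subset=['ticket', 'text'], keep='first')
            .sort_index()
            .reset_index(drop=True))
print(result)
```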

sygneto