For starters, I think these 2 questions are on the correct path, but do not exactly hit what I want.
Pandas : remove SOME duplicate values based on conditions
How to conditionally remove duplicates from a pandas dataframe
I have a very large DataFrame made up of tickets. Each ticket has a few types of text fields. In some tickets, the 2 different types of text fields will have the same text. If this is the case, I use only the DESCRIPTION
type. A sample DataFrame is as follows:
TICKETID TYPE TEXT
123 PROBLEMCODE I want to use description for this item because it is a duplicate
123 DESCRIPTION I want to use description for this item because it is a duplicate
123 CODE1 Other field
124 PROBLEMCODE I need both here
124 DESCRIPTION Because there are not duplicated
124 CODE1 Other field
125 PROBLEMCODE I need both here
125 DESCRIPTION I do not want to delete the above problem code because TICKETID is different
125 CODE1 This field is not super important but matches data and never know where problems arise
Basically, I want to examine each TICKETID
as it's own entity. Compare the PROBLEMCODE
and DESCRIPTION
text; if equal, filter out the PROBLEMCODE
row and keep description.
In my mind the pseudocode is:
For a given ticketID:
if Type = PROBLEMCODE or DESCRIPTION
if TEXT = TEXT
DROP PROBLEMCODE
Obviously, looping through a dataframe this way is not efficient. Pandas has plenty to do this referenced in the questions posted earlier. I am just having trouble figuring out which combination of methods and assignments will accomplish this. I have tried:
# to create a dup row
data['Dup'] = data.duplicated(subset=['TEXT'])
# Then groupby ticket?
data.groupby(['TICKETID'])
# somehow compare true and false, but I can only do that in order of index (down the frame).
# I am 99% percent sure looking at the other questions there should be a one or two liner
# something like this that can accomplish:
dataTest = data.loc[data.groupby(['TICKETID']) & (data['TYPE'] =='PROBLEMCODE' | 'DESCRIPTION')].duplicated(subset=['TEXT'])
# Then filter based on true false
My expected output for the sample case would only delete the TICKET=123 PROBLEMCODE row as follows:
TICKETID TYPE TEXT
123 DESCRIPTION I want to use description for this item because it is a duplicate
123 CODE1 Other field
124 PROBLEMCODE I need both here
124 DESCRIPTION Because there are not duplicated
124 CODE1 Other field
125 PROBLEMCODE I need both here
125 DESCRIPTION I do not want to delete the above problem code because TICKETID is different
125 CODE1 This field is not super important but matches data and never know where problems arise
Let me know if you need more info