Given a dataframe, I want to find the duplicated index labels whose rows do not have identical values in the columns, and see which values differ.
Specifically, I have this dataframe:
import pandas as pd
# Download the file first (shell command; the leading ! runs it from IPython)
!wget https://www.dropbox.com/s/vmimze2g4lt4ud3/alt_exon_repeatmasker_intersect.bed
alt_exon_repeatmasker = pd.read_table('alt_exon_repeatmasker_intersect.bed', header=None, index_col=3)
In [74]: alt_exon_repeatmasker.index.is_unique
Out[74]: False
Some of the indexes have duplicate values in the 9th column (the type of DNA repetitive element at this location), and I want to know which different types of repetitive elements occur at each individual location (each index = a genome location).
I'm guessing this will require some kind of groupby, and hopefully some groupby ninja can help me out.
To simplify even further, suppose we only have the index and the repeat type:
genome_location1 MIR3
genome_location1 AluJb
genome_location2 Tigger1
genome_location3 AT_rich
The output I'd like to see is all duplicated indexes and their repeat types, like so:
genome_location1 MIR3
genome_location1 AluJb
EDIT: added toy example
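To make the toy example reproducible, here is a rough sketch of it in pandas, along with one possible way to get the output I described. The column name `repeat_type` is a placeholder I made up for this example (the real file has no header), and I'm assuming a pandas version where `Index.duplicated` accepts `keep=False`. I don't know if this is the idiomatic way, so treat it as a starting point rather than a solution.

```python
import pandas as pd

# Toy data from the example above; "repeat_type" is a made-up column name.
toy = pd.DataFrame(
    {"repeat_type": ["MIR3", "AluJb", "Tigger1", "AT_rich"]},
    index=["genome_location1", "genome_location1",
           "genome_location2", "genome_location3"],
)

# keep=False flags every row whose index label occurs more than once,
# not just the second and later occurrences.
dupes = toy[toy.index.duplicated(keep=False)]

# Among those rows, count the distinct repeat types per location and keep
# only locations with more than one distinct type.
n_types = dupes.groupby(level=0)["repeat_type"].transform("nunique")
differing = dupes[(n_types > 1).to_numpy()]

print(differing)
```

On the toy data this keeps only the two `genome_location1` rows. If a location appeared twice with the same repeat type, the `nunique` filter would drop it, which is what I mean by duplicated indexes that do not have duplicate values.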