How to extract the data from a column with priority order given in another file?

Question

I have two dataframes

df1

IMPACT     Rank
HIGH        1
MODERATE    2
LOW         3
MODIFIER    4

df2['Annotation']

Annotation
A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||

A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|HIGH|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||

A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|LOW|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||

There are multiple annotation in separated by , (comma), I want to consider only one annotation from the dataframe based on Rank in the df1. My expected outputwill be:

df['RANKED']

RANKED
A|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999)
A|missense_variant|HIGH|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999)
A|missense_variant|LOW|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999)

I tried following code to generate the output: but did not give me the expected result

d = df1.set_index('IMPACT')['Rank'].to_dict()
max1 = df1['Rank'].max()+1    

def f(x):
    d1 = {y: d.get(y, max1) for y in x for y in x.split(',')}
    return min(d1, key=d1.get)

df2['RANKED'] = df2['Annotation'].apply(f)

Any help appreciated..

I have a column containing values separated by `,`. I want to extract only the high ranked value from that column based on the rank details given in the another file — , Mar 31 '21 at 08:20
Problem is you assign ouput to second DataFrame, need assign to `df` like `df['RANKED'] = df['Annotation'].apply(f)` — jezrael, Mar 31 '21 at 08:24
it was typo error in my script. I used the same `df2['RANKED'] = df2['Annotation'].apply(f)`. But it gives the same data to `df2['Annotation']` in `df2['RANKED']` — , Mar 31 '21 at 09:35
@user288925 Just posted an answer with what I think is what you want. I hope it helps. But next time please follow this [guide on making good pandas questions](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). For instance, people will be much more likely to help you if you post the code that creates the dataframe, and not just the raw table data. — xicocaio, Mar 31 '21 at 22:23

xicocaio · Accepted Answer · 2021-03-31T22:25:33.633

TL;DR

df2['RANKED'] = df2['Annotation'].str.split(',')
df2 = df2.explode(column='RANKED')
df2['IMPACT'] = df["RANKED"].str.findall(r"|".join(df1['IMPACT'])).apply("".join)
df_merge = df2.merge(df1, how='left', on='IMPACT')
df_final = df_merge.loc[df_merge.groupby(['Annotation'])['Rank'].idxmin().sort_values()].drop(columns=['Annotation', 'IMPACT'])

Step-by-step

First you define your dataframes

df1 = pd.DataFrame({'IMPACT':['HIGH', 'MODERATE', 'LOW', 'MODIFIER'], 'Rank':[1,2,3,4]})

df2 = pd.DataFrame({
    'Annotation':[
                  'A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||',
                  'A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|HIGH|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||',
                  'A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|LOW|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||']
    })

Now here is the tricky part. You should create a column with the list of the split by comma string of the original Annotation column. Then you explode this column so you can have the objective values repeated for each original string.

df2['RANKED'] = df2['Annotation'].str.split(',')
df2 = df2.explode(column='RANKED')

Next, you extract the IMPACT word from each RANKED column.

df2['IMPACT'] = df2["RANKED"].str.findall(r"|".join(df1['IMPACT'])).apply("".join)

Then, you merge df1 and df2 to get the rank of each RANKED.

df_merge = df2.merge(df1, how='left', on='IMPACT')

Finally, this is the easy part where you discard everything you do not want in the final dataframe. This can be done via groupby.

df_final = df_merge.loc[df_merge.groupby(['Annotation'])['Rank'].idxmin().sort_values()].drop(columns=['Annotation', 'IMPACT'])

RANKED                                              Rank
A|missense_variant|MODERATE|PERM1|ENSG00000187...   2
A|missense_variant|HIGH|PERM1|ENSG00000187642|...   1
A|missense_variant|LOW|PERM1|ENSG00000187642|T...   3

OR by dropping duplicates

df_final = df_merge.sort_values(['Annotation', 'Rank'], ascending=[False,True]).drop_duplicates(subset=['Annotation']).drop(columns=['Annotation', 'IMPACT'])

How to extract the data from a column with priority order given in another file?

1 Answers1

TL;DR

Step-by-step