0

I have a file with ranked information:

df1

Type                 Rank
frameshift              1
stop_gained             2
stop_lost               3
splice_region_variant   4
splice_acceptor_variant 5
splice_donor_variant    6
missense_variant        7
coding_sequence_variant 8
intron_variant          9
NMD_transcript_variant  10
non_coding              11

and another file containing values separated with ,

df2['Variants']

A|intron_variant&NMD_transcript_variant|MODIFIER|23||,A|intron_variant&non_coding|MODIFIER|||,A|intron_variant&non_coding|MODIFIER|||
G|missense_variant&splice_region_variant|HIGH|85||,A|intron_variant&non_coding|MODIFIER|||,A|intron_variant&non_coding|MODIFIER|||
G|missense_variant|MODERATE|23||,G|frameshift&intron_variant|HIGH|||,G|intron_variant&non_coding|MODIFIER|||,G|frameshift&missense_variant|HIGH|42||
G|missense_variant|MODERATE|23||,G|intron_variant|MODIFIER|||,G|intron_variant&non_coding|MODIFIER|||,G|stop_gained&splice_region_variant|HIGH|||
G|missense_variant|MODERATE|23||
G|missense_variant&stop_lost|HIGH|12||

I want to extract the data from df2['Variants'] based on the rank order mentioned in the df1. some complication in the data are sometimes they are given combinedly with & as frameshift&intron_variant. In such cases, I want to split the data by & consider it by their rank. Likewise, I want to extract the values from the data as:

Extracted                                                   Ranked
A|intron_variant&NMD_transcript_variant|MODIFIER|23||       intron_variant
G|missense_variant&splice_region_variant|HIGH|85||          splice_region_variant
G|frameshift&intron_variant|HIGH|||                         frameshift
G|stop_gained&splice_region_variant|HIGH|||                 stop_gained
G|missense_variant|MODERATE|23||                            missense_variant
G|missense_variant&stop_lost|HIGH|12||                      stop_lost

I was able to split the files using & using the code given here. But unable to extract high ranked variants given in df1 from multiple values separated by , comma.

Thanks

2 Answers2

0

You can change list comprehension for first split by | (or ,) and then by &:

d = df1.set_index('Type')['Rank'].to_dict()
max1 = df1['Rank'].max()+1    

def f(x):
    d1 = {z: d.get(z, max1) for y in x for y in x.split('|') for z in y.split('&')}
    #https://stackoverflow.com/a/280156/2901002
    return min(d1, key=d1.get)

df2['Anno_prio'] = df2['Variants'].apply(f)
print (df2)
                                            Variants              Anno_prio
0  A|intron_variant&NMD_transcript_variant|MODIFI...         intron_variant
1  G|missense_variant&splice_region_variant|HIGH|...  splice_region_variant
2  G|missense_variant|MODERATE|23||,G|frameshift&...             frameshift
3  G|missense_variant|MODERATE|23||,G|intron_vari...            stop_gained
4                   G|missense_variant|MODERATE|23||       missense_variant
5             G|missense_variant&stop_lost|HIGH|12||              stop_lost

EDIT:

For get also values splitted by , I prefer pandas solution with spliting first by , and then & or |:

d = df1.set_index('Type')['Rank'].to_dict()

df = (df2.assign(Extracted = df2['Variants'].str.split(','))
        .explode('Extracted')
        .assign(Ranked = lambda x: x['Extracted'].str.split('&|\|'))
        .explode('Ranked')
        .assign(Rank = lambda x: x['Ranked'].map(d))
        .sort_values('Rank')
        )
df = df[~df.index.duplicated()].sort_index()
print (df)
                         Variants     Extracted Ranked  Rank
0  K|a&b|MOD|,K|d|LOW|,K|a&e|MOD|    K|a&b|MOD|      a   1.0
1       J|c&d&a|MOD|,J|b&c&d|MOD|  J|c&d&a|MOD|      a   1.0
2                H|b&c|HIGH|,H|b|   H|b&c|HIGH|      b   2.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • thanks for the answer, the problem here is, I want only one value in `Variants`, but here it extracts all values after comma. I need to keep only the value which contains high ranked variants.. thanks –  Mar 31 '21 at 10:53
  • @svp - Is possible change data sample for see problem? Thank you. – jezrael Mar 31 '21 at 10:54
0

@jezrael please see the data

#df1
Type    Rank
a       1
b       2
c       3
d       4
e       5

#df2['Variants']

K|a&b|MOD|,K|d|LOW|,K|a&e|MOD|
J|c&d&a|MOD|,J|b&c&d|MOD|
H|b&c|HIGH|,H|b|

#output
Extracted                   Ranked
K|a&b|MOD|                  a
J|c&d&a|MOD|                a
H|b&c|HIGH|                 b
  • 1
    No, it extracts all values. I need only one value which is considered for ranking variants. In first case there are three values `K|a&b|MOD|,K|d|LOW|,K|a&e|MOD|` from that I need only `K|a&b|MOD|` where it has `a` with high rank order.. thanks –  Mar 31 '21 at 11:16
  • will splitting by `,` help? –  Mar 31 '21 at 11:34
  • It seems more complicated like first seems, working for it. – jezrael Mar 31 '21 at 11:35
  • one thing - here is better only pandas solution, because need for each key like `K|a&b|MOD|` parse mutiple values `a,b` and multiple `Rank` - `1,2`. there is some reason why prefer this solution? – jezrael Mar 31 '21 at 11:52