I have a df as follows:
CHROM POS SRR4216489 SRR4216675 SRR4216480
0 1 127536 ./. ./. ./.
1 1 127573 ./. 0/1:0,5:5:0:112,1,10 ./.
2 1 135032 ./. 1/1:13,0:13:3240:0,30,361 0/0:13,0:13:3240:0,30,361
3 1 135208 ./. 0/0:5,0:5:3240:0,20,160 0/1:5,0:5:3240:0,20,160
4 1 138558 1/1:5,0:5:3240:0,29,177 0/0:0,5:5:0:112,1,10 ./.
I would like to replace the contents of the sample columns depending on certain conditions. The sample columns are SRR4216489, SRR4216675 and SRR4216480. I am looking to replace './.' with 0.5, anything starting with 0/0 with 0.0, and anything starting with 0/1 or 1/1 with 1.0. I appreciate that this involves several steps, most of which I can do independently, but I don't know the syntax to tie them together. For example, I could do this for sample SRR4216675:
df['SRR4216675'][df.SRR4216675 == './.'] = 0.5
This works well, courtesy of here, but I'm not sure how to apply it to all of the sample columns simultaneously. I thought of using a loop:
sample_cols = df.columns[2:]
for s in sample_cols:
    df[s][df.s == './.'] = 0.5
but firstly this doesn't seem very idiomatic for pandas, and secondly it doesn't accept the string from the list at 'df.s' anyway.
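For reference, the loop appears to fail at `df.s` because attribute access looks for a column literally named `s`, whereas bracket indexing uses the string stored in the variable. A minimal sketch of the loop with that change, assuming a toy frame built from the first rows of the example (the `dtype=object` argument is an assumption made here so the columns can hold both strings and floats):

```python
import pandas as pd

# Toy frame standing in for the real data; values copied from the example table.
# dtype=object lets a column hold a mix of strings and floats.
df = pd.DataFrame(
    {
        "CHROM": [1, 1, 1],
        "POS": [127536, 127573, 135032],
        "SRR4216489": ["./.", "./.", "./."],
        "SRR4216675": ["./.", "0/1:0,5:5:0:112,1,10", "1/1:13,0:13:3240:0,30,361"],
        "SRR4216480": ["./.", "./.", "0/0:13,0:13:3240:0,30,361"],
    },
    dtype=object,
)

sample_cols = df.columns[2:]
for s in sample_cols:
    # df[s] looks the column up by the string held in s.
    # .loc with a row mask and a column label assigns in a single step,
    # which avoids the chained indexing of df[s][mask] = value.
    df.loc[df[s] == "./.", s] = 0.5
```

This only handles the './.' case; the other genotype strings are left untouched.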
The next challenge is how to parse the variable strings that populate the other parts of the sample columns. I have tried using the split function:
df=df['SRR4216675'][df.SRR4216675.split(':') == '0/0' ] = 0.0
but I get:
TypeError: 'float' object is not subscriptable
I am sure that a good way to solve this would be to use a lambda such as this, but being new to pandas and lambdas I'm finding it tricky. I got to here:
col=df['SRR4216675'][df.SRR4216675.apply(lambda x: x.split(':')[0])]
which looks like it's almost there, but it needs further processing to replace the values. It also looks like it has 2 columns, and it won't let me reintegrate it into the existing df:
SRR4216675
./. NaN
0/1 NaN
1/1 NaN
0/0 NaN
0/0 NaN
df['SRR4216675'] = col
ValueError: cannot reindex from a duplicate axis
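The NaN output above is probably the result of the split strings being used as index labels to look up rather than as new values. One possible way to get the genotype field and convert it, sketched here for a single column, is the `.str` accessor followed by `.map`. The names `genotype_to_score`, `genotype` and `scores` are made up for this illustration, and the mapping simply restates the rules given earlier:

```python
import pandas as pd

# One sample column, values copied from the example table.
col = pd.Series(
    [
        "./.",
        "0/1:0,5:5:0:112,1,10",
        "1/1:13,0:13:3240:0,30,361",
        "0/0:5,0:5:3240:0,20,160",
        "0/0:0,5:5:0:112,1,10",
    ],
    name="SRR4216675",
)

# Hypothetical lookup table restating the replacement rules from the question.
genotype_to_score = {"./.": 0.5, "0/0": 0.0, "0/1": 1.0, "1/1": 1.0}

# .str.split(":") splits every string in the Series;
# .str[0] keeps the part before the first colon (the genotype).
genotype = col.str.split(":").str[0]

# .map looks each genotype up in the dict; genotypes missing from it become NaN.
scores = genotype.map(genotype_to_score)
```

Because `scores` keeps the same index as `col`, it can be assigned back to the frame without the reindexing error.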
I appreciate that this is several problems in one, but I am new to pandas and would really like to get to grips with it. I could solve these problems with plain loops and Python's standard list, iteration and string parsing functions, but at scale this would be really slow, as my full-size df is millions of lines long and contains over 500 sample columns.
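To tie the steps together across all sample columns, one option might be to apply a split-and-map function column by column. This is only a sketch under the assumptions that every sample column comes after the first two columns and that the genotype is always the text before the first colon. `score_column` and `genotype_to_score` are hypothetical names, and the approach has not been benchmarked on a frame of the size described:

```python
import pandas as pd

# Example frame copied from the question.
df = pd.DataFrame(
    {
        "CHROM": [1, 1, 1, 1, 1],
        "POS": [127536, 127573, 135032, 135208, 138558],
        "SRR4216489": ["./.", "./.", "./.", "./.", "1/1:5,0:5:3240:0,29,177"],
        "SRR4216675": [
            "./.",
            "0/1:0,5:5:0:112,1,10",
            "1/1:13,0:13:3240:0,30,361",
            "0/0:5,0:5:3240:0,20,160",
            "0/0:0,5:5:0:112,1,10",
        ],
        "SRR4216480": [
            "./.",
            "./.",
            "0/0:13,0:13:3240:0,30,361",
            "0/1:5,0:5:3240:0,20,160",
            "./.",
        ],
    }
)

sample_cols = df.columns[2:]

# Hypothetical lookup table restating the replacement rules.
genotype_to_score = {"./.": 0.5, "0/0": 0.0, "0/1": 1.0, "1/1": 1.0}


def score_column(col: pd.Series) -> pd.Series:
    """Reduce each entry to its genotype prefix and map it to a score."""
    return col.str.split(":").str[0].map(genotype_to_score)


# DataFrame.apply calls score_column once per sample column,
# and the result replaces the original string columns.
df[sample_cols] = df[sample_cols].apply(score_column)
```

Any genotype not listed in the mapping would come out as NaN, which makes unexpected values easy to spot afterwards.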