pandas replace contents of multiple columns at a time for multiple conditions

Question

I have a df as follows:

    CHROM     POS   SRR4216489              SRR4216675                  SRR4216480
0     1  127536     ./.                     ./.                         ./. 
1     1  127573     ./.                     0/1:0,5:5:0:112,1,10        ./.
2     1  135032     ./.                     1/1:13,0:13:3240:0,30,361   0/0:13,0:13:3240:0,30,361
3     1  135208     ./.                     0/0:5,0:5:3240:0,20,160     0/1:5,0:5:3240:0,20,160
4     1  138558     1/1:5,0:5:3240:0,29,177 0/0:0,5:5:0:112,1,10        ./.

I would like to replace the contents of the sample columns depending on certain conditions. The sample columns are SRR4216489, SRR4216675, SRR4216480. I am looking to replace './.' with 0.5, anything starting with 0/0 with 0.0, and anything starting with 0/1 or 1/1 with 1.0. I appreciate that this involves several steps, most of which I can do independently, but I don't know the syntax to tie them together. For example, I could do this for sample SRR4216675:

df['SRR4216675'][df.SRR4216675 == './.'] = 0.5

This works well, but I'm not sure how to apply it to all of the sample columns simultaneously. I thought to use a loop:

sample_cols = df.columns[2:]
for s in sample_cols:
    df[s][df.s =='./.'] = 0.5

but firstly this doesn't seem very idiomatic for pandas, and it also doesn't accept the string from the list at 'df.s' anyway.

The next challenge is how to parse the variable strings that populate the other parts of the sample columns. I have tried using the split function:

df=df['SRR4216675'][df.SRR4216675.split(':') == '0/0' ] = 0.0

but I get:

TypeError: 'float' object is not subscriptable

I am sure that a good way to solve this would be using a lambda such as this but being new to pandas and lambdas I'm finding it tricky, I got to here:

col=df['SRR4216675'][df.SRR4216675.apply(lambda x: x.split(':')[0])]

which looks like it's almost there, but it needs further processing to replace the value. It also looks like it has 2 columns, and it won't let me reintegrate it into the existing df:

SRR4216675
./.    NaN
0/1    NaN
1/1    NaN
0/0    NaN
0/0    NaN

df['SRR4216675'] = col

ValueError: cannot reindex from a duplicate axis

I appreciate that this is several problems in one, but I am new to pandas and would really like to get to grips with it. I could solve these problems with Python's standard lists, loops and string parsing functions, but at scale this would be really slow: my full-size df is millions of lines long and contains over 500 sample columns.

Look into the various `.str` methods in pandas series and into the `pd.Series.replace()` method. For example: `df.loc[:, ['SRR4216489', 'SRR4216675', 'SRR4216480']].replace("./.", 0.5, inplace=True)` — jkr, Jul 20 '17 at 14:07

LateCoder · Accepted Answer · 2017-07-20T15:35:23.333

You can do this by using df.apply and defining a function, like this:

In [10]: cols = ('SRR4216675', 'SRR4216480', 'SRR4216489')

In [11]: def replace_vals(row):
    ...:     for col in cols:
    ...:         if row[col] == './.':
    ...:             row[col] = 0.5
    ...:         elif row[col].startswith('0/0'):
    ...:             row[col] = 0
    ...:         elif row[col].startswith('0/1') or row[col].startswith('1/1'):
    ...:             row[col] = 1
    ...:     return row
    ...:
    ...:

In [12]: df.apply(replace_vals, axis=1)
Out[12]:
   CHROM     POS  SRR4216480  SRR4216489  SRR4216675
0      1  127536         0.5         0.5         0.5
1      1  127573         0.5         0.5         1.0
2      1  135032         0.0         0.5         1.0
3      1  135208         1.0         0.5         0.0
4      1  138558         0.5         1.0         0.0
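
The function above reads `cols` from the enclosing scope. As a minimal sketch, assuming the same sample data as in the question, the column list could instead be passed in through the `args` parameter of `DataFrame.apply`. The name `replace_vals_in` and the rebuilt `df` are illustrative and not part of the original answer.

```python
import pandas as pd

# Rebuild the question's example frame (values copied from the question).
df = pd.DataFrame({
    "CHROM": [1, 1, 1, 1, 1],
    "POS": [127536, 127573, 135032, 135208, 138558],
    "SRR4216489": ["./.", "./.", "./.", "./.", "1/1:5,0:5:3240:0,29,177"],
    "SRR4216675": ["./.", "0/1:0,5:5:0:112,1,10", "1/1:13,0:13:3240:0,30,361",
                   "0/0:5,0:5:3240:0,20,160", "0/0:0,5:5:0:112,1,10"],
    "SRR4216480": ["./.", "./.", "0/0:13,0:13:3240:0,30,361",
                   "0/1:5,0:5:3240:0,20,160", "./."],
})

def replace_vals_in(row, cols):
    """Hypothetical variant of replace_vals that takes the columns as an argument."""
    row = row.copy()  # work on a copy so the original frame is left untouched
    for col in cols:
        value = row[col]
        if value == "./.":
            row[col] = 0.5
        elif value.startswith("0/0"):
            row[col] = 0.0
        elif value.startswith(("0/1", "1/1")):  # str.startswith accepts a tuple
            row[col] = 1.0
    return row

# Everything after CHROM and POS is treated as a sample column.
sample_cols = list(df.columns[2:])
result = df.apply(replace_vals_in, axis=1, args=(sample_cols,))
```

Extra positional arguments listed in `args` are forwarded to the function after the row itself, so the same function can be reused for any set of sample columns.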

And here's a faster way to do this:

First, let's create a larger data frame so that we can meaningfully measure differences in time, and let's import a timer so that we can measure.

In [70]: from timeit import default_timer as timer

In [71]: long_df = pd.DataFrame()

In [72]: for i in range(10000):
    ...:     long_df = pd.concat([long_df, df])
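
Growing `long_df` with one `pd.concat` call per iteration copies the accumulated rows each time. A possible alternative, sketched here with a small stand-in frame (the `df` below is illustrative only), is to concatenate all the pieces in a single call:

```python
import pandas as pd

# Stand-in for the question's frame; only the repetition pattern matters here.
df = pd.DataFrame({
    "CHROM": [1, 1],
    "POS": [127536, 127573],
    "SRR4216675": ["./.", "0/1:0,5:5:0:112,1,10"],
})

# One concat over 10000 references to df instead of 10000 incremental concats.
long_df = pd.concat([df] * 10000)
```

As with the loop, the original index values repeat in the result.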

Using the function we defined above, we get:

In [76]: start = timer(); long_df.apply(replace_vals, axis=1); end = timer()

In [77]: end - start
Out[77]: 8.662535898998613

Now, we define a new function (for the purposes of timing easily) where we loop over the columns and apply the same replacement logic as above, except we do it by using the vectorized str.startswith method on each column:

In [78]: def modify_vectorized():
    ...:     start = timer()
    ...:     for col in cols:
    ...:         long_df.loc[long_df[col] == './.', col] = 0.5
    ...:         long_df.loc[long_df[col].str.startswith('0/0', na=False), col] = 0
    ...:         long_df.loc[long_df[col].str.startswith('0/1', na=False), col] = 1
    ...:         long_df.loc[long_df[col].str.startswith('1/1', na=False), col] = 1
    ...:     end = timer()
    ...:     return end - start

We recreate the large dataframe and we run the new function on it, getting a significant speedup:

In [79]: long_df = pd.DataFrame()

In [80]: for i in range(10000):
    ...:     long_df = pd.concat([long_df, df])
    ...:

In [81]: time_elapsed = modify_vectorized()

In [82]: time_elapsed
Out[82]: 0.44004046998452395

The resulting dataframe looks like this:

In [83]: long_df
Out[83]:
    CHROM     POS SRR4216480 SRR4216489 SRR4216675
0       1  127536        0.5        0.5        0.5
1       1  127573        0.5        0.5          1
2       1  135032          0        0.5          1
3       1  135208          1        0.5          0
4       1  138558        0.5          1          0
0       1  127536        0.5        0.5        0.5
1       1  127573        0.5        0.5          1
2       1  135032          0        0.5          1
3       1  135208          1        0.5          0
4       1  138558        0.5          1          0
0       1  127536        0.5        0.5        0.5
1       1  127573        0.5        0.5          1
2       1  135032          0        0.5          1
3       1  135208          1        0.5          0
4       1  138558        0.5          1          0
0       1  127536        0.5        0.5        0.5
...
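
Another option, not part of the timings above, is to cut each entry down to its genotype prefix and look that prefix up in a dictionary. This is a sketch under the assumption that every entry is a string whose first `:`-separated field is the genotype; the name `genotype_score` is illustrative.

```python
import pandas as pd

# Rebuild the question's example frame (values copied from the question).
df = pd.DataFrame({
    "CHROM": [1, 1, 1, 1, 1],
    "POS": [127536, 127573, 135032, 135208, 138558],
    "SRR4216489": ["./.", "./.", "./.", "./.", "1/1:5,0:5:3240:0,29,177"],
    "SRR4216675": ["./.", "0/1:0,5:5:0:112,1,10", "1/1:13,0:13:3240:0,30,361",
                   "0/0:5,0:5:3240:0,20,160", "0/0:0,5:5:0:112,1,10"],
    "SRR4216480": ["./.", "./.", "0/0:13,0:13:3240:0,30,361",
                   "0/1:5,0:5:3240:0,20,160", "./."],
})

# Genotype prefix -> replacement value, following the rules in the question.
genotype_score = {"./.": 0.5, "0/0": 0.0, "0/1": 1.0, "1/1": 1.0}

# Everything after CHROM and POS is treated as a sample column.
sample_cols = list(df.columns[2:])

for col in sample_cols:
    # Keep the text before the first ':' and map it to its score.
    genotype = df[col].str.split(":").str[0]
    df[col] = genotype.map(genotype_score)
```

Any prefix missing from the dictionary comes out as NaN, so unexpected genotype calls can be found afterwards with `isna()`.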

Thanks for this. How could I automatically feed in the columns, though? If I add a columns argument to the function and supply it when calling the function, it doesn't work. — user3062260, Jul 20 '17 at 14:23
Updated my answer to make the columns that you want to modify more generic. — LateCoder, Jul 20 '17 at 14:27
Thank you very much! This solution works. Any further suggestions to speed it up would be great, though not essential. It took about 5 minutes to run over 1 chromosome, but I can live with that (there are 24 chromosomes to run it over). Thanks again! — user3062260, Jul 20 '17 at 14:41
