1

I have a df:

import pandas as pd
df.head(20)
                          id  ch      start        end strand
0   10:100026072-100029645(+)  10  100026072  100029645      +
1   10:110931880-110932381(+)  10  110931880  110932381      +
2   10:110932431-110933096(+)  10  110932431  110933096      +
3   10:111435307-111439556(-)  10  111435307  111439556      -
4   10:115954439-115964883(-)  10  115954439  115964883      -
5   10:115986231-116018509(-)  10  115986231  116018509      -
6   10:116500106-116500762(-)  10  116500106  116500762      -
7   10:116654355-116657389(-)  10  116654355  116657389      -
8   10:117146840-117147002(-)  10  117146840  117147002      -
9   10:126533798-126533971(-)  10  126533798  126533971      -
10  10:127687390-127688824(+)  10  127687390  127688824      +
11    10:19614164-19624369(-)  10   19614164   19624369      -
12    10:42537888-42543687(+)  10   42537888   42543687      +
13    10:61927486-61931038(-)  10   61927486   61931038      -
14    10:70699779-70700206(-)  10   70699779   70700206      -
15    10:76532243-76532565(-)  10   76532243   76532565      -
16    10:79336852-79337034(-)  10   79336852   79337034      -
17    10:79342487-79343173(+)  10   79342487   79343173      +
18    10:79373277-79373447(-)  10   79373277   79373447      -
19    10:82322045-82337358(+)  10   82322045   82337358      +

df.shape
(501, 5)

>>>df.dtypes
 id        object
 ch        object
 start     object
 end       object
 strand    object
 dtype: object

Question:

I would like to perform multiple operations based on 'start' and 'end' column

first create two additional columns called

newstart newend  

desiredoperation: if strand == '+':
                     df['newstart'] = end - int(27)
                     df['newend'] = end + 2
                  elif: 
                      strand == '-'
                      df['newstart'] = start - int(3)
                      df['newend'] = start + 26

how can i do this using pandas, I found the link below but not sure how to execute it. If any one can provide a pseudo code will build up on it. adding multiple columns to pandas simultaneously

2 Answers2

1

You can do it using np.where, 2 lines but readable

df['newstart'] = np.where(df.strand == '+', df.end-int(27), df.start-int(3))
df['newend'] = np.where(df.strand == '+', df.end+int(2), df.start+int(26))

    id                         ch   start       end      strand newstart    newend
0   10:100026072-100029645(+)   10  100026072   100029645   +   100029618   100029647
1   10:110931880-110932381(+)   10  110931880   110932381   +   110932354   110932383
2   10:110932431-110933096(+)   10  110932431   110933096   +   110933069   110933098
3   10:111435307-111439556(-)   10  111435307   111439556   -   111435304   111435333
4   10:115954439-115964883(-)   10  115954439   115964883   -   115954436   115954465
5   10:115986231-116018509(-)   10  115986231   116018509   -   115986228   115986257
6   10:116500106-116500762(-)   10  116500106   116500762   -   116500103   116500132
7   10:116654355-116657389(-)   10  116654355   116657389   -   116654352   116654381
8   10:117146840-117147002(-)   10  117146840   117147002   -   117146837   117146866
9   10:126533798-126533971(-)   10  126533798   126533971   -   126533795   126533824
Vaishali
  • 37,545
  • 5
  • 58
  • 86
1

If you want to do it in pandas, df.loc is a good candidate:

df['newstart'] = df['start'] - 3 
df['newend'] = df['start'] + 26
subset = df['strand'] == '+'
df.loc[subset,'newstart']=df.loc[subset,'end']-27
df.loc[subset,'newend']=df.loc[subset,'end']+2

I think it is a good idea to keep using pandas to process your data: it will keep your code consistent, and there is probably a better, shorter way to write the code above.

df.loc is a very useful function to perform data lookup and processing, try to fiddle with it since it is a great tool.

Enjoy

Tom R
  • 66
  • 6
  • Mmm, I must have missed something when pasting the code. I've edited my answer with the working code (tesed on python 3.5). – Tom R May 28 '17 at 15:54