pandas multiple condition and add multiple column

Question

I have a df:

import pandas as pd
df.head(20)
                          id  ch      start        end strand
0   10:100026072-100029645(+)  10  100026072  100029645      +
1   10:110931880-110932381(+)  10  110931880  110932381      +
2   10:110932431-110933096(+)  10  110932431  110933096      +
3   10:111435307-111439556(-)  10  111435307  111439556      -
4   10:115954439-115964883(-)  10  115954439  115964883      -
5   10:115986231-116018509(-)  10  115986231  116018509      -
6   10:116500106-116500762(-)  10  116500106  116500762      -
7   10:116654355-116657389(-)  10  116654355  116657389      -
8   10:117146840-117147002(-)  10  117146840  117147002      -
9   10:126533798-126533971(-)  10  126533798  126533971      -
10  10:127687390-127688824(+)  10  127687390  127688824      +
11    10:19614164-19624369(-)  10   19614164   19624369      -
12    10:42537888-42543687(+)  10   42537888   42543687      +
13    10:61927486-61931038(-)  10   61927486   61931038      -
14    10:70699779-70700206(-)  10   70699779   70700206      -
15    10:76532243-76532565(-)  10   76532243   76532565      -
16    10:79336852-79337034(-)  10   79336852   79337034      -
17    10:79342487-79343173(+)  10   79342487   79343173      +
18    10:79373277-79373447(-)  10   79373277   79373447      -
19    10:82322045-82337358(+)  10   82322045   82337358      +

df.shape
(501, 5)

>>>df.dtypes
 id        object
 ch        object
 start     object
 end       object
 strand    object
 dtype: object

Question:

I would like to perform multiple operations based on 'start' and 'end' column

first create two additional columns called

newstart newend  

desiredoperation: if strand == '+':
                     df['newstart'] = end - int(27)
                     df['newend'] = end + 2
                  elif: 
                      strand == '-'
                      df['newstart'] = start - int(3)
                      df['newend'] = start + 26

how can i do this using pandas, I found the link below but not sure how to execute it. If any one can provide a pseudo code will build up on it. adding multiple columns to pandas simultaneously

score 1 · Accepted Answer · answered May 23 '17 at 00:18

1

You can do it using np.where, 2 lines but readable

df['newstart'] = np.where(df.strand == '+', df.end-int(27), df.start-int(3))
df['newend'] = np.where(df.strand == '+', df.end+int(2), df.start+int(26))

    id                         ch   start       end      strand newstart    newend
0   10:100026072-100029645(+)   10  100026072   100029645   +   100029618   100029647
1   10:110931880-110932381(+)   10  110931880   110932381   +   110932354   110932383
2   10:110932431-110933096(+)   10  110932431   110933096   +   110933069   110933098
3   10:111435307-111439556(-)   10  111435307   111439556   -   111435304   111435333
4   10:115954439-115964883(-)   10  115954439   115964883   -   115954436   115954465
5   10:115986231-116018509(-)   10  115986231   116018509   -   115986228   115986257
6   10:116500106-116500762(-)   10  116500106   116500762   -   116500103   116500132
7   10:116654355-116657389(-)   10  116654355   116657389   -   116654352   116654381
8   10:117146840-117147002(-)   10  117146840   117147002   -   117146837   117146866
9   10:126533798-126533971(-)   10  126533798   126533971   -   126533795   126533824

answered May 23 '17 at 00:18

Vaishali

37,545
5
58
86

Oh I see I need to call numpy as np. sorry I have edited my code but i guess your answer still would be the same right? – novicebioinforesearcher May 23 '17 at 00:22
are you working on jupyter i get this `TypeError: unsupported operand type(s) for -: 'str' and 'int'` – novicebioinforesearcher May 23 '17 at 02:00
So if i execute it on jupyter notebook It works fine, but on terminal i get the above error. – novicebioinforesearcher May 23 '17 at 02:56
1

Is the python version and all its libraries are the same between the environment in the jupyter notebook and the terminal? – arnold May 23 '17 at 03:00
@arnold terminal has . : u'0.20.1' and jupyter has u'0.19.2' – novicebioinforesearcher May 23 '17 at 03:27
1

@novicebioinforesearcher, sorry I missed all the messages. I am on jupyter notebook. One way I can think of is, try df.end.astype(int)-int(27). Since I am not able to reproduce the error, I can only make a guess – Vaishali May 23 '17 at 03:46
@A-Za-z yes it works, thank you i was trying to convert each column in to int – novicebioinforesearcher May 23 '17 at 03:54
1

Thank you for helping out two times in a row! – novicebioinforesearcher May 23 '17 at 04:09

Tom R · Answer 2 · 2017-05-28T18:01:30.133

1

If you want to do it in pandas, df.loc is a good candidate:

df['newstart'] = df['start'] - 3 
df['newend'] = df['start'] + 26
subset = df['strand'] == '+'
df.loc[subset,'newstart']=df.loc[subset,'end']-27
df.loc[subset,'newend']=df.loc[subset,'end']+2

I think it is a good idea to keep using pandas to process your data: it will keep your code consistent, and there is probably a better, shorter way to write the code above.

df.loc is a very useful function to perform data lookup and processing, try to fiddle with it since it is a great tool.

Enjoy

edited May 28 '17 at 18:01

answered May 23 '17 at 04:11

Tom R

66
6

Mmm, I must have missed something when pasting the code. I've edited my answer with the working code (tesed on python 3.5). – Tom R May 28 '17 at 15:54

pandas multiple condition and add multiple column

2 Answers2