0

I want to select, into a new DataFrame, the columns that contain the value 'C' at least once.

protein 1   2   3   4   5
prot1   C   M   D   F   A
prot2   C   D   A   M   A 
prot3   C   C   D   F   A
prot4   S   D   F   C   L
prot5   S   D   A   I   L

So I want to get this:

protein 1   2   4   
prot1   C   M   F   
prot2   C   D   M    
prot3   C   C   F   
prot4   S   D   C   
prot5   S   D   I   

The number of columns can be n. The only examples I found require specifying the column name, which I can't do here. The script should check column by column.

MTG

2 Answers

2
In [22]: df[['protein']].join(df[df.columns[df.eq('C').any()]])
Out[22]:
  protein  1  2  4
0   prot1  C  M  F
1   prot2  C  D  M
2   prot3  C  C  F
3   prot4  S  D  C
4   prot5  S  D  I
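
For anyone who wants to run this outside IPython, here is a self-contained sketch of the same idea on the question's sample data. The frame is rebuilt by hand, so the string column labels '1' to '5' are an assumption about how the real file loads. `df.eq('C')` gives a boolean frame and `.any()` reduces it to one True/False per column; the sketch selects with `df.loc[:, has_c]` instead of indexing `df.columns`, which should be equivalent here.

```python
import pandas as pd

# Sample data from the question (column labels assumed to be strings)
df = pd.DataFrame({
    'protein': ['prot1', 'prot2', 'prot3', 'prot4', 'prot5'],
    '1': list('CCCSS'),
    '2': list('MDCDD'),
    '3': list('DADFA'),
    '4': list('FMFCI'),
    '5': list('AAALL'),
})

# One boolean per column: True if any cell in that column equals 'C'
has_c = df.eq('C').any()

# Keep 'protein' unconditionally, then append the columns that contain 'C'
result = df[['protein']].join(df.loc[:, has_c])
print(result)
```

No column names are hard-coded apart from `protein`, so this works for any number of columns.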
MaxU - stand with Ukraine
1

Use:

import numpy as np
import pandas as pd

np.random.seed(123)
n = np.random.choice(['C','M','D', '-'], size=(3,10))
n[:,0] = ['a','b','w']
foo = pd.DataFrame(n)
print(foo)
   0  1  2  3  4  5  6  7  8  9
0  a  M  D  D  C  D  D  M  -  D
1  b  M  D  M  C  M  D  -  M  C
2  w  C  -  M  -  D  M  C  C  C

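# True for every column that contains at least one 'C'
mask = foo.eq('C').any()
# always keep column 0 (the names) in the output
mask.loc[0] = True

# filter the columns with the boolean mask
print(foo.loc[:, mask])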
   0  1  4  7  8  9
0  a  M  C  M  -  D
1  b  M  C  -  M  C
2  w  C  -  C  C  C
jezrael
  • It's cool, but there's no protein column, and sometimes the columns don't load all values :( – MTG Apr 22 '17 at 10:46
  • So you need to compare only the columns whose names are numbers? Also, is it possible to set columns that are always in the output? – jezrael Apr 22 '17 at 10:49
  • https://scontent-frt3-1.xx.fbcdn.net/v/t35.0-12/18110337_10207220766232360_1526389025_o.png?oh=7dfbf95e92fb78c6f64ace2728de378e&oe=58FDAAB0 I don't know if the link works – MTG Apr 22 '17 at 10:57
  • And yes :) proteins + 1..n – MTG Apr 22 '17 at 11:00
  • Thank you for the link. I think the main question is how to add columns that are always in the output, even if there is no `C` in them. My edited solution also works if all columns are numeric; then `cols2` returns an empty list. – jezrael Apr 22 '17 at 11:01
  • And it also adds all non-numeric columns to the output, like `protein`, `col`... – jezrael Apr 22 '17 at 11:03
  • If there is no C in them, then those columns are omitted – MTG Apr 22 '17 at 11:04
  • Wait, is `protein` the index or a column? What does `print(df.index)` show? – jezrael Apr 22 '17 at 11:04
  • The first column holds the protein names, and all the others hold amino acids (str) or '-' – MTG Apr 22 '17 at 11:05
  • print (df.index) Index(['>Fungi|A0A0TG4.1/69-603 0061-domain-containing protein {ECO:00013|EMBL:E9955.1}', '>Fungi|G3ZV4.1/55-605 Uncharacterized protein {ECO:0013|EMBL:E24375.1}', '>Fungi|G7X4.1/49-584 U domain protein {ECO:0313|EMBL:GAA85617.1}'... – MTG Apr 22 '17 at 11:07
  • Hmm, so the output needs the `protein` column even if there is no `C` in it? And sometimes the `protein` column is missing? If yes, the second solution works perfectly. – jezrael Apr 22 '17 at 11:08
  • And it's no problem if some columns have no values; it works fine too. – jezrael Apr 22 '17 at 11:09
  • Need: protein 11 22 234 324 / prot1 C A C D / prot2 A C D C / prot3 C A D F, i.e. the protein name plus every column where C occurred at least once. Sorry, I don't know how to edit a comment xD – MTG Apr 22 '17 at 11:11
  • OK, I will check your solution :) – MTG Apr 22 '17 at 11:13
  • Can you check what I'm doing wrong? `foo = pd.DataFrame(n)` `foo = foo.rename(columns = {0: 'protein'})` `df = foo.set_index('protein')` `new = df.loc[:,mask]` Can you check what's wrong? `cols2 = df.columns[~df.columns.str.isdigit()]` `mask = df.eq('C').any()` `mask.loc[cols2] = True` `new = df.loc[:,mask]` – MTG Apr 22 '17 at 11:30
  • What is `n`? Some data? – jezrael Apr 22 '17 at 11:31
  • Yes. OK, I got it :) Just `foo = pd.DataFrame(n)` `foo = foo.rename(columns = {0: 'protein'})` `new = foo[['protein']].join(foo[foo.columns[foo.eq('C').any()]])` – MTG Apr 22 '17 at 11:37
  • Hmm, but my solution works too. Or is there some problem? `foo = pd.DataFrame(n) foo = foo.rename(columns = {0: 'protein'})` – jezrael Apr 22 '17 at 11:40
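
The two variants discussed in the comments above could look roughly like this. It is only a sketch: the small frame reuses the values from the "Need:" comment, the extra column `'100'` is hypothetical (added so that something actually gets dropped), and string column labels are assumed, since `.str.isdigit()` only works on string labels.

```python
import pandas as pd

df = pd.DataFrame({
    'protein': ['prot1', 'prot2', 'prot3'],
    '11': list('CAC'),
    '22': list('ACA'),
    '100': list('DDD'),   # hypothetical column without any 'C'
    '234': list('CDD'),
    '324': list('DCF'),
})

# Variant 1: 'protein' is a regular column.
# Every column whose label is not purely digits is always kept.
cols2 = df.columns[~df.columns.str.isdigit()]
mask = df.eq('C').any()      # True where a column contains 'C'
mask.loc[cols2] = True       # force the non-numeric columns into the output
new = df.loc[:, mask]
print(new)

# Variant 2: 'protein' is the index.
# The index survives column selection, so the plain mask is enough.
indexed = df.set_index('protein')
new_indexed = indexed.loc[:, indexed.eq('C').any()]
print(new_indexed)
```

In both variants the mask has to be computed from the same frame it is applied to, so that its labels line up with that frame's columns.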