I have been trying to use Python and bioinfokit to create a volcano plot of gene expression data that's in an Excel file. I used pandas to load the data into a DataFrame and then dropped the rows whose fold_change is not positive. I then tried to create the volcano plot in the last line of code.

import pandas as pd
import numpy as np
import bioinfokit
from bioinfokit import analys, visuz


panda_brie = pd.read_csv("C:\\Users\\amorgan\\Documents\\brie_gRNA_stats.csv", encoding='ISO-8859-1', low_memory=False)
shape = panda_brie.shape
print(shape)
panda_brie = panda_brie.loc[(panda_brie[("fold_change")] > 0)]
shape = panda_brie.shape
print(shape)


bioinfokit.visuz.gene_exp.volcano(df=panda_brie, lfc="log_fold_change", pv="log_p_value")

I received the following error and am not sure what to do.

Traceback (most recent call last):
  File "C:/Users/amorgan/AppData/Local/Programs/Python/Python39/graphing brie data.py", line 19, in <module>
    bioinfokit.visuz.gene_exp.volcano(df=panda_brie, lfc="log_fold_change", pv="log_p_value")
  File "C:\Users\amorgan\AppData\Local\Programs\Python\Python39\lib\site-packages\bioinfokit\visuz.py", line 397, in volcano
    df.loc[(df[lfc] >= lfc_thr[0]) & (df[pv] < pv_thr[0]), 'color_add_axy'] = color[0]  # upregulated
  File "C:\Users\amorgan\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\ops\common.py", line 69, in new_method
    return method(self, other)
  File "C:\Users\amorgan\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\arraylike.py", line 52, in __ge__
    return self._cmp_method(other, operator.ge)
  File "C:\Users\amorgan\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\series.py", line 5501, in _cmp_method
    res_values = ops.comparison_op(lvalues, rvalues, op)
  File "C:\Users\amorgan\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\ops\array_ops.py", line 284, in comparison_op
    res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues)
  File "C:\Users\amorgan\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\ops\array_ops.py", line 73, in comp_method_OBJECT_ARRAY
    result = libops.scalar_compare(x.ravel(), y, op)
  File "pandas\_libs\ops.pyx", line 107, in pandas._libs.ops.scalar_compare
TypeError: '>=' not supported between instances of 'str' and 'int'

The first few rows of my pandas DataFrame, in case this helps:

                    Unnamed: 0  control.avg  ...  log_fold_change  log_p_value
0   Syt15_GGTACCACAAATGGTACACT         7.80  ...      0.421772618     9.665546
1  Fbxo21_CTTGTGTGCAAAACCCTCCG         3.67  ...      0.678371984     8.397940
2   Irgc1_GAGGCCCTCGGGTTTCAGCG         3.10  ...      0.736525011     8.151195
3  Ttll12_CCTGTGTCTAGGTCCCTTAG         3.98  ...      0.622833399     9.659556
4   Kdm4b_ATGTCATCATACGTCTGCCG         4.41  ...      0.545893109     9.899629

[5 rows x 24 columns]

Output of panda_brie.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50629 entries, 0 to 53135
Data columns (total 24 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Unnamed: 0                   50629 non-null  object 
 1   control.avg                  50629 non-null  float64
 2   Tg50.avg                     50629 non-null  float64
 3   Tg100.avg                    50629 non-null  float64
 4   Tg150.avg                    50629 non-null  float64
 5   Tg250.avg                    50629 non-null  float64
 6   Treated.vs.Nontreated.p      50629 non-null  float64
 7   Treated.vs.Nontreated.FDR    50629 non-null  float64
 8   Treated.vs.Nontreated.logFC  50629 non-null  float64
 9   Treated.vs.Nontreated.FC     50629 non-null  float64
 10  Dose.Regression.p            50629 non-null  float64
 11  Dose.Regression.FDR          50629 non-null  float64
 12  Dose.Regression.Slope        50629 non-null  float64
 13  gene                         50629 non-null  object 
 14  gRNASeq                      50629 non-null  object 
 15  Unnamed: 15                  0 non-null      float64
 16  Unnamed: 16                  0 non-null      float64
 17  Unnamed: 17                  13 non-null     object 
 18  Unnamed: 18                  3 non-null      object 
 19  Unnamed: 19                  3 non-null      object 
 20  Unnamed: 20                  1 non-null      object 
 21  fold_change                  50629 non-null  float64
 22  log_fold_change              50629 non-null  object 
 23  log_p_value                  50629 non-null  float64
dtypes: float64(16), object(8)
memory usage: 9.7+ MB
  • Hi @AustinMorgan, can you show us (a few lines of) your data? See [how to make a good example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) of your problem so we can help you – Cimbali Jul 09 '21 at 18:56
  • @Cimbali I added in the top few rows of the pandas DataFrame. Would the raw data from the Excel spreadsheet be more helpful? – Austin Morgan Jul 09 '21 at 19:10
  • Not really, but a mix of significant and non-significant genes would. Right now I can’t really reproduce your issue. – Cimbali Jul 09 '21 at 19:49
  • The output of `panda_brie.info()` would be useful too – Cimbali Jul 09 '21 at 21:04
  • The error `'>=' not supported between instances of 'str' and 'int'` suggests that one of the columns holds its values as strings; you should convert that column to integers or floats. – furas Jul 09 '21 at 21:05
  • @Cimbali I'm adding the output of panda_brie.info() – Austin Morgan Jul 09 '21 at 21:32
  • @Cimbali I see what you were going for... I changed the data type of log_fold_change to a float and it worked! Thanks for your help! – Austin Morgan Jul 09 '21 at 21:46
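
For anyone landing here with the same traceback: the `info()` output shows `log_fold_change` with dtype `object`, so its values are strings, and the comparison inside `volcano` fails. Below is a minimal sketch of the fix described in the last comment. The small DataFrame is a hypothetical stand-in for the real CSV, and the `#NUM!` cell is only an assumption about what might have forced the column to be read as text; the actual cause was not stated in the thread.

```python
import pandas as pd

# Hypothetical stand-in for the real data: log_fold_change holds strings,
# here because of one non-numeric cell (an assumed cause, not confirmed).
df = pd.DataFrame({
    "log_fold_change": ["0.421772618", "0.678371984", "#NUM!", "-1.25"],
    "log_p_value": [9.665546, 8.397940, 8.151195, 9.659556],
})

# Convert to numbers; anything that cannot be parsed becomes NaN
# instead of raising an exception.
as_number = pd.to_numeric(df["log_fold_change"], errors="coerce")

# Inspect the rows that failed to parse before discarding them.
bad_rows = df[as_number.isna()]
print(bad_rows)

# Store the numeric column and drop the unparseable rows.
df["log_fold_change"] = as_number
df = df.dropna(subset=["log_fold_change"])

# Numeric comparisons such as the one in the traceback now work.
print(df["log_fold_change"].dtype)
print((df["log_fold_change"] >= 0).sum())
```

If every value in the column is a well-formed number, `df["log_fold_change"].astype(float)` does the same job, but it raises on the first bad cell rather than marking it. After the conversion, the original `volcano(df=panda_brie, lfc="log_fold_change", pv="log_p_value")` call should get past the comparison that raised the `TypeError`.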
