
I have the following CSV file:

SRA ID  ERR169499            ERR169498           ERR169497
Label   1                    0                   1
TaxID   PRJEB3251_ERR169499  PRJEB3251_ERR169499 PRJEB3251_ERR169499
333046  0.05                 0.99                99.61
1049    0.03                 2.34                34.33
337090  0.01                 9.78                23.22
99007   22.33                2.90                0.00

I have 92 case columns (label 0) and 95 control columns (label 1). I need to perform an independent two-sample t-test and a rank-sum test. So far I have:

import pandas as pd
from scipy.stats import ttest_ind

df = pd.read_csv('final_out_transposed.csv', header=[1, 2], index_col=[0])
case = df.xs('0', axis=1, level=0).dropna()
ctrl = df.xs('1', axis=1, level=0).dropna()
(tt_val, p_ttest) = ttest_ind(case, ctrl, equal_var=False)

This raises the error: ValueError: operands could not be broadcast together with shapes (92,) (95,).

The traceback is:

File "<ipython-input-152-d58634e75106>", line 1, in <module>
    runfile('C:/IBD Bioproject/New folder/temp_3251.py', wdir='C:/IBD Bioproject/New folder')

File "C:\Users\ksingh1\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)

File "C:\Users\ksingh1\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

File "C:/IBD Bioproject/New folder/temp_3251.py", line 106, in <module>
    tt_val, p_ttest = ttest_ind(case, ctrl, equal_var=False)

File "C:\Users\ksingh1\AppData\Local\Continuum\Anaconda3\lib\site-packages\scipy\stats\stats.py", line 4068, in ttest_ind
    df, denom = _unequal_var_ttest_denom(v1, n1, v2, n2)

File "C:\Users\ksingh1\AppData\Local\Continuum\Anaconda3\lib\site-packages\scipy\stats\stats.py", line 3872, in _unequal_var_ttest_denom
    df = (vn1 + vn2)**2 / (vn1**2 / (n1 - 1) + vn2**2 / (n2 - 1))

ValueError: operands could not be broadcast together with shapes (92,) (95,)

I have read a few posts and gone through the NumPy broadcasting documentation, but it is still unclear to me.

Thanks in advance

K.S

1 Answer

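The objects returned by the xs method of the Pandas DataFrame are two-dimensional. By default ttest_ind works along axis=0, so it computes one statistic per column and then tries to combine results of lengths 92 and 95, which cannot be broadcast together. To run a single test over all the values, flatten both objects to one-dimensional arrays before passing them to ttest_ind.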

Try this:

ttest_ind(case.values.ravel(), ctrl.values.ravel(), equal_var=False)

The values attribute of the Pandas objects gives a numpy array, and the ravel() method flattens the array to one dimension.
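
As a rough illustration of the explanation above, the sketch below uses synthetic arrays in place of `case.values` and `ctrl.values` (the 4 rows and the random data are assumptions made for the example, not the asker's data). It mirrors the `vn1 + vn2` step from the traceback to show where the shapes `(92,)` and `(95,)` come from, then runs the flattened test. Note that flattening pools every row into one sample per group, which may not be what is wanted if each row is a separate taxon.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Synthetic stand-ins for case.values and ctrl.values (assumed data):
# 4 rows (taxa), with 92 case columns and 95 control columns.
case = rng.normal(size=(4, 92))
ctrl = rng.normal(size=(4, 95))

# Along axis=0 the per-column variance terms have one entry per column,
# so their shapes are (92,) and (95,), as in the traceback.
vn1 = case.var(axis=0, ddof=1) / case.shape[0]
vn2 = ctrl.var(axis=0, ddof=1) / ctrl.shape[0]
try:
    vn1 + vn2
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# Flattening gives one 1-D sample per group, so a single Welch test is run.
t_stat, p_value = ttest_ind(case.ravel(), ctrl.ravel(), equal_var=False)
print(t_stat, p_value)
```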

Warren Weckesser
  • It worked wonders. Thank you so much; I had been stuck on this for a long time. – K.S Feb 01 '18 at 11:13
  • Hi, the code works fine, but I am not able to get the results for each row. I tried `df.iterrows` but it doesn't seem to work. How can I get the t-test results for each row of case and control? I have been trying but can't find the right approach. Thanks – K.S Feb 05 '18 at 08:06
  • Can you convert your `case` and `ctrl` objects to plain two-dimensional numpy arrays (with shape `(n, 92)` and `(n, 95)`, resp.)? Then you can use `ttest_ind(case, ctrl, axis=1, equal_var=False)`. – Warren Weckesser Feb 05 '18 at 09:42
  • I don't know if that is possible for me, because I have a CSV file with multiple columns labelled `0` and `1`, for which I did `case = df.xs('0', axis=1, level=0).dropna()` and `ctrl = df.xs('1', axis=1, level=0).dropna()`. – K.S Feb 05 '18 at 09:54
  • Is there any other way I can do it? – K.S Feb 05 '18 at 10:04
  • Can you create a [minimal, complete and verifiable example](https://stackoverflow.com/help/mcve) that we can run? I don't really understand what your data looks like. Somehow you have to restructure your data to either pass individual rows to `ttest_ind` (and use a loop to do all the tests), or pass 2-d numpy arrays as in my previous comment. – Warren Weckesser Feb 05 '18 at 10:18
  • Sure. Will do that in the question. – K.S Feb 05 '18 at 10:20
  • I have updated my question. I have a total of 744 rows and 186 columns, as mentioned in my question above. – K.S Feb 05 '18 at 10:25
  • Hi, can I use `pandas.DataFrame.as_matrix` to make it a 2-d numpy array? Also, I have found [a post](https://stackoverflow.com/questions/13187778/convert-pandas-dataframe-to-numpy-array-preserving-index) suggesting `df.to_records`. Will this be helpful in my case? – K.S Feb 05 '18 at 15:16
  • @K.S. Sorry, I don't use Pandas enough to answer that. My suggestion hasn't completely solved your problem, so feel free to unaccept this answer, add the Pandas tag to the question, and add more information to the question about your DataFrame and what you want to do with it. – Warren Weckesser Feb 05 '18 at 18:41
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/164594/discussion-between-k-s-and-warren-weckesser). – K.S Feb 06 '18 at 08:00
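
The row-wise approach suggested in the comments above (2-d numpy arrays plus `axis=1`) might look roughly like the sketch below. The frame is a small hypothetical one built in memory (3 taxa, 5 case and 6 control columns, made-up sample names and random values), standing in for the DataFrame read from the CSV. It uses `DataFrame.to_numpy()` for the conversion, since `as_matrix` was removed in pandas 1.0, and it loops over rows for the rank-sum test rather than assuming `ranksums` accepts an `axis` argument.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind, ranksums

# Hypothetical frame: rows are taxa, columns carry a (label, sample) MultiIndex.
rng = np.random.default_rng(0)
labels = ['0'] * 5 + ['1'] * 6
samples = [f"S{i}" for i in range(11)]
columns = pd.MultiIndex.from_arrays([labels, samples])
df = pd.DataFrame(rng.normal(size=(3, 11)),
                  index=[333046, 1049, 337090], columns=columns)

# Select each group by label and convert to plain 2-D numpy arrays.
case = df.xs('0', axis=1, level=0).to_numpy()  # shape (3, 5)
ctrl = df.xs('1', axis=1, level=0).to_numpy()  # shape (3, 6)

# axis=1 runs one Welch t-test per row (per taxon).
t_stat, p_ttest = ttest_ind(case, ctrl, axis=1, equal_var=False)

# Rank-sum test per row, using an explicit loop over paired rows.
p_ranksum = np.array([ranksums(a, b).pvalue for a, b in zip(case, ctrl)])

results = pd.DataFrame(
    {"t": t_stat, "p_ttest": p_ttest, "p_ranksum": p_ranksum},
    index=df.index,
)
print(results)
```

With real data, any `dropna()` call would need to keep the same rows in both groups so that row `i` of `case` and row `i` of `ctrl` refer to the same taxon.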