
I'd like to know how to write an R command

which(apply(data, 2, var)==0)

... in Python.

Now I'm trying to run an R script that does a PCA.
But pca() accepts only non-constant columns (i.e., columns whose variance is not 0).
For example, only Col2 would be accepted as a non-constant column in the following:

Col1 Col2 Col3
0.0  1.2  4.0
0.0  1.5  4.0
0.0  1.3  4.0
0.0  1.1  4.0

I thought I had removed all constant columns.
However, I got an error:

> Error in prcomp.default(data, center = TRUE, scale = TRUE) : 
  cannot rescale a constant/zero column to unit variance

I googled and found this question, with this R command as the solution:

which(apply(oopsmat, 2, var)==0)

It worked for me. The command specified which columns were still constant.
So, I removed the columns manually, and the R script did a PCA.

Now I'd like to do the same thing in Python.
How would you write this R command in Python?

#####################################################

Please don't read below, or you'll waste your time.
I leave this as evidence that I asked a silly question:

NOTE: This R command is strange.
As I said, I had already removed all constant columns from my data.
Yet this R command says that the variance of the following column is 0
(excerpted from roughly 50,000 rows):

:  
0  
0  
4  
0  
19  
0  
32  
61  
878  
4  
1  
13  
16  
14  
2  
4  
13  
:  

The result of Excel's variance command VAR.P is 231.4.
This is not even close to 0!
I don't know what's going on and I can't find such a command in Python.
So, please explain this strange behavior, too.

*I had overlooked the code that removed all outliers; that's why only 0s were left.

IanHacker
  • In R, `apply` is run on arrays and (ill-advised but possible) on data frames (see #1 on [here](https://stackoverflow.com/users/3001626/david-arenburg)). These structures are built into R, but not into Python, whose standard library does not provide arrays or data frames. Please tag and describe the Python module being used. – Parfait Jul 31 '20 at 15:10
  • Incidentally, the `which` function call is superfluous in the R code. – Konrad Rudolph Aug 04 '20 at 15:20
  • @Parfait OK, understood. I accept your answer as it is. Thank you. – IanHacker Aug 05 '20 at 06:54

1 Answer


Essentially, the R command apply(data, 2, var) runs on two-dimensional structures such as matrices or data frames (though it is not advised for the latter) to compute the variance of every column:

Data frame

set.seed(73120)

random_df <- data.frame(
  num1 = runif(500, 1, 100),
  num2 = runif(500, 1, 100),
  num3 = runif(500, 1, 100),
  num4 = runif(500, 1, 100),
  num5 = runif(500, 1, 100)
)

apply(random_df, 2, var)
#     num1     num2     num3     num4     num5 
# 822.9465 902.5558 782.4820 804.1448 830.1097 

And once which is applied, the indices of the named vector (i.e., a 1-D array) that satisfy the logical condition are returned.

which(apply(random_df, 2, var) > 900)
# num2 
#    2 

Matrix

set.seed(73120)

random_mat <- replicate(5, runif(500, 1, 100))

apply(random_mat, 2, var)
# [1] 822.9465 902.5558 782.4820 804.1448 830.1097

which(apply(random_mat, 2, var) > 900)
# [1] 2

Pandas

In Python, using pandas (the data analysis library), the equivalent is also apply: DataFrame.apply with axis set to index runs the operation down every column. Equivalently, you can call DataFrame.aggregate. The return value is a pandas Series, similar to R's named vector as a 1-D array.

import numpy as np
import pandas as pd

np.random.seed(7312020)

random_df = pd.DataFrame({'num1': np.random.uniform(1, 100, 500),
                          'num2': np.random.uniform(1, 100, 500),
                          'num3': np.random.uniform(1, 100, 500),
                          'num4': np.random.uniform(1, 100, 500),
                          'num5': np.random.uniform(1, 100, 500)
                         })

agg1 = random_df.apply('var', axis='index')
print(agg1)
# num1    828.538378
# num2    810.755215
# num3    820.480400
# num4    811.728108
# num5    885.514924
# dtype: float64

agg2 = random_df.aggregate('var')
print(agg2)
# num1    828.538378
# num2    810.755215
# num3    820.480400
# num4    811.728108
# num5    885.514924
# dtype: float64
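
Pandas also exposes variance directly as a DataFrame method, so the same per-column aggregation can be written without apply. A minimal sketch, assuming the random_df defined above:

agg3 = random_df.var()  # DataFrame.var() uses ddof=1 (sample variance) by default, like R's var()
print(agg3)
# same Series of per-column variances as agg1/agg2 above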

R's which can be achieved with simple bracket indexing [...] (also doable in R), .loc, or where (which keeps the original dimensions):

agg1[agg1 > 850]
# num5    885.514924
# dtype: float64

agg1.loc[agg1 > 850]
# num5    885.514924
# dtype: float64

agg1.where(agg1 > 850)
# num1           NaN
# num2           NaN
# num3           NaN
# num4           NaN
# num5    885.514924
# dtype: float64
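
Putting this together, the original which(apply(data, 2, var) == 0) translates to filtering the variance Series for zeros. A minimal sketch, assuming a hypothetical all-numeric DataFrame named data like the one that raised the prcomp error:

variances = data.var()                              # per-column variances (ddof=1, like R's var)
constant_cols = variances[variances == 0].index     # labels of zero-variance columns
print(list(constant_cols))

data_nonconstant = data.drop(columns=constant_cols) # drop them before running PCA

In practice, comparing against a small tolerance (e.g. variances < 1e-12) can be safer than an exact == 0 check because of floating-point round-off.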

Numpy

Additionally, using Python's numpy (the numerical computing library built around arrays), you can use numpy.apply_along_axis. To match pandas' var, adjust the default ddof accordingly (pandas uses ddof=1, while numpy defaults to ddof=0):

random_arr = random_df.to_numpy()

agg = np.apply_along_axis(lambda x: np.var(x, ddof=1), 0, random_arr)
print(agg)
# [828.53837793 810.75521479 820.48039962 811.72810753 885.51492378]

print(agg[agg > 850])
# [885.51492378]
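
The closest numpy analogue of R's which is a boolean mask passed to numpy.flatnonzero (or numpy.where), which returns the integer positions meeting the condition. A short sketch using the array above:

col_var = np.var(random_arr, axis=0, ddof=1)  # per-column variances in one call
print(np.flatnonzero(col_var > 850))          # positions meeting the condition
# [4]

trimmed = random_arr[:, col_var != 0]         # keep only the non-constant columns

Note that these positions are 0-based, whereas R's which returns 1-based indices.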
Parfait
  • This is good, except it forgets/violates the advice your own comment gave, because you are using `apply` on a data.frame. You could just use `sapply`/`vapply` here (or, since the point is to explain OP’s code, a matrix instead of a data frame). – Konrad Rudolph Aug 04 '20 at 15:20
  • Thanks @KonradRudolph. Yes, I do run `apply` on an R data frame, really to show the translation to Pandas. I do not endorse the practice, though, and have edited the opening text. – Parfait Aug 04 '20 at 21:06