I have a huge dataset and I'm trying to reduce its dimensionality by removing every variable that fulfills both of these conditions:
- (Count of unique values in a feature) / (sample size) < 10%
- (Count of the most common value) / (count of the second most common value) > 20
The first condition is no problem; the second is where I'm stuck. I'm trying to be as efficient as possible because of the size of the dataset, so I'm using NumPy, which I understand is faster than pandas for this kind of work. A possible solution was numpy-most-efficient-frequency-counts-for-unique-values-in-an-array, but I'm having a lot of trouble getting the counts of the two most common values.
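For a single numeric column, my understanding of the approach from that answer is something like the following sketch (`a` is just a stand-in array):

```python
import numpy as np

a = np.array([3, -1, 3, 3, 7, -1, 3])              # negatives allowed, unlike np.bincount
values, counts = np.unique(a, return_counts=True)  # count per distinct value
top_two = np.sort(counts)[-2:]                     # the two largest counts
ratio = top_two[1] / top_two[0]                    # most common / second most common
```

This is one pass of `np.unique` plus a sort of the (much smaller) counts array, but it only works cleanly on a homogeneous numeric array.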
My attempt:
```python
n = df.shape[0] / 10
variable = []
condition_1 = []
condition_2 = []
for i in df:
    variable.append(i)
    condition_1.append(df[i].unique().shape[0] < n)
    # most_common_value_count / second_most_common_value_count is the part
    # I don't know how to compute efficiently
    condition_2.append(most_common_value_count / second_most_common_value_count > 20)
result = pd.DataFrame({"Variables": variable,
                       "Condition_1": condition_1,
                       "Condition_2": condition_2})
```
The dataset df contains positive and negative values (so I can't use np.bincount), as well as categorical variables, objects, datetimes, dates, and NaN values.
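One direction I've considered (a sketch, and I'm not sure it's the most efficient): pandas' `value_counts` handles mixed dtypes and drops NaN by default, so the ratio could be computed per column. `top_two_ratio` is just an illustrative helper name:

```python
import numpy as np
import pandas as pd

def top_two_ratio(s):
    # value_counts sorts descending and drops NaN by default; it works on
    # numeric, object, categorical and datetime columns alike
    counts = s.value_counts()
    if len(counts) < 2:
        return np.inf  # a single-valued column trivially exceeds the > 20 threshold
    return counts.iloc[0] / counts.iloc[1]

# toy example standing in for the real dataset
df = pd.DataFrame({
    "a": [1, 1, 1, 2, np.nan],       # numeric with NaN
    "b": ["x", "x", "x", "y", "z"],  # strings
})
ratios = {col: top_two_ratio(df[col]) for col in df}
```

This stays in pandas rather than numpy, but it copes with every column type in one uniform way.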
Any suggestions? Keep in mind that minimizing the number of passes over the data is critical for performance here.