
I have a huge dataset and I'm trying to reduce its dimensionality by removing the variables that fulfill these two conditions:

  1. Count of unique values in a feature / sample size < 10%
  2. Count of the most common value / count of the second most common value > 20

The first condition is no problem; the second is where I'm stuck. Because of the size of the dataset I'm trying to be as efficient as possible, so I'm using numpy, which I understand to be faster than pandas. A possible starting point was numpy-most-efficient-frequency-counts-for-unique-values-in-an-array, but I'm having trouble getting the counts of the two most common values.

My attempt:

n = df.shape[0] / 10

variable = []
condition_1 = []
condition_2 = []

for col in df:
    variable.append(col)
    condition_1.append(df[col].unique().shape[0] < n)
    # This is the part I can't figure out: how do I get these two counts efficiently?
    condition_2.append(most_common_value_count / second_most_common_value_count > 20)

result = pd.DataFrame({"Variables": variable,
                       "Condition_1": condition_1,
                       "Condition_2": condition_2})

The dataset df contains positive and negative values (so I can't use np.bincount), as well as categorical variables, objects, datetimes, dates, and NaN values.

Any suggestions? Remember that it's critical to minimize the number of steps in order to maximize efficiency.

Chris
  • Is there any reason why the [second answer](https://stackoverflow.com/a/25943480/10863327) to the linked question doesn't suit your needs? – Kyle Parsons Aug 03 '21 at 20:56
  • Also, I would challenge the usefulness of dropping columns with few distinct values as a form of dimensionality reduction. For example, this would drop gender from any dataset with more than 20 rows, and often gender is a meaningful feature. It also isn't invariant to dataset size: you may have a feature that truly has 100 levels; you won't drop it if you have fewer than 1000 observations, but you will if you have more. – Kyle Parsons Aug 03 '21 at 20:59
  • https://stackoverflow.com/a/13691215/1275942 also finds that pandas unique performs as well as or better than numpy's, at least in some cases – numpy sorts, pandas does not. – Kaia Aug 03 '21 at 21:06

1 Answer


As noted in the comments, you may want to use np.unique (or pd.unique). You can set return_counts=True to get the value counts. These will be the second item in the tuple returned by np.unique, hence the [1] index below. After sorting them, the most common count will be the last value, and the second most common count will be the next to last value, so you can get them both by indexing with [-2:].
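For a quick illustration of that indexing, here's a minimal sketch on a made-up array (the values are arbitrary):

import numpy as np

arr = np.array(["a", "b", "a", "c", "a", "b", "a"])  # 'a' x4, 'b' x2, 'c' x1
values, counts = np.unique(arr, return_counts=True)  # counts -> array([4, 2, 1])
top_two = np.sort(counts)[-2:]                       # two largest, ascending: array([2, 4])
ratio = top_two[1] / top_two[0]                      # most common / second most common = 2.0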

You could then construct a Boolean list indicating which columns meet your condition #2 (or rather the opposite). This list can then be used as a mask to reduce the dataframe:

import numpy as np

def counts_ratio(s):
    """Take a pandas Series s and return the count of its most common
    value divided by the count of its second most common value."""
    counts = np.sort(np.unique(s, return_counts=True)[1])[-2:]
    if counts.size < 2:
        # A constant column has no second most common value; treating the
        # ratio as infinite flags it for removal under condition #2.
        return np.inf
    return counts[1] / counts[0]

condition2 = [counts_ratio(df[col]) <= 20
              for col in df.columns]

df_reduced = df[df.columns[condition2]]
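
To sanity-check the mask, here's a usage sketch on a small made-up frame (both columns and their values are hypothetical):

import pandas as pd

toy = pd.DataFrame({
    "balanced": list(range(50)) * 2,   # every value appears twice -> ratio 1
    "skewed": [0] * 97 + [1, 2, 3],    # 97 zeros vs. a runner-up of 1 -> ratio 97
})

condition2 = [counts_ratio(toy[col]) <= 20 for col in toy.columns]
reduced = toy[toy.columns[condition2]]   # keeps only 'balanced'; 'skewed' is dropped

Since you want to drop a column only when it meets both of your conditions, you would combine this list elementwise with your Condition_1 results before building the final mask.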
Arne