4

I have a 'city' column which has more than 1000 unique entries. (The entries are integers for some reason and are currently assigned float type.)

I tried `df['city'].value_counts()/len(df)` to get their frequencies. It returned a table. The first few values were 0.12, .4, .4, .3, ...

I'm a complete beginner so I'm not sure how to use this information to assign everything in, say, the bottom 10th percentile to 'other'.

I want to reduce the unique city values from 1000 to something like 10, so I can later use get_dummies on this.

smci
Ryder Rude
  • So you want the ten most frequent integers in your `city` column? – Rm4n Jul 24 '20 at 06:02
  • @siamaksafari yeah, and 'other' for the rest – Ryder Rude Jul 24 '20 at 06:11
  • Does this answer your question? [Count frequency of values in pandas DataFrame column](https://stackoverflow.com/questions/36004976/count-frequency-of-values-in-pandas-dataframe-column) – Zaraki Kenpachi Jul 24 '20 at 06:17
  • 1
    @ZarakiKenpachi: no this is clearly not a duplicate of that. That question is merely about counting frequencies, nothing more. This question is about **assigning the 'long tail' of all values with low-frequencies to some Other/dont-care class**, for cardinality reduction. Totally different. – smci Jul 24 '20 at 07:24
  • 1
    *"The entries are integers for some reason and are currently assigned float type."* You control the dtype of how they got read in (with `pd.read_csv()`), in pandas 1.0.x you can now specify dtype='category' instead of int or float. Or convert them with `.astype()`. And if they were floats, that suggests the series has some NaNs, so figure out whether to do `fillna()`/`dropna()` first. – smci Jul 24 '20 at 07:29

1 Answer

5

Let's go through the logic of the steps we need:

  1. Count frequencies for every city
  2. Calculate the frequency value that marks the bottom 10% (the 10th percentile)
  3. Find the cities whose frequencies fall at or below that threshold
  4. Change those cities to 'other'

You started in the right direction. To get frequencies for every city:

city_freq = (df['city'].value_counts())/df.shape[0]

We want to find the bottom 10%. We use pandas' quantile to do it:

bottom_decile = city_freq.quantile(q=0.1)

Now bottom_decile is a float: the frequency threshold that separates the bottom 10% of cities from the rest. Cities with a frequency at or below that threshold:

less_freq_cities = city_freq[city_freq <= bottom_decile]

less_freq_cities will hold the entries for those cities. If you want to change their value in the 'city' column of 'df' to "other":

df.loc[df["city"].isin(less_freq_cities.index.tolist()), "city"] = "other"

complete code:

city_freq = (df['city'].value_counts())/df.shape[0]
bottom_decile = city_freq.quantile(q=0.1)
less_freq_cities = city_freq[city_freq <= bottom_decile]
df.loc[df["city"].isin(less_freq_cities.index.tolist()), "city"] = "other"

This is how you replace the bottom 10% (or whatever fraction you want; just change the `q` parameter in `quantile`) with a value of your choice.

EDIT: As suggested in the comments, to get normalized frequencies it's better to use `city_freq = df['city'].value_counts(normalize=True)` instead of dividing by the shape. But actually, we don't need normalized frequencies here: pandas' `quantile` works on raw counts too, so `city_freq = df['city'].value_counts()` would also work.
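
For completeness, a small self-contained sketch of the whole flow on toy data, using `value_counts(normalize=True)` as suggested above and ending with the `get_dummies` step the question mentions (the toy values and the column prefix are assumptions):

import pandas as pd

df = pd.DataFrame({'city': [1, 1, 1, 2, 2, 3, 4, 5]})

city_freq = df['city'].value_counts(normalize=True)
bottom_decile = city_freq.quantile(q=0.1)
less_freq_cities = city_freq[city_freq <= bottom_decile]

# overwrite only the 'city' column for the rare cities
df.loc[df['city'].isin(less_freq_cities.index), 'city'] = 'other'

# one-hot encode the reduced set of categories
dummies = pd.get_dummies(df['city'], prefix='city')

On this toy frame, cities 3, 4 and 5 fall at or below the 10th-percentile frequency and are replaced by 'other', so `get_dummies` produces three columns instead of five.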

Roim
  • 1
    To get normalized frequencies, you can simplify `df[col].value_counts()/df.shape[0]` to `df[col].value_counts(normalize=True)` – smci Jul 24 '20 at 07:32
  • Thank you. When i'm thinking about it, no need to normalize at all. `quantile` will work without normalization – Roim Jul 24 '20 at 07:40
  • Normalizing frequencies can be useful, in an automated flow. Here the OP already knows they only need the top-10 most common levels. But in general we wouldn't know that. – smci Jul 24 '20 at 07:42
  • Second, it's more efficient, simpler, shorter to construct the list of `most_freq_cities`, which we know is only length 10, than the 990 `less_freq_cities` that we don't want. – smci Jul 24 '20 at 07:44
  • @smci Not sure I understand you. `less_freq_cities` can't be 990 cities. If there are 1000 unique cities, it holds about ~100. Why do you suggest holding the most frequent ones? (maybe it's best if we move to the chat) – Roim Jul 24 '20 at 07:49
  • Roim: the OP says they want to reduce the unique city values from 1000 to ~10. That means 990 cities for `less_freq_cities`, right? (Your statement *"If there are 1000 unique cities, there are ~100 `less_freq_cities`"* is wrong. That's not how long-tailed distributions work. The 'long-tail' will have 990 of the 1000 values, but only <= 0.9 of relative frequency.) (It would help if we pick some standard public dataset as MCVE to use on this question). – smci Jul 24 '20 at 07:53
  • This returned an error saying "Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)." The second answer in this link worked for me, though it's not based on quantiles: https://stackoverflow.com/questions/52663432/dealing-with-sparse-categories-in-pandas-replace-everything-not-in-top-categor (a sketch of that top-N approach appears after these comments) – Ryder Rude Jul 24 '20 at 08:01
  • Then I misunderstood the intention of the question. There is a difference between "let's take the 10 most frequent cities" and "let's cut off the bottom 10 percent". – Roim Jul 24 '20 at 08:58
  • @Roim I was fine with either. But yeah, yours is the answer to the stated question – Ryder Rude Jul 24 '20 at 10:51
  • 3
    To not set ALL columns with the value "other" the last line should be `df.loc[df["city"].isin(less_freq_cities.index.tolist()), "city"] = "other"` – Rune Kaagaard Dec 09 '22 at 11:32
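
For reference, a minimal sketch of the top-N alternative discussed in the comments above: keep only the N most frequent cities and bucket everything else as 'other' (N=10 comes from the question; `df` is assumed to be the question's DataFrame, and the column prefix is an assumption):

import pandas as pd

N = 10  # keep the 10 most frequent cities

top_cities = df['city'].value_counts().nlargest(N).index
df.loc[~df['city'].isin(top_cities), 'city'] = 'other'

# 'city' now has at most N + 1 unique values, ready for one-hot encoding
dummies = pd.get_dummies(df['city'], prefix='city')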