
I have a problem filling in values in a column with pandas. I want to add strings that describe each customer's annual income class. 20% of the rows in the data frame should get the value "Lowest", 9% should get "Lower Middle", and so on. I thought of building a list, appending the values, and then assigning it to the column, but then I get ValueError: Length of values (5) does not match length of index (500)

list_of_lists = []
list_of_lists.append(int(0.2*len(df))*"Lowest")
list_of_lists.append(int(0.09*len(df))*"Lower Middle")
list_of_lists.append(int(0.5*len(df))*"Middle")
list_of_lists.append(int(0.12*len(df))*"Upper Middle")
list_of_lists.append(int(0.12*len(df))*"Highest")
df["Annual Income"] = list_of_lists

Do you have an idea of what could be the best way to do this?

Thanks in advance. Best regards, Alina

Alina
  • 1. `list_of_lists` is a list of 5 strings, and each string is just the label repeated many times ('LowestLowestLowestLowest...'). Instead of multiplying the string, multiply a list that contains it: `list_of_lists.append(int(0.2*len(df))*["Lowest"])`. Then flatten the result with `list(chain.from_iterable(list_of_lists))` (`from itertools import chain`). 2. This is still not a complete solution: it will fail because the `int(X*len(df))` terms do not add up to the length of the data frame, so the new list will not match it in size. – itaishz Dec 07 '20 at 16:15
  • Please read [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). At the least, we need sample data to copy/paste, and a sample of what you WANT the output to look like. – Ukrainian-serge Dec 07 '20 at 16:25
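For what it's worth, the list-based approach described in the first comment could be sketched roughly as follows. This is only a sketch under assumptions: the 500-row `df` with a `customer_id` column is a hypothetical stand-in, the shares are the adjusted ones that sum to 100% (the question's percentages sum to 103%), and letting the last class absorb any rounding remainder is one possible choice, not the only one.

```python
from itertools import chain

import pandas as pd

# Hypothetical stand-in for the asker's 500-row data frame.
df = pd.DataFrame({"customer_id": range(500)})

# Assumed shares that sum to 1 (the question's own values sum to 1.03).
shares = {
    "Lowest": 0.20,
    "Lower Middle": 0.09,
    "Middle": 0.49,
    "Upper Middle": 0.11,
    "Highest": 0.11,
}

n = len(df)
# round() rather than int() so that float noise such as 54.999... is not truncated.
counts = {label: round(share * n) for label, share in shares.items()}

# Rounding can leave the total a few rows off; let the last class absorb the difference.
last_label = list(shares)[-1]
counts[last_label] += n - sum(counts.values())

# Multiply a one-element list (not the string itself) and flatten the pieces.
labels = list(chain.from_iterable([label] * count for label, count in counts.items()))

df["Annual Income"] = labels
```

The labels come out in contiguous blocks. If the order should be random, the column could be shuffled afterwards, for example with `df["Annual Income"].sample(frac=1).to_numpy()`.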

1 Answer


You can use numpy to make a weighted random choice. The choice function takes the list of options, the number of draws to make, and the probability of each option. You could generate the values this way and then just do df['Annual Income'] = incomes

I've printed out the value counts so you can see what the totals were. Because the draws are random, they will be slightly different on every run.

Also, I had to tweak the probabilities so they add up to 100% (the percentages in the question sum to 103%).

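import pandas as pd
from numpy.random import choice

# Draw 500 labels at random, weighted by the adjusted probabilities
incomes = choice(['Lowest', 'Lower Middle', 'Middle', 'Upper Middle', 'Highest'], 500,
                 p=[.2, .09, .49, .11, .11])

df = pd.DataFrame({'Annual Income': incomes})

# Count how many rows landed in each class
df.value_counts()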

Annual Income
Middle           245
Lowest            87
Upper Middle      66
Highest           57
Lower Middle      45
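If the run-to-run variation is a problem, one option is to draw from a seeded generator so the result is repeatable. The sketch below assumes NumPy's `default_rng` interface; the seed value `0` and the names `labels` and `probs` are arbitrary choices for illustration. It also prints the observed shares, which should land near the requested probabilities without matching them exactly.

```python
import numpy as np
import pandas as pd

labels = ["Lowest", "Lower Middle", "Middle", "Upper Middle", "Highest"]
probs = [0.20, 0.09, 0.49, 0.11, 0.11]  # adjusted so they sum to 1

# A fixed seed (arbitrary value, chosen for illustration) makes the draw repeatable.
rng = np.random.default_rng(seed=0)
incomes = rng.choice(labels, size=500, p=probs)

df = pd.DataFrame({"Annual Income": incomes})

# Observed share of each class, to compare against the requested probabilities.
observed = df["Annual Income"].value_counts(normalize=True)
print(observed)
```

Running this twice with the same seed gives the same column. Changing or dropping the seed brings back the random variation.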
Chris