-1

I need to set all the values that are greater than 30 in a dataset as missing values. My dataset is huge, but here is the format of my df:

date A B
01/01/2021 20 5
01/02/2021 15 20
01/01/2022 17 30
01/01/2023 30 40

expected result:

date A B
01/01/2021 20 5
01/02/2021 15 20
01/01/2022 17 NAN
01/01/2023 NAN NAN

Would be also nice to count how many of the values are bigger than 30, in this case, 3.

As asked below, I have tried the code:

df= df.apply(lambda x: [y if y <= 30 else NAN for y in x])

Output:

TypeError: '>=' not supported between instances of 'str' and 'int'

user17717499
  • 151
  • 7
  • he said "example of my df". Can assume that this is a pandas dataframe. Read before downvote. – DevOps Jan 19 '22 at 15:23
  • 1
    also, this question has already been answered : https://stackoverflow.com/questions/43757977/replacing-values-greater-than-a-number-in-pandas-dataframe. If you can't get to the solution yourself please post part of your code in your question :) – DevOps Jan 19 '22 at 15:24
  • Huh, `df[df >= 30] = np.nan`? `df[df >= 30].sum().sum()`? – timgeb Jan 19 '22 at 15:30
  • It is a pandas dataframe. – user17717499 Jan 19 '22 at 15:32
  • @timgeb I get the error "TypeError: '>=' not supported between instances of 'str' and 'int' " – user17717499 Jan 19 '22 at 15:34

2 Answers2

0

you should try:

df[df[["A", "B"]].astype(int)>=30] = pd.NA
  • In my original dataframe I have a very huge amount of columns, so I cannot write the name of each one in my command. I have tried: df3 [df3 [ [ ] ].astype (int)>=3] = pd.NA which returns " TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value" – user17717499 Jan 19 '22 at 15:43
  • 'columns = df.columns[df.columns!="date"]' and after 'df[df[columns].astype(int)>=30] = pd.NA', but it will work only if all other columns except "date" are numeric columns – Salvatore Daniele Bianco Jan 19 '22 at 16:51
  • The first line of code worked. Even though all my other columns are numeric, they contain some values non-numeric (NAN), because of that, the second line of code results in "IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer". Any idea on how to get around it? – user17717499 Jan 19 '22 at 20:25
  • Ok sorry for that it think that in this case you must cast to 'float' instead of 'int'. Try and let me know. Thanks – Salvatore Daniele Bianco Jan 20 '22 at 08:50
  • I have tried it, but it returns me "ValueError: cannot reindex from a duplicate axis!" . I do not understand what this error means. Do I have duplicated column names? I run " df3.columns.duplicated().any() ", and returned me "True" – user17717499 Jan 20 '22 at 10:05
0

You can replace the values in individual columns greater than or equal to 30 with 'NAN', and then count them.

enter image description here

df = pd.DataFrame(d)
df.A[df.A >= 30] = 'NAN'
df.B[df.B >= 30] = 'NAN'
print(df)
print('Values >= 30 in column A =', df.A.value_counts()['NAN'])
print('Values >= 30 in column B =', df.B.value_counts()['NAN'])