For a DataFrame named rawfile, the "workclass" column contains ? values. We use rawfile.replace("?", "NaN") to replace every ? with NaN. But what if the column also contains other abnormal values besides ?, such as - or @? How can I detect and replace them?

alex
foyeyefo
  • Welcome to SO, and thanks for showing your efforts in the form of code. Please add your sample data as text so that we can test our solutions on it, thank you. – RavinderSingh13 May 17 '21 at 13:21

2 Answers

Use a regex in the replace method. Set value to whatever you want: the code below uses np.nan, but you could use the string "NaN" if desired.


df['workclass'] = df['workclass'].replace(to_replace=r'@|\?', value=np.nan, regex=True)
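
As a minimal sketch of how this call behaves, here it is applied to a small made-up column. The sample values are hypothetical, since the question's data was only posted as an image, and the column is built with object dtype as an assumption about how the text was loaded.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real column was not posted as text.
df = pd.DataFrame(
    {"workclass": ["Private", "?", "State-gov", "@", "Self-emp"]},
    dtype="object",
)

# r"@|\?" matches a literal @ or a literal ? (the ? must be escaped in a regex).
cleaned = df["workclass"].replace(to_replace=r"@|\?", value=np.nan, regex=True)

# True marks the cells that were turned into missing values.
print(cleaned.isna().tolist())
```

The raw-string prefix r avoids Python treating the backslash as a string escape before the regex engine sees it.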


Joe Thor
  • What if there are many other unknown abnormal values besides \@|\?. Is there a function that summarizes them all? – foyeyefo May 17 '21 at 14:18
  • You could use regex to exclude values that do not fit the pattern. Here is an answer that goes into the details: https://stackoverflow.com/a/2930209/4352317 – Joe Thor May 17 '21 at 14:21
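
One way to read the suggestion in the last comment, sketched under assumptions: instead of listing every bad symbol, describe what a valid value looks like and flag everything else. The rule used here (a letter followed by letters, digits, spaces or hyphens) and the sample values are assumptions for illustration only and would need adjusting to the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values.
s = pd.Series(
    ["Private", "?", "Self-emp-inc", "-", "@", "Local gov"], dtype="object"
)

# Assumed rule: a valid entry starts with a letter and then contains
# only letters, digits, spaces or hyphens.
valid = s.str.fullmatch(r"[A-Za-z][A-Za-z0-9 \-]*", na=False)

# Detect: list the distinct values that break the rule.
print(s[~valid].unique())

# Replace: keep valid entries and set everything else to NaN.
cleaned = s.where(valid, np.nan)
```

Printing the rejected values before replacing them makes it easier to spot a rule that is too strict and is throwing away legitimate categories.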

This depends on which values you consider normal and which abnormal. If there is no predefined set of valid answers, you should use a regex. If, on the other hand, the valid answers are known in advance, you can use this construction:

df.loc[~df["workclass"].isin(allowed_answers), "workclass"] = np.nan
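
For illustration, a sketch with a hypothetical allowed_answers list (the category names and sample rows are assumptions). It first counts the values that would be discarded, which covers the "detect" part of the question, and then applies the same kind of assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical data and whitelist of valid categories.
df = pd.DataFrame(
    {"workclass": ["Private", "?", "State-gov", "-", "Private", "@"]},
    dtype="object",
)
allowed_answers = ["Private", "State-gov", "Self-emp"]

# Boolean mask of rows whose value is not in the whitelist.
unexpected = ~df["workclass"].isin(allowed_answers)

# Detect: how often each unexpected value occurs.
print(df.loc[unexpected, "workclass"].value_counts())

# Replace the unexpected values with a real missing value.
df.loc[unexpected, "workclass"] = np.nan
```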

I also recommend using np.nan instead of "NaN", because np.nan is a real missing value rather than a string.
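
A short sketch of the difference, using made-up values: the string "NaN" is ordinary text, so pandas' missing-value tools such as isna() do not see it.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row column, cleaned in two different ways.
as_text = pd.Series(["Private", "?"], dtype="object").replace("?", "NaN")
as_missing = pd.Series(["Private", "?"], dtype="object").replace("?", np.nan)

print(as_text.isna().sum())     # the string "NaN" is not counted as missing
print(as_missing.isna().sum())  # np.nan is counted as missing
```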

VKRW