I have a dataset in which a few values are "?" (see the sample below for reference).

`workclass` has a single "?" in this sample data:

age         workclass fnlwgt     education education_num         marital_status
39         State-gov  77516     Bachelors            13          Never-married
31           Private  45781       Masters            14          Never-married
42           Private 159449     Bachelors            13     Married-civ-spouse
30           Private 188146       HS-grad             9     Married-civ-spouse
30           Private  59496     Bachelors            13     Married-civ-spouse
44           Private 343591       HS-grad             9               Divorced
44           Private 198282     Bachelors            13     Married-civ-spouse
32      Self-emp-inc 317660       HS-grad             9     Married-civ-spouse
17                 ? 304873          10th             6          Never-married
28           Private 377869  Some-college            10     Married-civ-spouse
38  Self-emp-not-inc 120985       HS-grad             9     Married-civ-spouse
40       Federal-gov  56795       Masters            14          Never-married

I have tried `filter`, `where`, and a few other matching functions, but they don't capture the "?", whether as a string or as an int.

I am new to R and have not been able to find a solution for this.

I want to get a count of the values that are "?" and then, based on the count, decide whether to remove those rows or fill them with some meaningful data.

UPDATE:

The data was " ?" rather than "?". I couldn't tell the difference by looking at it. Once I had that information I was able to handle it. It was a human error rather than a problem with the data or the code I was trying :D
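
A minimal sketch of the whitespace problem described in the update, assuming a small made-up data frame `df` (the values are hypothetical and only mimic the sample above). It shows that an exact comparison finds nothing until the leading space is stripped with `trimws()`:

```r
# Hypothetical sample: every workclass value carries a leading space
df <- data.frame(
  age = c(39, 17, 28),
  workclass = c(" State-gov", " ?", " Private"),
  stringsAsFactors = FALSE
)

# Printing the unique values with quotes makes stray spaces visible
print(unique(df$workclass))

# An exact comparison misses the value because " ?" is not equal to "?"
n_exact <- sum(df$workclass == "?")

# trimws() strips leading and trailing whitespace before comparing
df$workclass <- trimws(df$workclass)
n_trimmed <- sum(df$workclass == "?")
```

Here `n_exact` counts the matches before trimming and `n_trimmed` counts them afterwards.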

  • Does this answer your question? [How do I deal with special characters like \^$.?\*|+()\[{ in my regex?](https://stackoverflow.com/questions/27721008/how-do-i-deal-with-special-characters-like-in-my-regex) – NelsonGon Apr 15 '20 at 12:03
  • `i <- which(df$workclass == "?")` Then use the index `i` to your desired effect. `length(i)` gives the count of `"?"`, for instance. – Rui Barradas Apr 15 '20 at 12:03
  • `dplyr`: `filter(df, workclass=="?")` or use `grepl` inside `filter` – NelsonGon Apr 15 '20 at 12:05

2 Answers

Assuming your data frame is called `df`, you can get the count by running `sum(df$workclass == "?")`.

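I would consider converting these values to proper NA values, for example by running `df$workclass <- ifelse(df$workclass == "?", NA, as.character(df$workclass))`. The `as.character()` call matters when the column is a factor: without it, `ifelse` returns the factor's integer codes instead of its labels.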

Once you have converted them to NAs, you can remove the affected rows with `na.omit(df)`, or you can use imputation techniques such as mode imputation or KNN imputation, to name a few. You can read more about imputation techniques and the handling of missing values here: https://towardsdatascience.com/all-about-missing-data-handling-b94b8b5d2184
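
As a rough sketch of the steps above, assuming a small made-up data frame `df` (the values are hypothetical), the following marks "?" as NA, counts the missing values per column, and then shows both row removal and a simple mode imputation. The names `missing_counts`, `df_dropped`, `mode_value`, and `df_imputed` are only illustrative:

```r
# Hypothetical sample data with one "?" in workclass
df <- data.frame(
  age = c(39, 31, 17, 28),
  workclass = c("State-gov", "Private", "?", "Private"),
  stringsAsFactors = FALSE
)

# Mark "?" as missing; direct assignment also works for factor columns
df$workclass[df$workclass == "?"] <- NA

# Count missing values per column
missing_counts <- colSums(is.na(df))

# Option 1: drop every row that contains an NA
df_dropped <- na.omit(df)

# Option 2: mode imputation, filling NAs with the most frequent value
# (table() ignores NA by default)
mode_value <- names(which.max(table(df$workclass)))
df_imputed <- df
df_imputed$workclass[is.na(df_imputed$workclass)] <- mode_value
```

Which option is appropriate depends on how many rows are affected; dropping rows is usually only reasonable when the count is small.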

veghokstvd
  • Just tried this method to convert the ? to NA, but it changed the string values from `State-gov Self-emp-not-inc Private Federal-gov Local-gov ? Self-emp-inc Without-pay Never-worked` to `8 7 5 2 3 NA 6 9 4`. I just ran `unique` before and after converting ? to NA. – Nikhil Balbadri Apr 15 '20 at 13:21
  • I am not sure why that wouldn't work, but you could also try: `df$workclass[df$workclass == "?"] <- NA` – veghokstvd Apr 15 '20 at 13:53

To remove the rows:

revised_df <- df[df$workclass != "?", ]

To replace the value, say with "a":

df$workclass[df$workclass == "?"] <- "a"
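
One caveat worth illustrating when removing rows: negative indexing with `which()` behaves unexpectedly when nothing matches, while logical indexing does not. A minimal sketch, assuming a made-up data frame `df` that contains no "?" at all:

```r
# Hypothetical sample data without any "?" values
df <- data.frame(
  age = c(39, 31, 28),
  workclass = c("State-gov", "Private", "Private"),
  stringsAsFactors = FALSE
)

# which() returns integer(0) here, and df[-integer(0), ] selects zero rows
rows_with_which <- nrow(df[-which(df$workclass == "?"), ])

# Logical indexing keeps all rows when nothing matches
rows_with_logical <- nrow(df[df$workclass != "?", ])
```

If the column can already contain NA values, the logical comparison yields NA for those rows, so they may need to be handled separately.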