In my dataset, one of the columns is Education. There should be 5 categories. However, some of them are repeated. I would like to combine them. What code should I write?

table(df_all$Education)

Output:
Less than Primary Less than Primary           Primary         Secondary Tertiary or above Tertiary or above           Unknown 
              206                 3              1174              3494               455                 3               969 

I would like to merge the two "Less than Primary" categories into one, and likewise the two "Tertiary or above" categories.

Update: I just checked that "Less than Primary" and "Tertiary or above" do not have extra whitespace. I still don't know why they are treated as different.

oohsehun
  • Hi! You should re-factorize your variable. Take a look at this link: https://stackoverflow.com/questions/19410108/cleaning-up-factor-levels-collapsing-multiple-levels-labels – R18 Oct 24 '22 at 06:10
  • @R18 It seems the values are stored in a different form. I have tried all of the methods and they still cannot be modified. I changed the original "Less than Primary" to "A" and then computed `sum(df_all$Education)`, but the output is 0. – oohsehun Oct 24 '22 at 06:50
  • R treats "Less than Primary" and "Less than Primary " as different values because of the trailing space (that may not be the case here). Even a single extra space character makes R see two different values, so you have to take that into account. – R18 Oct 24 '22 at 07:26
  • @R18 I checked, and there is no whitespace. – oohsehun Oct 24 '22 at 07:48
  • What do you get when running `sum(df_all$Education=="Less than Primary")`? – R18 Oct 24 '22 at 08:06
  • They need to provide a `dput()` at this point, as we have no way of knowing how they checked and they are providing conflicting information. Please see here for how to use `dput()` and write a question so others can help you: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – socialscientist Oct 24 '22 at 08:23
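For reference, a rough sketch of what a `dput()` check might look like. The data frame `df_demo` below is a made-up stand-in, not the asker's data, and it assumes the stray character is an ordinary trailing space. Because `dput()` prints each value in quotes, a trailing space becomes visible, and `nchar()` confirms the difference in length.

```r
# Hypothetical stand-in for df_all; the second value has an ordinary trailing space
df_demo <- data.frame(
  Education = c("Primary", "Primary ", "Secondary"),
  stringsAsFactors = TRUE
)

# dput() prints an exact, copy-pasteable representation with each value quoted
dput(levels(df_demo$Education))

# Character counts differ even though the unquoted labels look the same
nchar(levels(df_demo$Education))
```

Pasting the `dput()` output into the question lets others recreate the exact values instead of guessing from the printed table.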

1 Answer

The issue is almost certainly that some of the values in df_all$Education have whitespace around them. You'll need to recode the variable. It appears to be a character vector rather than a factor, as otherwise table() would produce a description of the levels of the factor.

Below are some ways to do this.

# Example character vector: note the extra spaces in x[2]: " foo "
x <- c("foo", " foo ", "bar")

# Problem: looks like duplicated values but they're NOT duplicated -- one just
# has spaces around it and it's hard to see.
table(x)
#> x
#>  foo    bar   foo 
#>     1     1     1

# Solution: recode x by removing leading and trailing whitespace
x_trimmed <- trimws(x, which = "both")

# Fixed
table(x_trimmed)
#> x_trimmed
#> bar foo 
#>   1   2

# Many other ways to recode x to produce the same result

# Alternative 1
x1 <- x
x1[2] <- "foo"
table(x1)
#> x1
#> bar foo 
#>   1   2

# Alternative 2
x2 <- x
x2[x2 %in% c(" foo ")] <- "foo"
table(x2)
#> x2
#> bar foo 
#>   1   2
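If `trimws()` leaves the duplicates in place, one possibility (an assumption, nothing in the question confirms it) is that the extra character is not an ordinary space, for example a non-breaking space (U+00A0). The default `trimws()` pattern `[ \t\r\n]` does not match it. A minimal sketch with made-up data, using the `whitespace` argument that `trimws()` has had since R 3.6.0:

```r
# Made-up example: x[2] ends in a non-breaking space, not a regular space
x <- c("foo", "foo\u00a0", "bar")

# Default trimws() only strips space, tab, CR and LF, so the values stay distinct
length(unique(trimws(x)))

# Inspect the code points of each distinct value; 160 is U+00A0
lapply(unique(x), utf8ToInt)

# "[\\h\\v]" matches all Unicode horizontal and vertical whitespace
x_clean <- trimws(x, which = "both", whitespace = "[\\h\\v]")
table(x_clean)
```

Checking code points with `utf8ToInt()` first is worthwhile because it shows which character is actually present before anything gets recoded.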
socialscientist
  • I have tried your method. The table of the levels showed that there is no whitespace. – oohsehun Oct 24 '22 at 06:59
  • Those are not "levels." Please provide a reproducible example that shows exactly your output from your input. – socialscientist Oct 24 '22 at 07:02
  • After I run `x_trimmed <- trimws(df_all$Education, which = "both")` and check the table, the output is still `x_trimmed Less than Primary Less than Primary Primary Secondary Tertiary or above Tertiary or above Unknown ` – oohsehun Oct 24 '22 at 07:05
  • Need exact code and data to reproduce your result. You can use dput(). However, it's straightforward to see from what you pasted that there are spaces: look at the space in the output between "Secondary" and "Tertiary or above", then look at the space between "Primary" and "Secondary". – socialscientist Oct 24 '22 at 07:07
  • The exact output of `levels(df_all$Education)` is `"Less than Primary" "Less than Primary" "Primary" "Secondary" "Tertiary or above" "Tertiary or above" "Unknown" `. Honestly, I am new to R and I don't know what to do. – oohsehun Oct 24 '22 at 07:14
  • I don't know why you are suddenly using `levels()` when you use `table()` in your post...I never mention using `levels()` at any point. For the question you posted, where using `table()` produces the exact output you provided in the question, this answer should work. If you did not provide the exact output and code, then it likely won't work. I recommend reading this for how to ask questions/provide information in a way that people can help you better. https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – socialscientist Oct 24 '22 at 08:21
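Since `levels()` returns values in the comment above, the column may in fact be a factor. In that case, a sketch of one way to collapse look-alike levels is to clean the level labels themselves. The factor `edu` below is hypothetical, and the non-breaking space in it is an assumption about what the invisible difference might be.

```r
# Made-up factor whose second level ends in a non-breaking space (U+00A0)
edu <- factor(c("Primary", "Primary\u00a0", "Secondary", "Primary"))
nlevels(edu)  # counts the two look-alike labels as separate levels

# Assigning cleaned labels back to levels() merges any that become identical
levels(edu) <- trimws(levels(edu), whitespace = "[\\h\\v]")
table(edu)
```

Working on `levels()` rather than on the values keeps the result a factor. Calling `trimws()` on the factor itself would return a character vector instead.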