Why are empty levels in my factor tabulated after I assign NAs to missing values?

Question

I have a dataframe df with a column foo containing data of type factor:

df <- data.frame("bar" = c(1:4), "foo" = c("M", "F", "F", "M"))

When I inspect the structure with str(df$foo), I get this:

Factor w/ 3 levels "","F",..: 2 2 2 2 2 2 2 2 2 2 ..

Why does it report 3 levels when there are only 2 in my data?

Edit:

There seems to be a missing value "" that I cleaned up by assigning NA to it. When I call table(df$foo), it still seems to count the "missing value" level, but finds no occurrences:

  F M
0 2 2

However, when I call df$foo I find it reports only two levels:

Levels:  F M

How is it possible that table still counts the empty level, and how can I fix that behaviour?

Seems that you have empty values in some of your cells for `MF`. Try `table(df$MF)` to get the counts. — AntoniosK, Oct 25 '18 at 12:26
I suspect a missing value in the MF column. Are those 4 rows you posted the whole frame, or is there more? — Oliver Baumann, Oct 25 '18 at 12:26
Thanks! I checked whether there were empty cells, but there are none. If I run table(df$MF) I get this: 0, F 220, M 21. Where is that 0 coming from? — Afke, Oct 25 '18 at 12:30
Please don't post images that are badly confidentialised. I suggest you have a peek at [this guide](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) regarding how to post great reproducible R [MCVE](https://stackoverflow.com/help/mcve)s — Oliver Baumann, Oct 25 '18 at 12:47
@Afke, I edited my answer. Please never remove the original question content, as all answers that were made prior to the edit then become incoherent. Instead, please edit the question and **append** any new observations. That way, your question will provide a more complete picture, and the answers will still be coherent. — Oliver Baumann, Oct 25 '18 at 13:06

Oliver Baumann · Accepted Answer · 2018-10-25T13:03:49.673

Check whether your dataframe really has no missing values, because it looks as though it does have some. Try this:

# works because factor-levels are integers, internally; "" seems to be level 1
which(as.integer(df$MF) == 1)

# works if your missing value is just ""
which(df$MF == "")
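
To make the first check less mysterious, here is a minimal sketch of how a factor stores its data, using a hypothetical factor `f` (not taken from the question). It assumes the default level ordering, where "" sorts before "F" and "M" and therefore becomes level 1; if levels were set explicitly in another order, the `== 1` check would point at a different level.

```r
# Hypothetical example factor; by default the levels are the sorted
# unique values, so "" comes first and gets integer code 1
f <- factor(c("M", "F", "F", "M", ""))

levels(f)       # the level labels: "" "F" "M"
as.integer(f)   # the integer codes stored for each element: 3 2 2 3 1

# Both checks identify the same element
which(as.integer(f) == 1)   # 5
which(f == "")              # 5
```

Comparing against the label (`f == ""`) does not depend on the position of the level, so it is probably the safer of the two checks.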

You should then clean up your dataframe to properly reflect missing values. A factor will handle NA:

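# stringsAsFactors = TRUE is needed in R >= 4.0.0, where strings are
# no longer converted to factors by default
df <- data.frame("rest" = c(1:5), "sex" = c("M", "F", "F", "M", ""), stringsAsFactors = TRUE)
df$sex[which(as.integer(df$sex) == 1)] <- NA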

Once you have cleaned your data, you will have to drop unused levels; otherwise tabulations such as table will keep counting occurrences of the empty level.

Observe this sequence of steps and its outputs:

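# Build a dataframe to reproduce your behaviour
# (stringsAsFactors = TRUE is needed in R >= 4.0.0 to get a factor column)
> df <- data.frame("Restaurant" = c(1:5), "MF" = c("M", "F", "F", "M", ""), stringsAsFactors = TRUE)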
# notice the empty level "" for the missing value
> levels(df$MF)
[1] ""  "F" "M"

# notice how a tabulation counts the empty level;
# this is the first column with a 1 (it has no label because
# there is no label, it is "")
> table(df$MF)

  F M 
1 2 2

# find the culprit and change it to NA
> df$MF[which(as.integer(df$MF) == 1)] <- as.factor(NA)

# AHA! So despite us changing the value, the levels of the factor
# were not updated! I wonder what happens if we tabulate the column...
> levels(df$MF)
[1] ""  "F" "M"

# Indeed, the empty level is still present in the factor, but there are
# no occurrences!
> table(df$MF)

  F M 
0 2 2 

# droplevels to the rescue:
# it is used to drop unused levels from a factor or, more commonly,
# from factors in a data frame.
> df$MF <- droplevels(df$MF)

# factors fixed
> levels(df$MF)
[1] "F" "M"

# tabulation fixed
> table(df$MF)

F M 
2 2
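
As a variation on the walk-through above, here is a sketch of two shorter routes to the same result. The data frame is a hypothetical copy of the one used above, and the names `df1` and `df2` are introduced only for illustration. The first route relies on the behaviour that assigning NA to a level label removes that level and turns its values into NA; the second uses droplevels on the whole data frame rather than on a single column.

```r
# Hypothetical data, mirroring the walk-through above
df <- data.frame(Restaurant = 1:5,
                 MF = factor(c("M", "F", "F", "M", "")))

# Variation 1: set the "" level itself to NA.
# The matching values become NA and the level disappears in one step.
df1 <- df
levels(df1$MF)[levels(df1$MF) == ""] <- NA
levels(df1$MF)
table(df1$MF)

# Variation 2: assign NA by value, then drop unused levels
# from every factor column of the data frame at once.
df2 <- df
df2$MF[df2$MF == ""] <- NA
df2 <- droplevels(df2)
levels(df2$MF)
table(df2$MF)
```

Both variations should leave only the levels "F" and "M". Which one reads better is a matter of taste; the second stays closer to the steps shown above.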

Thanks Oliver! I checked whether I had missing values, but both of the top two lines of code return integer(0). How is that possible? Thanks for your comment, I will edit the question to remove the bad images. — Afke, Oct 25 '18 at 12:49
@Afke, I proposed quite a large edit to your question and title because I think it might be interesting to others. See if you like it, or feel free to edit it yourself! :) — Oliver Baumann, Oct 25 '18 at 13:27
