Why are there NA values produced when using pivot_wider?

Question

I'm trying to use pivot wider to create multiple columns/variables containing values, but I NAs in columns I shouldn't.

Here is a representative sample of the data:

df <- structure(list(Condition = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Control", "Retraction1", 
"Retraction2"), class = "factor"), First = structure(c(2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Journalist", 
"Police", "Reviewer", "Spokesperson"), class = "factor"), Second = structure(c(3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Journalist", 
"Police", "Reviewer", "Spokesperson"), class = "factor"), Third = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Journalist", 
"Police", "Reviewer", "Spokesperson"), class = "factor"), Fourth = structure(c(4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("Journalist", 
"Police", "Reviewer", "Spokesperson"), class = "factor"), ID = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", 
"25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", 
"36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", 
"47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", 
"58", "59", "60", "61", "62", "63", "64", "65", "66", "67", "68", 
"69", "70", "71", "72", "73", "74", "75", "76", "77", "78", "79", 
"80", "81", "82", "83", "84", "85", "86", "87", "88", "89", "90", 
"91", "92", "93", "94", "95", "96", "97", "98", "99", "100", 
"101"), class = "factor"), Scenario = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 1L, 2L, 3L, 4L), .Label = c("J", "P", "R", 
"S"), class = "factor"), Estimate = structure(c(4L, 8L, 7L, 11L, 
9L, 12L, 10L, 2L, 5L, 6L, 4L, 7L, 11L, 9L, 12L, 10L, 2L, 3L, 
5L, 6L, 4L, 8L, 7L, 11L, 9L, 12L, 10L, 2L, 5L, 6L, 4L, 8L, 7L, 
11L, 9L, 12L, 10L, 2L, 5L, 6L, 1L, 1L, 1L, 1L), .Label = c("CompMean", 
"P.H.Reps.", "P.H.Reps..1", "P.Rel.", "P.Rel1.Reps.", "P.Rel2.Reps.", 
"P.Rep1.nH.nRel.", "P.Rep1.nH.Rel.", "P.Rep2.nH.nRel.nRep1.", 
"P.Rep2.nH.nRel.Rep1.", "P.Rep2.nH.Rel.nRep1.", "P.Rep2.nH.Rel.Rep1."
), class = "factor"), value = c(90L, 8L, 82L, 11L, 82L, 11L, 
82L, 100L, 99L, NA, 62L, 11L, 91L, 12L, 91L, 5L, 82L, 91L, 80L, 
NA, 92L, 12L, 61L, 18L, 90L, 21L, 81L, 96L, 92L, NA, 91L, 10L, 
72L, 22L, 62L, 21L, 73L, 99L, 98L, NA, 7L, 7L, 7L, 7L)), row.names = c(NA, 
-44L), class = c("tbl_df", "tbl", "data.frame"))

head(df)

This is data from one subject. There should only be NAs in the P.Rel2.Reps. and no other.

However, there are NAs in some of the other columns when I use pivot wider like so:

pivot_wider(df, names_from = Estimate, values_from = value)

Here is an example of how the data look after pivoting wider.

df2 <- structure(list(Condition = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L), .Label = c("Control", "Retraction1", "Retraction2"
), class = "factor"), First = structure(c(2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L), .Label = c("Journalist", "Police", "Reviewer", 
"Spokesperson"), class = "factor"), Second = structure(c(3L, 
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("Journalist", 
"Police", "Reviewer", "Spokesperson"), class = "factor"), Third = structure(c(1L, 
1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Journalist", 
"Police", "Reviewer", "Spokesperson"), class = "factor"), Fourth = structure(c(4L, 
4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Journalist", 
"Police", "Reviewer", "Spokesperson"), class = "factor"), ID = structure(c(1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L), .Label = c("1", "2", "3", 
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", 
"16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", 
"27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", 
"38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", 
"49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", 
"60", "61", "62", "63", "64", "65", "66", "67", "68", "69", "70", 
"71", "72", "73", "74", "75", "76", "77", "78", "79", "80", "81", 
"82", "83", "84", "85", "86", "87", "88", "89", "90", "91", "92", 
"93", "94", "95", "96", "97", "98", "99", "100", "101"), class = "factor"), 
    Scenario = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 
    2L), .Label = c("J", "P", "R", "S"), class = "factor"), P.Rel. = c(90L, 
    62L, 92L, 91L, 57L, 81L, 71L, 80L, 40L, 75L), P.Rep1.nH.Rel. = c(8L, 
    NA, 12L, 10L, 31L, NA, 19L, 17L, 25L, NA), P.Rep1.nH.nRel. = c(82L, 
    11L, 61L, 72L, 89L, 15L, 79L, 84L, 76L, 25L), P.Rep2.nH.Rel.nRep1. = c(11L, 
    91L, 18L, 22L, 35L, 64L, 30L, 22L, 25L, 50L), P.Rep2.nH.nRel.nRep1. = c(82L, 
    12L, 90L, 62L, 62L, 13L, 45L, 53L, 25L, 50L), P.Rep2.nH.Rel.Rep1. = c(11L, 
    91L, 21L, 21L, 15L, 52L, 9L, 10L, 100L, 50L), P.Rep2.nH.nRel.Rep1. = c(82L, 
    5L, 81L, 73L, 67L, 22L, 60L, 61L, 100L, 25L), P.H.Reps. = c(100L, 
    82L, 96L, 99L, 81L, 40L, 71L, 76L, 75L, 90L), P.Rel1.Reps. = c(99L, 
    80L, 92L, 98L, 81L, 80L, 89L, 79L, 75L, 76L), P.Rel2.Reps. = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_), P.H.Reps..1 = c(NA, 
    91L, NA, NA, NA, 80L, NA, NA, NA, 100L), CompMean = c(7L, 
    7L, 7L, 7L, 7L, 7L, 7L, 6L, 4L, 7L)), row.names = c(NA, -10L
), class = c("tbl_df", "tbl", "data.frame"))

head(df2)

I have seen there is a similar post on this topic but it doesn't answer why NAs are being produced in my situation.

Do I need to add some other argument?

score 3 · Accepted Answer · answered Jan 07 '20 at 06:52

3

Looking at the data it looks like you have some corrupted data at one place. You can correct it by

df$Estimate <- replace(df$Estimate, df$Estimate == "P.H.Reps..1", "P.Rep1.nH.Rel.")

and then use pivot_wider which will give you NA only in column i.e P.Rel2.Reps.

tidyr::pivot_wider(df, names_from = Estimate, values_from = value)

answered Jan 07 '20 at 06:52

Ronak Shah

377,200
20
156
213

thanks so much! This worked, but I'm not sure why the data was corrupted in the first place :/ – Con Des Jan 07 '20 at 23:42

eipi10 · Answer 2 · 2020-01-07T07:24:53.983

NA values will result for any combination of categories for the new pivoted columns that aren't present in the original long data frame. For example, let's look at the rows of the long data frame with Estimate=="P.Rep1.nH.Rel.":

df %>% filter(Estimate=="P.Rep1.nH.Rel.")

  Condition First  Second   Third      Fourth       ID    Scenario Estimate       value
1 Control   Police Reviewer Journalist Spokesperson 1     J        P.Rep1.nH.Rel.     8
2 Control   Police Reviewer Journalist Spokesperson 1     R        P.Rep1.nH.Rel.    12
3 Control   Police Reviewer Journalist Spokesperson 1     S        P.Rep1.nH.Rel.    10

Now look at the results of pivot_wider (I've kept only the relevant columns for brevity). Note in the output below that there's a missing value in the P.Rep1.nH.Rel. column. The missing value occurs when Scenario=="P" because the long data frame doesn't have a row for P.Rep1.nH.Rel. with Scenario=="P" resulting in a missing value in the wide data frame. Missing values are occurring in the P.H.Reps..1 column for a similar reason, as there's only one row with Estimate=="P.H.Reps..1 in the long data frame and it has Scenario=="P". Thus, the values are missing for the other three scenarios.

pivot_wider(df, names_from = Estimate, values_from = value) %>% 
   select(Condition:Scenario, P.Rep1.nH.Rel., P.H.Reps..1)

  Condition First  Second   Third      Fourth       ID    Scenario P.Rep1.nH.Rel. P.H.Reps..1
1 Control   Police Reviewer Journalist Spokesperson 1     J                     8          NA
2 Control   Police Reviewer Journalist Spokesperson 1     P                    NA          91
3 Control   Police Reviewer Journalist Spokesperson 1     R                    12          NA
4 Control   Police Reviewer Journalist Spokesperson 1     S                    10          NA

This may be a data error, as suggested by @RonakShah, but if the data are correct then the NA values will naturally result when pivoting to wide format. You can fill the missing values with some other value by adding the argument values_fill=list(value=0) to pivot_wider (you can of course use any fill value you wish; I've just used 0 for illustration). Note that even if you use the values_fill argument, explicit missing values in the original long data will still be preserved in the wide data frame. Only missing values that result from the pivoting operation will be filled with a different value.

Thanks @eipi10. There were actually no missing values in those rows. Ronak's suggestion seemed to work anyway! — Con Des, Jan 07 '20 at 23:55

Why are there NA values produced when using pivot_wider?

2 Answers2

Linked