0

I have a long dataset with student grades and courses going across many semesters. It has many NAs and many rows for each student. I want it to have one long row per student to fill in those NAs but keep the same column names.

Here's a sample:

library(tidyverse)
sample <- tibble(student = c("Corey", "Corey", "Sibley", "Sibley"),
                 fall_course_1 = c("Math", NA, "Science", NA),
                 fall_course_2 = c(NA, "English", NA, NA),
                 fall_grade_1 = c(90, NA, 98, NA),
                 fall_grade_2 = c(NA, 60, NA, NA))

And here's what I'd like it to look like:

library(tidyverse)
answer <- tibble(student = c("Corey", "Sibley"),
                 fall_course_1 = c("Math", "Science"),
                 fall_course_2 = c("English", NA),
                 fall_grade_1 = c(90, 98),
                 fall_grade_2 = c(60, NA))

Some semesters, some students take many classes and some just one. I've tried using coalesce(), but I can't figure it out. Any help would be appreciated!

J.Sabree
  • 2,280
  • 19
  • 48
  • What did you try that didn't work? – camille Mar 12 '20 at 00:03
  • @camille, I tried grouping by the student column and using coalesce(). I then tried the suggestion on this link, but it goes across columns, not down them: https://community.rstudio.com/t/coalesce-rows-and-remove-nas-but-keep-all-non-na-rows-for-each-unique-group/15894. – J.Sabree Mar 12 '20 at 00:06
  • 1
    Instead of continuing the struggle, you should probably reshape your data set into long format, where each grade is a row. This will make these sort of contortions unnecessary. See https://stackoverflow.com/questions/2185252/reshaping-data-frame-from-wide-to-long-format – Axeman Mar 12 '20 at 00:07
  • I should have been more specific. Add your code attempts to the question, so folks can see the logic you're approaching it with and aren't suggesting things you've already tried – camille Mar 12 '20 at 00:12

4 Answers4

4

This should do it, pivot the data long, remove the na's and then pivot it back to wide.

You need to convert the numeric values to character temporarily so they can go in the same column as the course labels, then type_convert() is a lazy way to put them back again.

library(dplyr)
library(tidyr)
library(readr)

reshaped <- sample %>%
  mutate_if(is.numeric, as.character) %>%
  pivot_longer(-student) %>% 
  drop_na() %>% 
  pivot_wider(student, names_from = name, values_from = value) %>% 
  type_convert()
Axeman
  • 32,068
  • 8
  • 81
  • 94
NeilC
  • 188
  • 9
2

Using a custom colaesce function and dplyr:

coalesce_all_columns <- function(df) {
  return(coalesce(!!! as.list(df)))
}

library(dplyr)

sample %>%
  group_by(student) %>%
  summarise_all(coalesce_all_columns)

# A tibble: 2 x 5
  student fall_course_1 fall_course_2 fall_grade_1 fall_grade_2
  <chr>   <chr>         <chr>                <dbl>        <dbl>
1 Corey   Math          English                 90           60
2 Sibley  Science       NA                      98           NA
Edward
  • 10,360
  • 2
  • 11
  • 26
2

You could get the first non-NA value in each column for each student.

library(dplyr)
sample %> group_by(student) %>% summarise_all(~na.omit(.)[1])

# A tibble: 2 x 5
#  student fall_course_1 fall_course_2 fall_grade_1 fall_grade_2
#  <chr>   <chr>         <chr>                <dbl>        <dbl>
#1 Corey   Math          English                 90           60
#2 Sibley  Science       NA                      98           NA

This approach returns NA if there are all NA values in a group.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • this is great, thank you! Quick follow up, can you explain the summarise_all(~na.omit(.)[1]) aspect of your code? I can confirm that it works, but I'd love to know how to read that. Thank you! – J.Sabree Mar 12 '20 at 00:13
  • @J.Sabree As the name suggests, `na.omit` removes all the `NA` values and we select 1st value after removing `NA`'s using indexing `[1]`. When you have all `NA` values in the data, it returns `NA` by default. `na.omit(c(NA, NA, NA))[1]` – Ronak Shah Mar 12 '20 at 00:17
2

You could also use data.table package as follows:

library(data.table)
setDT(sample)[, lapply(.SD, na.omit), student]

sample
# 1:   Corey          Math       English           90           60
# 2:  Sibley       Science          <NA>           98           NA