2

I have a wide data frame that I want to convert to a long dataframe.

This isn't the actual wide dataframe I'm using. There are many more courses and many more 'values' for each course, so the dataframe is much wider than this. Not every course has all value columns associated with it (hence why bio1Csem is not there in the below dataframe).

I decided to attempt the solution on a smaller dataframe since I was running into so many issues on the larger dataframe). Unfortunately, I'm still struggling.

The data frame I'm working with:

>X = rbind( c( "1", "2.5","3.7","","2006 Fall","2007 Fall","Smith","Hu",""), 
c( "2" ,"3.7", "3.7", "3.5", "2007 Spring", "2007 Fall", 
"Smith","Hu","Langdon"), c("3" ,"4", "3.2", "4", "2007 Spring", "2007 Fall", 
"Smith","Hu","Langdon"))
> colnames(X) = c('id','bio1Agrade','bio1Bgrade','bio1Cgrade','bio1Asem',
'bio1Bsem','bio1Aprof', 'bio1Bprof','bio1Cprof')

> X
     id  bio1Agrade bio1Bgrade bio1Cgrade bio1Asem      bio1Bsem     bio1Aprof bio1Bprof bio1Cprof     
[1,] "1" "2.5"      "3.7"      ""         "2006 Fall"   "2007 Fall"  "Smith"   "Hu"      ""
[2,] "2" "3.7"      "3.7"      "3.5"      "2007 Spring" "2007 Fall"  "Smith"   "Hu"      "Langdon" 
[3,] "3" "4"        "3.2"      "4"        "2007 Spring" "2007 Fall"  "Smith"   "Hu"      "Langdon" 

I want it to look like this instead:

id    course  grade  semester     prof
1     bio1A   2.5    2006 Fall    Smith
1     bio1B   3.7    2007 Fall    Hu 
1     bio1C               
2     bio1A   3.7    2007 Spring  Smith
2     bio1B   3.7    2007 Fall    Hu
2     bio1C   3.5                 Langdon
3     bio1A   4      2007 Spring  Smith
3     bio1B   3.2    2007 Fall    Hu
3     bio1C   4                   Langdon

I figured reshape won't work since all my columns names are only characters without any obvious separators, and not all courses have, in this case, 3 columns that correspond to it.

I also thought about attempting a solution using tidyr, I'm struggling how to use it for multiple values.

Do any of you have a suggestion on how to go about this problem? Would it be easier to rename the columns, and add empty columns for courses that are 'missing' a column and use reshape? Is there another, ideally easier, way?

2 Answers2

1
library(dplyr)
library(tidyr)

df %>%
  gather(temp_col, value, -id) %>%
  mutate(course = gsub("(.*)(grade|sem|prof)", "\\1", temp_col),
         column_name = gsub("(.*)(grade|sem|prof)","\\2", temp_col)) %>%
  select(-temp_col) %>%
  spread(column_name, value)

Output is:

  id course grade    prof         sem
1  1  bio1A   2.5   Smith   2006 Fall
2  1  bio1B   3.7      Hu   2007 Fall
3  1  bio1C  <NA>                <NA>
4  2  bio1A   3.7   Smith 2007 Spring
5  2  bio1B   3.7      Hu   2007 Fall
6  2  bio1C   3.5 Langdon        <NA>
7  3  bio1A     4   Smith 2007 Spring
8  3  bio1B   3.2      Hu   2007 Fall
9  3  bio1C     4 Langdon        <NA>

Sample data:

df <- structure(list(id = 1:3, bio1Agrade = c(2.5, 3.7, 4), bio1Bgrade = c(3.7, 
3.7, 3.2), bio1Cgrade = c(NA, 3.5, 4), bio1Asem = c("2006 Fall", 
"2007 Spring", "2007 Spring"), bio1Bsem = c("2007 Fall", "2007 Fall", 
"2007 Fall"), bio1Aprof = c("Smith", "Smith", "Smith"), bio1Bprof = c("Hu", 
"Hu", "Hu"), bio1Cprof = c("", "Langdon", "Langdon")), .Names = c("id", 
"bio1Agrade", "bio1Bgrade", "bio1Cgrade", "bio1Asem", "bio1Bsem", 
"bio1Aprof", "bio1Bprof", "bio1Cprof"), class = "data.frame", row.names = c(NA, 
-3L))
Nimantha
  • 6,405
  • 6
  • 28
  • 69
Prem
  • 11,775
  • 1
  • 19
  • 33
1

We can do this with melt from data.table which can take multiple measure patterns

library(data.table)
nm1 <- substr(names(df)[-1], 1, 5)
melt(setDT(df), measure = patterns("grade$", "prof$", "sem$"), 
   value.name = c("grade", "prof", "sem"),
     variable.name = "course")[, course := nm1[course]][order(id)]
#   id course grade    prof         sem
#1:  1  bio1A   2.5   Smith   2006 Fall
#2:  1  bio1B   3.7      Hu   2007 Fall
#3:  1  bio1C    NA                  NA
#4:  2  bio1A   3.7   Smith 2007 Spring
#5:  2  bio1B   3.7      Hu   2007 Fall
#6:  2  bio1C   3.5 Langdon          NA
#7:  3  bio1A   4.0   Smith 2007 Spring
#8:  3  bio1B   3.2      Hu   2007 Fall
#9:  3  bio1C   4.0 Langdon          NA
akrun
  • 874,273
  • 37
  • 540
  • 662