Casting with data.table in r based on a unique ID number that can appear in various rows

Question

3rd EDIT

I have tried the tidyverse solution, and it works for my example data, but it does not work in my real data.

For the example:

Example2 <- Example %>% # tidyverse option
  gather(key, value, -(2:6), -Degree_Level) %>%
  unite(key, key, Degree_Level) %>%
  spread(key, value)
dput(Example2)

Gives me this result:

attributes are not identical across measure variables;
they will be droppedstructure(list(Student_ID = c(9010307, 200810309, 200920773, 
201020497, 201030353, 201040559), Doc_Type = c("SSN", "SSN", 
"SSN", "SSN", "SSN", "DL"), Doc_Num = c(506786590, 546764202, 
546849791, 548017430, 547490424, 301147353), Last_Name = c("Sanchez", 
"Rivera", "Anderson", "Yang", "del Torre", "Smith"), First_Names = c("Jose", 
"Ana Maria", "Rachel Anne", "Amanda", "Amanda", "Daniel Erick"
), Campus_A = c(NA, NA, NA, "C", NA, "A"), Campus_B = c("A", 
"A", "B", "C", "A", "A"), Degree_Field_A = c(NA, NA, NA, "Civil Engineering", 
NA, "Education"), Degree_Field_B = c("Education", "Nursing", 
"Psychology", "Civil Engineering", "Psychology", "Education"), 
    Degree_Name_A = c(NA, NA, NA, "BS in Civil Engineering", 
    NA, "BA in Education"), Degree_Name_B = c("MA in Education", 
    "MS in Nursing", "MS in Psychology", "MS in Civil Engineering", 
    "MS in Psychology", "MA in Education"), Department_A = c(NA, 
    NA, NA, "Engineering", NA, "Education"), Department_B = c("Education", 
    "Health Sciences", "Health Sciences", "Engineering", "Health Sciences", 
    "Education"), Diploma_Number_A = c(NA, NA, NA, "7959", NA, 
    "7870"), Diploma_Number_B = c("7876", "7872", "7873", "12689", 
    "7875", "8155"), Exp_A = c(NA, NA, NA, "72", NA, "4"), Exp_B = c("3", 
    "2", "1", "5598", "7", "275"), Gender_A = c(NA, NA, NA, "F", 
    NA, "M"), Gender_B = c("M", "F", "F", "F", "F", "M"), Graduation_Date_A = c(NA, 
    NA, NA, "1440979200", NA, "1438560000"), Graduation_Date_B = c("1438560000", 
    "1438560000", "1438646400", "1512086400", "1438646400", "1445472000"
    ), Project_Type_A = c(NA, NA, NA, "Project", NA, "Project"
    ), Project_Type_B = c("Internship", "Thesis", "Internship", 
    "Thesis", "Thesis", "Internship")), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))

or if I shift the gather to gather(key, value, -(1:6), -Degree_Level) %>% I get this:

attributes are not identical across measure variables;
they will be droppedstructure(list(Exp = c(1, 2, 3, 4, 7, 72, 275, 5598), Student_ID = c(200920773, 
200810309, 9010307, 201040559, 201030353, 201020497, 201040559, 
201020497), Doc_Type = c("SSN", "SSN", "SSN", "DL", "SSN", "SSN", 
"DL", "SSN"), Doc_Num = c(546849791, 546764202, 506786590, 301147353, 
547490424, 548017430, 301147353, 548017430), Last_Name = c("Anderson", 
"Rivera", "Sanchez", "Smith", "del Torre", "Yang", "Smith", "Yang"
), First_Names = c("Rachel Anne", "Ana Maria", "Jose", "Daniel Erick", 
"Amanda", "Amanda", "Daniel Erick", "Amanda"), Campus_A = c(NA, 
NA, NA, "A", NA, "C", NA, NA), Campus_B = c("B", "A", "A", NA, 
"A", NA, "A", "C"), Degree_Field_A = c(NA, NA, NA, "Education", 
NA, "Civil Engineering", NA, NA), Degree_Field_B = c("Psychology", 
"Nursing", "Education", NA, "Psychology", NA, "Education", "Civil Engineering"
), Degree_Name_A = c(NA, NA, NA, "BA in Education", NA, "BS in Civil Engineering", 
NA, NA), Degree_Name_B = c("MS in Psychology", "MS in Nursing", 
"MA in Education", NA, "MS in Psychology", NA, "MA in Education", 
"MS in Civil Engineering"), Department_A = c(NA, NA, NA, "Education", 
NA, "Engineering", NA, NA), Department_B = c("Health Sciences", 
"Health Sciences", "Education", NA, "Health Sciences", NA, "Education", 
"Engineering"), Diploma_Number_A = c(NA, NA, NA, "7870", NA, 
"7959", NA, NA), Diploma_Number_B = c("7873", "7872", "7876", 
NA, "7875", NA, "8155", "12689"), Gender_A = c(NA, NA, NA, "M", 
NA, "F", NA, NA), Gender_B = c("F", "F", "M", NA, "F", NA, "M", 
"F"), Graduation_Date_A = c(NA, NA, NA, "1438560000", NA, "1440979200", 
NA, NA), Graduation_Date_B = c("1438646400", "1438560000", "1438560000", 
NA, "1438646400", NA, "1445472000", "1512086400"), Project_Type_A = c(NA, 
NA, NA, "Project", NA, "Project", NA, NA), Project_Type_B = c("Internship", 
"Thesis", "Internship", NA, "Thesis", NA, "Internship", "Thesis"
)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"
))

The problem is, with my real data, I can do the (1:6) version without any problems, but it doesn't give me the output I want since it does not combine the rows based on Student_ID. But if I try it with (2:6) I get this error:

Error: Each row of output must be identified by a unique combination of keys. Keys are shared for 612 rows: * 113609, 113610 * 109095, 115383 * 110472, 110895 * 114397, 115479 * 113072, 114744 * 114414, 115480 * 108967, 111112 * 110532, 112950 * 110537, 112969 * 110492, 110493 * 110781, 110782 * 114412, 114413 * 115456, 115457 * 116933, 116934 * 117238, 117239 * 117050, 117134 * 115959, 115960 * 114521, 114522 * 13061, 13062 * 8547, 14835 * 9924, 10347 * 13849, 14931 * 12524, 14196 * 13866, 14932 * 8419, 10564 * 9984, 12402 * 9989, 12421 * 9944, 9945 * 10233, 10234 * 13864, 13865 * 14908, 14909 * 16385, 16386 * 16690, 16691 * 16502, 16586 * 15411, 15412 * 13973, 13974 * 38198, 38199 * 33684, 39972 * 35061, 35484 * 38986, 40068 * 37661, 39333 * 39003, 40069 * 33556, 35701 * 35121, 37539 * 35126, 37558 * 35081, 35082 * 35370, 35371 * 39001, 39002 * 40045, 40046 * 41522, 41523 * 41827, 41828 * 41639, 41723 * 40548, 40549 * 39110, 39111 * 138746, 138747 * 134232, 140520 * 135609, 136032 *

2nd EDIT

Thanks for the help so far, I wanted to update with a more useable data example.

> dput(Example)
structure(list(Exp = c(4, 3, 2, 7, 1, 72, 275, 5598), Student_ID = c(201040559, 
9010307, 200810309, 201030353, 200920773, 201020497, 201040559, 
201020497), Doc_Type = c("DL", "SSN", "SSN", "SSN", "SSN", "SSN", 
"DL", "SSN"), Doc_Num = c(301147353, 506786590, 546764202, 547490424, 
546849791, 548017430, 301147353, 548017430), Last_Name = c("Smith", 
"Sanchez", "Rivera", "del Torre", "Anderson", "Yang", "Smith", 
"Yang"), First_Names = c("Daniel Erick", "Jose", "Ana Maria", 
"Amanda", "Rachel Anne", "Amanda", "Daniel Erick", "Amanda"), 
    Gender = c("M", "M", "F", "F", "F", "F", "M", "F"), Degree_Field = c("Education", 
    "Education", "Nursing", "Psychology", "Psychology", "Civil Engineering", 
    "Education", "Civil Engineering"), Department = c("Education", 
    "Education", "Health Sciences", "Health Sciences", "Health Sciences", 
    "Engineering", "Education", "Engineering"), Campus = c("A", 
    "A", "A", "A", "B", "C", "A", "C"), Degree_Name = c("BA in Education", 
    "MA in Education", "MS in Nursing", "MS in Psychology", "MS in Psychology", 
    "BS in Civil Engineering", "MA in Education", "MS in Civil Engineering"
    ), Degree_Level = c("A", "B", "B", "B", "B", "A", "B", "B"
    ), Graduation_Date = structure(c(1438560000, 1438560000, 
    1438560000, 1438646400, 1438646400, 1440979200, 1445472000, 
    1512086400), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    Project_Type = c("Project", "Internship", "Thesis", "Thesis", 
    "Internship", "Project", "Internship", "Thesis"), Diploma_Number = c("7870", 
    "7876", "7872", "7875", "7873", "7959", "8155", "12689")), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame"))

In RStudio it looks like this: When I try the first solution offered, it looks like this:

Example
Example2 <- Example %>%
  gather(key, value, -(2:7), -Degree_Level) %>%
  unite(key, key, Degree_Level) %>%
  spread(key, value)
dput(Example2)

This gives me in the console:

attributes are not identical across measure variables;
they will be droppedstructure(list(Student_ID = c(9010307, 200810309, 200920773, 
201020497, 201030353, 201040559), Doc_Type = c("SSN", "SSN", 
"SSN", "SSN", "SSN", "DL"), Doc_Num = c(506786590, 546764202, 
546849791, 548017430, 547490424, 301147353), Last_Name = c("Sanchez", 
"Rivera", "Anderson", "Yang", "del Torre", "Smith"), First_Names = c("Jose", 
"Ana Maria", "Rachel Anne", "Amanda", "Amanda", "Daniel Erick"
), Gender = c("M", "F", "F", "F", "F", "M"), Campus_A = c(NA, 
NA, NA, "C", NA, "A"), Campus_B = c("A", "A", "B", "C", "A", 
"A"), Degree_Field_A = c(NA, NA, NA, "Civil Engineering", NA, 
"Education"), Degree_Field_B = c("Education", "Nursing", "Psychology", 
"Civil Engineering", "Psychology", "Education"), Degree_Name_A = c(NA, 
NA, NA, "BS in Civil Engineering", NA, "BA in Education"), Degree_Name_B = c("MA in Education", 
"MS in Nursing", "MS in Psychology", "MS in Civil Engineering", 
"MS in Psychology", "MA in Education"), Department_A = c(NA, 
NA, NA, "Engineering", NA, "Education"), Department_B = c("Education", 
"Health Sciences", "Health Sciences", "Engineering", "Health Sciences", 
"Education"), Diploma_Number_A = c(NA, NA, NA, "7959", NA, "7870"
), Diploma_Number_B = c("7876", "7872", "7873", "12689", "7875", 
"8155"), Exp_A = c(NA, NA, NA, "72", NA, "4"), Exp_B = c("3", 
"2", "1", "5598", "7", "275"), Graduation_Date_A = c(NA, NA, 
NA, "1440979200", NA, "1438560000"), Graduation_Date_B = c("1438560000", 
"1438560000", "1438646400", "1512086400", "1438646400", "1445472000"
), Project_Type_A = c(NA, NA, NA, "Project", NA, "Project"), 
    Project_Type_B = c("Internship", "Thesis", "Internship", 
    "Thesis", "Thesis", "Internship")), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))

The problem is that with my actual data sample, I get this error in the console (and I hit Show Traceback)

Error: Each row of output must be identified by a unique combination of keys. Keys are shared for 324 rows: * 54956, 54957 * 50442, 56730 * 51819, 52242 * 55744, 56826 * 54419, 56091 * 55761, 56827 * 50314, 52459 * 51879, 54297 * 51884, 54316 * 51839, 51840 * 52128, 52129 * 55759, 55760 * 56803, 56804 * 58280, 58281 * 58585, 58586 * 58397, 58481 * 57306, 57307 * 55868, 55869 * 71714, 71715 * 67200, 73488 * 68577, 69000 * 72502, 73584 * 71177, 72849 * 72519, 73585 * 67072, 69217 * 68637, 71055 * 68642, 71074 * 68597, 68598 * 68886, 68887 * 72517, 72518 * 73561, 73562 * 75038, 75039 * 75343, 75344 * 75155, 75239 * 74064, 74065 * 72626, 72627 * 4682, 4683 * 168, 6456 * 1545, 1968 * 5470, 6552 * 4145, 5817 * 5487, 6553 * 40, 2185 * 1605, 4023 * 1610, 4042 * 1565, 1566 * 1854, 1855 * 5485, 5486 * 6529, 6530 * 8006, 8007 * 8311, 8312 * 8123, 8207 * 7032, 7033 * 5594, 5595 * 21440, 21441 * 16926, 23214 * 18303, 18726 * 22228, 23310 * 20903, 22575 * 22245, 23311 * 16798, 18943 * 18363, 20781
12.
stop(cnd)
11.
abort(glue("Each row of output must be identified by a unique combination of keys.", "\nKeys are shared for {shared} rows:", "\n{rows}", "Do you need to create unique ID with tibble::rowid_to_column()?"))
10.
spread.data.frame(., key, value)
9.
spread(., key, value)
8.
function_list[[k]](value)
7.
withVisible(function_list[[k]](value))
6.
freduce(value, `_function_list`)
5.
`_fseq`(`_lhs`)
4.
eval(quote(`_fseq`(`_lhs`)), env, env)
3.
eval(quote(`_fseq`(`_lhs`)), env, env)
2.
withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
1.
Example %>% gather(key, value, -(2:8), -Degree_Type) %>% unite(key, key, Degree_Type) %>% spread(key, value)

I am working with an Excel file with the graduation information of students at a certain university for the past 5 years. I want to manipulate this data in order to get an output with the student ID numbers of all those who have finished a bachelor's degree but not a master's degree.

The Excel file is more or less as follows:

Student_ID | Last_Name | First_Names | Gender | Degree_Field | Degree_Level | Project_Type | Graduation_Date | Degree_Name
20120001   | Smith     | Jane Ellen  | F      | Education    | A            | Exam         | 30/06/2016      | B.A. in Secondary Education
20130002   | Yang      | Henry       | M      | Nursing      | A            | Internship   | 29/06/2018      | B.S. in Nursing
20120001   | Smith     | Jane Ellen  | F      | Education    | B            | Thesis       | 20/11/2018      | M.A. in Secondary Education

Degree levels are A for Bachelors, B for Masters, and C for Doctorate. I want to manipulate this data in two different ways. First, I want a consolidated table with only one row per Student_ID, but I want to maintain the Degree_Field, Project_Type, Graduation_Date, and Degree_Name for each Degree_Level, as follows:

Student_ID | Last_Name | First_Names | Gender | Degree_Field_A | Project_Type_A | Graduation_Date_A | Degree_Name_A               | Degree_Field_B | Project_Type_B | Graduation_Date_B | Degree_Name_B
20120001   | Smith     | Jane Ellen  | F      | Education      | Exam           | 30/06/2016        | B.A. in Secondary Education | Educacation    | Thesis         | 20/11/2018        | M.A. in Secondary Education
20130002   | Yang      | Henry       | M      | Nursing        | Internship     | 29/06/2018        | B.S. in Nursing             | NA             | NA             | NA                | NA

Note how Jame Ellen Smith has a complete record because she finished first her Bachelors and then later her Masters, but Henry Yang has NA in all of the fields related to B because he has not finished a Masters yet. Once I have the data in this format, it should be easy to get two data displays, one that counts by Degree_Field_A to give a total count of how many students have both a Bachelors and a Masters in that field, and another for how many students have a Bachelors but do not have a Masters (in other words, the B fields are NA).

EDIT

I found an answer to a similar problem, but it is does not give me the results I need, although it is close. https://stackoverflow.com/a/44958373/1709198 For a student like Jane Ellen Smith, it gives Degree_Field_1, Project_Type_1, etc. and then Degree_Field_2, Project_Field_2, etc. as expected. The problem I have is that if a student got their Bachelors from a ti

Good catch, that was my mistake, I just fixed it in the question. — Qanthelas, May 01 '19 at 00:11

score 1 · Accepted Answer · answered May 01 '19 at 04:35

1

A tidyverse option would be to first gather data in long format, combine columns using unite with Degree_Level and then spread it back to wide format.

library(tidyverse)

df %>%
  gather(key, value, -(1:4), -Degree_Level) %>%
  unite(key, key, Degree_Level) %>%
  spread(key, value)  

#  Student_ID Last_Name First_Names Gender Degree_Field_A Degree_Field_B
#1   20120001     Smith  Jane Ellen      F      Education      Education
#2   20130002      Yang       Henry      M        Nursing           <NA>

#               Degree_Name_A                Degree_Name_B  Graduation_Date_A
# B.A. in Secondary Education  M.A. in Secondary Education         30/06/2016
#             B.S. in Nursing                         <NA>         29/06/2018

# Graduation_Date_B Project_Type_A Project_Type_B
#        20/11/2018           Exam         Thesis
#              <NA>     Internship           <NA>

data

df <- structure(list(Student_ID = c("20120001", "20130002", "20120001"
), Last_Name = c("Smith", "Yang", "Smith"), First_Names = c("Jane Ellen", 
"Henry", "Jane Ellen"), Gender = c("F", "M", "F"), Degree_Field = 
c("Education", "Nursing", "Education"), Degree_Level = c("A", "A", "B"), 
Project_Type = c("Exam", "Internship", "Thesis"), 
Graduation_Date = c("30/06/2016", "29/06/2018","20/11/2018"), 
Degree_Name = c("B.A. in Secondary Education", "B.S. in Nursing", 
"M.A. in Secondary Education")), row.names = c(NA, -3L), class = "data.frame")

answered May 01 '19 at 04:35

Ronak Shah

377,200
20
156
213

Thanks! I posted an update to the question with a better data example in case other people find this post later on. – Qanthelas May 01 '19 at 19:15
...but now there's a problem, in the sample it works but in my real data it gives me an error. I have updated the original question with the error. – Qanthelas May 01 '19 at 19:32
@Qanthelas I think it's about selecting variables in `gather`, for example this would work for your updated `Example`. `Example %>% gather(key, value, -(2:6), -Degree_Level) %>% unite(key, key, Degree_Level) %>% spread(key, value) ` you get a warning message but I think it can be ignored. – Ronak Shah May 02 '19 at 02:09
thanks for the reply, and I do indeed get the results I would expect with my sample data, but with my real data I get an error. I just updated my original question to show the error I get. – Qanthelas May 02 '19 at 03:44
I don't really understand what exactly the `key` is in the `gather` and in the `unite`. I want it to use Student_ID, which is in the 2nd column, and if there are multiple rows with the same Student_ID then it should consolidate them, so I don't see what the problem is if there are duplicates, but it won't get past this error. – Qanthelas May 02 '19 at 03:52
The answer worked for 2 of your example and now there is another edit. With so many edits, I am not able to find where exactly the issue is. Also it is advised to give a complete verifiable example which represents your actual data and if you keep changing your question then it invalidates other answers which is not keeping track of your edits. – Ronak Shah May 02 '19 at 05:18

Jason Johnson · Answer 2 · 2019-05-01T20:15:30.583

I think you should be able to get your desired output simply by chaining melt and dcast the data.table way.

IDvars<-c("Student_ID","Last_Name","First_Names","Gender")
MeasureVars<-c("Degree_Field","Project_Type","Graduation_Date","Degree_Name")

DT[,melt(.SD, measure.vars = MeasureVars )][,dcast(.SD,paste(paste0(IDvars,collapse = "+"),"~","Degree_Level","+","variable"))]

A couple of notes about the code above:

I assumed your data.table was called DT but if not change it accordingly and I specified just the four measuring variables for the melt.
Executing the melt code will give you a data.table with all the IDvars, Degree_Level, a column named 'variable' by default that includes the names of the measure variables, and a column named 'value' by default that contains the values of the measure variables.
Note for the dcast formula I just used paste to avoid typing in all the IDvars separated by a +. paste0 with the collapse argument is useful here. Basically you need the IDvars added together on the LHS and Degree_Level + 'variable' on the RHS.
.SD is a special symbol in data.table that allows you to chain the temporary results you get from the melt without saving it.

Hope that helps and let me know if my explanation is clear. Good luck!

Edit: Just saw you updated with a more realistic data set. I reran my code with it and it works but you will get a warning because the classes of the measure variables are inconsistent. They will get automatically coerced to character so it should not impact things too much. Likely that is the reason you are having issues with the dplyr solution.

Thanks for the reply, but unfortunately this answer gives me the same types of problems I was having before with data.table based solutions. I get the columns that I want, and the IDvars columns have the data I would expect in them (Student_ID, Last_Name, etc.) but once I get to the new columns like A_Degree_Field I only have either a 1 or a 0. From what I have read in posts like https://stackoverflow.com/questions/12831524/can-dcast-be-used-without-an-aggregate-function/12831856 I think the problem is about the times where multiple rows have the same Student_ID. — Qanthelas, May 02 '19 at 00:34

Casting with data.table in r based on a unique ID number that can appear in various rows

2 Answers2