0

In R markdown through R Studio (R v. 4.0.3), I'm looking for a better solution to combining similarly structured dataframes while keeping all rows and matching entries on a key. Piping full_join() into a filter into a bind_rows() directly wasn't working, possibly because of the error message:

Error: Can't combine ..1$term_code 'character> and ..2$term_code '<integer.

I have 23 dataframes (let's call these "semester data") of data I'm looking to combine into a single dataframe (intended to be a single dataset of individuals outcomes from semester-to-semester).

  • Each semester dataframe is roughly 3000-4000 observations (individuals) with 45-47 variables of relevant data. A simplified example of a semester (or term) dataframe is shown below.

Simplified example of a "semester" dataframe:

id ACT_math course_code section_code term_code grade term_GPA
0001 23 101 001 FA12 3.45 3.8
0002 28 201 003 FA12 3.2 3.4

Individuals will show up in multiple semester dataframes as they progress through the program (taking course 101 in the fall and course 102 in the spring).

  • I want to use the dplyr full_join() to match these individuals on an ID key.
  • Using the suffix argument, I hope to keep track of which semester and course a set of data (grade, term_GPA, etc) for an individual comes from.
  • There's some data (ACT score, gender, state residency, etc) that is the stable for an individual across semester dataframes. Ideally I could take the first input and drop the rest, but if I had to clean this afterwards, that's fine.

I started by defining an object programatic_database using the first semester of data SP11. To cut down on the duplication of stable data for an individual, I selected the relevant columns that I wanted to join.

programmatic_database <- programmatic_database %>%
  full_join(select(fa12, id, course_code, section_code, grade, term_gpa), by = "id", copy = TRUE, suffix = c(".sp11", ".fa12"), keep = FALSE, name = "id")

However, every semester new students join the program. I would like to add these entries to the bottom of the growing programmatic_database.

  • I'm also looking to use rbind() or bind_rows() to add these individuals to the bottom of the programmatic_database, along with their relevant data.
  • After full_join(), I'm filtering out the entries that have already been added horizontally to the dataframe, then piping the remaining entries into bind_rows()
programmatic_database <- fa12[!which(fa12$id %in% programmatic_database),] %>% dplyr::bind_rows(programmatic_database, fa12)

Concatenated example of what my code is producing after several iterations:

id ACT_math course_code section_code section_code.db section_code.db.db term_code grade.sp11 grade.fa12 grade.sp13 grade.sp15 term_GPA.sp11 term_GPA.fa12 term_GPA.sp15
0001 23 102 001 001 001 FA12 3.45 3.8 3.0 - 3.8 3.7 -
0002 28 201 003 003 003 FA12 3.2 3.4 3.0 - 3.8 3.7 -
1020 28 201 003 003 003 FA12 3.2 3.4 - - 3.8 3.7 -
6783 30 101 - - - SP15 - - - 3.8 - - 4.0

where I have successfully added horizontally for students 0001 and 0002 for outcomes in subsequent courses in subsequent semesters. I have also managed to add vertically, like with student 6783, leaving blanks for previous semesters before they enrolled but still adding the relevant columns.

Questions:

  1. Is there a way to pipe full_join() into a filter() into a bind_rows() without running into these errors?

rbind number of columns do not match OR Error: Can't combine ..1$term_code 'character> and ..2$term_code '<integer.

  1. Is there a easy way to keep certain columns and only add the suffix ".fa12" to certain columns? As you can see, the .db is piling up.
  2. Is there any way to automate this? Loops aren't my strong suit, but I'm sure there's a better-looking code than doing each of the 23 joins/binds by hand.

Thank you for assistance!

Current code for simplicity:

#reproducible example
fa11 <- structure(list(id = c("1001", "1002", "1003", 
"1013"), act6_05_composite = c(33L, 26L, 27L, 25L), course_code = c("101", 
"101", "101", "101"), term_code = c("FA11", "FA11", "FA11", "FA11"
), section_code = c(1L, 1L, 1L, 1L), grade = c(4, 0, 0, 2.5
), repeat_status_flag = c(NA, "PR", NA, NA), class_code = c(1L, 
1L, 1L, 1L), cum_atmpt_credits_prior = c(16, 0, 0, 0), cum_completed_credits_prior = c(0L, 
0L, 0L, 0L), cum_passed_credits_prior = c(16, 0, 0, 0), cum_gpa_prior = c(0, 
0, 0, 0), cum_atmpt_credits_during = c(29, 15, 18, 15), cum_completed_credits_during = c(13L, 
1L, 10L, 15L), cum_passed_credits_during = c(29, 1, 14, 15), 
    term_gpa = c(3.9615, 0.2333, 2.3214, 2.9666), row.names = c(NA, 4L
), class = "data.frame")

sp12 <- structure(list(id = c("1007", "1013", "1355", 
"2779", "2302"), act6_05_composite = c(24L, 26L, 25L, 24L, 
24L), course_code = c(101L, 102L, 101L, 101L, 101L
    ), term_code = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_), section_code = c(1L, 1L, 1L, 1L, 1L), grade = c(2, 
    2.5, 2, 1.5, 3.5), repeat_status_flag = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    ), class_code = c(2L, 2L, 1L, 2L, 2L), cum_atmpt_credits_prior = c(44, 
    43, 12, 43, 30), cum_completed_credits_prior = c(41L, 43L, 
    12L, 43L, 12L), cum_passed_credits_prior = c(41, 43, 12, 
    43, 30), cum_gpa_prior = c(3.3125, 3.186, 3.5416, 3.1785, 
    3.8636), cum_atmpt_credits_during = c(56, 59, 25, 64, 43), 
    cum_completed_credits_during = c(53L, 56L, 25L, 56L, 25L), 
    cum_passed_credits_during = c(53, 59, 25, 64, 43), term_gpa = c(2.8333, 
    3.423, 3.1153, 2.1923, 3.6153), row.names = c(NA, 
5L), class = "data.frame")

# make object from fall 2011 semester dataframe
programmatic_database <- fa11

# join the spring 2012 semester dataframe by id using select variables and attaching relevant suffix
programmatic_database <- programmatic_database %>%
  full_join(select(sp12, id, course_code, section_code, grade, term_gpa), by = "id", copy = TRUE, suffix = c(".fa11", ".sp12"), keep = FALSE, name = "id")

#view results of join, force integer type on certain variables as needed (see error above)

#filter the joined entries from fall 2012 database, then bind the remaining entries to the bottom of the growing dataset
programmatic_database <- sp12[!which(sp12$id %in% programmatic_database),] %>% dplyr::bind_rows(programmatic_database, sp12)
  • Can you give some examples of what the source data looks like and what output you're seeking? Please check here for some suggestions on how to do that: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Jon Spring Mar 13 '21 at 04:02
  • Your error is cause by the column "term_code" being an integer in one data frame and a character string the other. Ensure that the matching columns are the data types between data frames and the problem will go away. – Dave2e Mar 13 '21 at 05:15
  • 1
    @JonSpring thank you for that example, it's my first time posting and I wasn't sure how to generate a reproducible example. I have updated the code at the end to include the sanitized output of dput() on two of the semester dataframes in question. – Elizabeth Day Mar 16 '21 at 19:38
  • @Dave2e this mismatch seems to occur at random through the 23 dataframes. Is there a global way to ensure that these are integer strings without checking the structure of each dataframe individually? – Elizabeth Day Mar 16 '21 at 19:39
  • I am assuming the columns are suppose to integers but due to missing values and others errors they are being converted to characters. I would define the column types in the `read.csv` function or immediately redefine the column with the `as.integer()` function. There is no harm in redefining the column to the same type. – Dave2e Mar 16 '21 at 21:52

1 Answers1

0

It would be possible to use bind_rows here if you make the column types consistent between tables. For instance, you could make a function to re-type any particular columns that aren't consistent in your original data. (That might also be something you could fix upstream as you read it in.)

library(dplyr)  
set_column_types <- function(df) {
  df %>%
    mutate(term_code = as.character(term_code),
           course_code = as.character(course_code))
}
                    
bind_rows(
  fa11 %>% set_column_types(),
  sp12 %>% set_column_types() %>% mutate(term_code = "SP12")
) 

This will stack your data into a relatively "long" format, like below. You may want to then reshape it depending on what kind of subsequent calculations you want to do.

    id act6_05_composite course_code term_code section_code grade repeat_status_flag class_code cum_atmpt_credits_prior cum_completed_credits_prior cum_passed_credits_prior cum_gpa_prior cum_atmpt_credits_during cum_completed_credits_during cum_passed_credits_during term_gpa
1 1001                33         101      FA11            1   4.0               <NA>          1                      16                           0                       16        0.0000                       29                           13                        29   3.9615
2 1002                26         101      FA11            1   0.0                 PR          1                       0                           0                        0        0.0000                       15                            1                         1   0.2333
3 1003                27         101      FA11            1   0.0               <NA>          1                       0                           0                        0        0.0000                       18                           10                        14   2.3214
4 1013                25         101      FA11            1   2.5               <NA>          1                       0                           0                        0        0.0000                       15                           15                        15   2.9666
5 1007                24         101      SP12            1   2.0               <NA>          2                      44                          41                       41        3.3125                       56                           53                        53   2.8333
6 1013                26         102      SP12            1   2.5               <NA>          2                      43                          43                       43        3.1860                       59                           56                        59   3.4230
7 1355                25         101      SP12            1   2.0               <NA>          1                      12                          12                       12        3.5416                       25                           25                        25   3.1153
8 2779                24         101      SP12            1   1.5               <NA>          2                      43                          43                       43        3.1785                       64                           56                        64   2.1923
9 2302                24         101      SP12            1   3.5               <NA>          2                      30                          12                       30        3.8636                       43                           25                        43   3.6153
Jon Spring
  • 55,165
  • 4
  • 35
  • 53