2

Sample of 2 (made-up) example rows in df:

userid   facultyid  courseid schoolid
167       265        NA       1678  
167       71111      301      NA

Suppose that I have a couple hundred duplicate userid like in the above example. However, the vast majority of userid have different values.

How can I combine rows with duplicate userid in such a way as to stick to the column values in the 1st (of the 2) row unless the first value is NA (in which case the NA will be repopulated with whatever value came from the second row)?

In essence, drawing from the above example, my ideal output would contain:

userid   facultyid  courseid schoolid
167       265        301       1678  
poeticpersimmon
  • 179
  • 1
  • 3
  • 8

4 Answers4

4
aggregate(x = df1, by = list(df1$userid), FUN = function(x) na.omit(x)[1])[,-1]

or use dplyr library:

library(dplyr)

df1 %>%
  group_by(userid) %>%
  summarise_each(funs(first(na.omit(.))))
bergant
  • 7,122
  • 1
  • 20
  • 24
  • Thank you, @bergant ! Could you briefly explain the different arguments in your function? – poeticpersimmon Mar 13 '15 at 23:19
  • Variable x in the function (FUN) is a column (vector) of values. It is called for every column in data.frame and only for values with the same userid. It is a grouped operation - the group is defined by `by` argument. – bergant Mar 13 '15 at 23:31
1
# initialize a vector that will contain row numbers which should be erased
rows.to.erase <- c()

# loop over the rows, starting from top
for(i in 1:(nrow(dat)-1)) {
  if(dat$userid[i] == dat$userid[i+1]) {
    # loop over columns to recuperate data when a NA is present
    for(j in 2:4) {
      if(is.na(dat[i,j]))
        dat[i,j] <- dat[i+1,j]
    }
    rows.to.erase <- append(rows.to.erase, i+1)
  }
}

dat.clean <- dat[-rows.to.erase,]
dat.clean
#   userid facultyid courseid schoolid
# 1    167       265      301     1678
Dominic Comtois
  • 10,230
  • 1
  • 39
  • 61
1

Here's a different approach using ddply :

# requires the plyr package
library(plyr)

# Your example dataframe with added lines
schoolex <- data.frame(userid = c(167, 167, 200, 203, 203), facultyid = c(265, 71111, 200, 300, NA), 
                        courseid = c(NA, 301, 302, 303, 303), schoolid = c(1678, NA, 1678, NA, 1678))

schoolex_duprm <- ddply(schoolex, .(userid), summarize, facultyid2 = facultyid[!is.na(facultyid)][1], 
                               courseid2 = courseid[!is.na(courseid)][1], 
                               schoolid2 = schoolid[!is.na(schoolid)][1])
Harrison Jones
  • 2,256
  • 5
  • 27
  • 34
  • Hi Harrison, thanks so much for your answer! Is there a simpler way to replicate your function if I have nearly 1000 columns/variables, or would I have to write out a first() argument for each of them? – poeticpersimmon Mar 13 '15 at 20:28
  • @bergant has provided a solution for a large number of columns...see answer below – Harrison Jones Mar 13 '15 at 22:41
1

Here's a simple one-liner from plyr. I wrote it a bit more generally than you asked:

 a <- data.frame(x=c(1,2,3,1,2,3,1,2,3),y=c(2,3,1,1,2,3,2,3,1),
       z=c(NA,1,NA,2,NA,3,4,NA,5),zz=c(1,NA,2,NA,3,NA,4,NA,5))

 ddply(a,~x+y,summarize,z=first(z[!is.na(z)]),zz=first(zz[!is.na(zz)]))

Specifically answering the original question, if your data frame is named a, :

 ddply(a,~userid,summarize,facultyid=first(facultyid[!is.na(facultyid)]),
         courseid=first(courseid[!is.na(courseid)],
         schoolid=first(schoolid[!is.na(schoolid)])
PeterK
  • 1,185
  • 1
  • 9
  • 23
  • Hi Peter, thanks so much for your answer! Is there a simpler way to replicate your function if I have nearly 1000 columns/variables, or would I have to write out a first() argument for each of them? – poeticpersimmon Mar 13 '15 at 20:26