2

I have a sequence of numeric labels for records that can be shared by a variable number of records per label (labelsequence). I also have the records themselves, but unfortunately for some of the sequence values, all records have been lost (dataframe df). I need to identify when a numeric label from labelsequence does not appear in the label column of df, copy all records within df that are associated with the closest label value that is less than the missing value, and append these to a newly filled-in dataframe, say df2.

I am trying to accomplish this in R (a dplyr answer would be ideal), and have looked at answers to questions regarding filling in missing rows, such as Fill in missing rows in R and fill missing rows in a dataframe, and have a working solution below, was wondering if anyone has a better way of doing this.

Take , for instance, this example data:

labelsequence<-data.frame(label=c(1,2,3,4,5,6))

and

df<-data.frame(label=c(1,1,1,1,3,3,4,4,4),
           place=c('vermont','kentucky',
                   'wisconsin','wyoming','nevada',
                   'california','utah','georgia','kentucky'),
           animal=c('wolf','wolf','cougar','cougar','lamb',
                    'cougar','donkey','lamb','wolf'))

with desired result...

desired_df2<-data.frame(label=c(1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,6,6,6),
            place=c('vermont','kentucky',
                    'wisconsin','wyoming','vermont','kentucky',
                    'wisconsin','wyoming','nevada',
                    'california','utah','georgia','kentucky','utah',
                    'georgia','kentucky','utah','georgia','kentucky'),
            animal=c('wolf','wolf','cougar','cougar','wolf',
                     'wolf','cougar','cougar','lamb','cougar',
                     'donkey','lamb','wolf','donkey','lamb','wolf',
                     'donkey','lamb','wolf'))

Is there a better (be it effiency of code, flexibility, or resource efficiency) way than the following?

df2<- df %>%
full_join(expand.grid(label=unique(df$label),newlabel=labelsequence$label)) %>%
  mutate(missing = ifelse(newlabel %in% label,0,1))%>%
  filter(label<newlabel)%>%
   group_by(newlabel) %>%
  filter(label==max(label) & missing ==1) %>%
  ungroup()%>%
  mutate(label=newlabel,missing=NULL,newlabel=NULL) %>%
  bind_rows(df) %>%
  arrange(label)
Pake
  • 968
  • 9
  • 24

0 Answers0