
I need to embed a condition in a remove-duplicates function. I am working with a large student database from South Africa, a highly multilingual country. Last week you guys gave me the code to remove duplicates caused by retakes, but I now realise that my language exam data shows some students offering more than 2 different languages. The source data, simplified, looks like this:

STUDID   MATSUBJ     SCORE
101      AFRIKAANSB   1
101      AFRIKAANSB   4
102      ENGLISHB     2
102      ISIZULUB     7
102      ENGLISHB     5

The result file I need is:

STUDID   MATSUBJ     SCORE  flagextra
101      AFRIKAANSB   4
102      ENGLISHB     5
102      ISIZULUB     7     1

I need to flag the extra language so that I can see which languages they are and make a new category for them.

YOLO
CharlotteM
  • So the extra language is the one that occurs just once? – YOLO Dec 31 '18 at 11:20
  • Can you show some effort solving this problem? This is very similar to [your previous question](https://stackoverflow.com/questions/53964950/is-there-an-r-function-for-dropping-duplicates-of-index-variable-based-on-lowest) which has answers. – pogibas Dec 31 '18 at 11:25
  • @PoGibas This second question adds the complication of a condition to the earlier one about duplication. I have been using the answers to my first question, but hit a problem with the real data which requires this extra condition. – CharlotteM Dec 31 '18 at 14:46

2 Answers


A two-stage procedure works better for me as a newbie to R:

1- Remove the duplicates caused by subject retakes:

library(dplyr)

LANGSEC <- LANGSEC %>%
  group_by(STUDID, MATRICSUBJ) %>%
  top_n(1, SUBJSCORE) %>%   # keep the highest score per student and subject
  ungroup()

2- Then flag one of the two subjects causing the remaining duplicates:

LANGSEC$flagextra <- as.integer(duplicated(LANGSEC$STUDID))

Then filter for this third language and make a new file:

LANG3<-LANGSEC%>% filter(flagextra==1)

Then remove these from the other file:

LANG2 <- LANGSEC %>% filter(flagextra != 1)
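
As a check, the two stages can be run end to end on the sample data from the question. This is only a sketch under a few assumptions: it uses the question's simplified column names (STUDID, MATSUBJ, SCORE) rather than the real ones, the intermediate name `best` is made up for illustration, and `filter(SCORE == max(SCORE))` stands in for `top_n()` (both keep ties).

```r
library(dplyr)

# Sample data from the question (simplified column names)
LANGSEC <- data.frame(
  STUDID  = c(101L, 101L, 102L, 102L, 102L),
  MATSUBJ = c("AFRIKAANSB", "AFRIKAANSB", "ENGLISHB", "ISIZULUB", "ENGLISHB"),
  SCORE   = c(1L, 4L, 2L, 7L, 5L),
  stringsAsFactors = FALSE
)

# Stage 1: keep the best attempt per student and subject (ties are kept)
best <- LANGSEC %>%
  group_by(STUDID, MATSUBJ) %>%
  filter(SCORE == max(SCORE)) %>%
  ungroup()

# Stage 2: a student who still appears more than once has an extra language;
# duplicated() marks every row after that student's first remaining row
best$flagextra <- as.integer(duplicated(best$STUDID))

LANG3 <- best %>% filter(flagextra == 1)  # extra languages only
LANG2 <- best %>% filter(flagextra != 1)  # one language per student
```

Note that `duplicated()` flags whichever subject comes later in row order. With the sample data the remaining rows for student 102 are ISIZULUB then ENGLISHB, so ENGLISHB is the one that ends up in `LANG3`. If a particular language should count as the main one, sort the rows accordingly before stage 2.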
Archeologist
CharlotteM

Maybe this helps:

library(tidyverse)
df1 %>% 
   group_by(STUDID, MATSUBJ) %>% 
   summarise(SCORE = max(SCORE), 
             flagextra = as.integer(!sum(duplicated(MATSUBJ))))
# A tibble: 3 x 4
# Groups:   STUDID [?]
#  STUDID MATSUBJ    SCORE flagextra
#   <int> <chr>      <dbl>     <int>
#1    101 AFRIKAANSB     4         0
#2    102 ENGLISHB       5         0
#3    102 ISIZULUB       7         1

Or with base R

i1 <- !(duplicated(df1[1:2])|duplicated(df1[1:2], fromLast = TRUE))
transform(aggregate(SCORE ~ ., df1, max), 
          flagextra = as.integer(MATSUBJ %in% df1$MATSUBJ[i1]))
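
One thing to watch with the base R version: the `%in%` test matches on the subject name alone, so once several students share a subject, every row for that subject is flagged if any one student took it only once. A possible variant, sketched here, counts attempts per student and subject instead. Student 103 is made-up data added purely to show the difference, and the helper column `n` and the names `df2` and `best` are hypothetical.

```r
# Sample data from the question plus a hypothetical student 103
# who retook ISIZULUB
df2 <- data.frame(
  STUDID  = c(101L, 101L, 102L, 102L, 102L, 103L, 103L),
  MATSUBJ = c("AFRIKAANSB", "AFRIKAANSB", "ENGLISHB", "ISIZULUB",
              "ENGLISHB", "ISIZULUB", "ISIZULUB"),
  SCORE   = c(1L, 4L, 2L, 7L, 5L, 3L, 6L),
  stringsAsFactors = FALSE
)

# Number of attempts for each student/subject pair, repeated on every row
df2$n <- ave(df2$SCORE, df2$STUDID, df2$MATSUBJ, FUN = length)

# Best score per pair; n is constant within a pair, so max() leaves it unchanged
best <- aggregate(cbind(SCORE, n) ~ STUDID + MATSUBJ, df2, max)

# Flag subjects that a student sat exactly once
best$flagextra <- as.integer(best$n == 1)
best$n <- NULL
best
```

Here only student 102's ISIZULUB row is flagged, while student 103's ISIZULUB row is not. This keeps the same rule as the answer (a subject with a single attempt is treated as the extra one), which may or may not hold for the real data.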

data

df1 <- structure(list(
  STUDID = c(101L, 101L, 102L, 102L, 102L),
  MATSUBJ = c("AFRIKAANSB", "AFRIKAANSB", "ENGLISHB", "ISIZULUB", "ENGLISHB"),
  SCORE = c(1L, 4L, 2L, 7L, 5L)
), class = "data.frame", row.names = c(NA, -5L))
akrun
  • I am still getting errors when I try this on the real data, as below (NB Lang2 = MATSUBJ, L2score = SCORE) – CharlotteM Dec 31 '18 at 14:31
  • @CharlotteM Without the error messages, it is not clear what the issue is – akrun Dec 31 '18 at 14:32
  • LANG2 %>% group_by (STUDID,Lang2) %>% summarise(L2score=max(L2score),flagextra-as.integer(!sum(duplicated(Lang2)))) Error in summarise_impl(.data, dots) : Evaluation error: ‘max’ not meaningful for factors. – CharlotteM Dec 31 '18 at 14:53
  • i1<-!(duplicated (LANG2[1:2]|duplicated (LANG2[1:2],fromLast=TRUE))transform (aggregate (L2score~.,LANG2,max),flagextra=as.integer(Lang2 %>% LANG2$Lang2 [i1])) Error: unexpected symbol in "i1<-!(duplicated (LANG2[1:2]|duplicated (LANG2[1:2],fromLast=TRUE))transform" – CharlotteM Dec 31 '18 at 14:54
  • @CharlotteM The error is pretty clear. You have a `factor` column. Based on the input shown, I assumed it was `numeric`. You may need to convert it to `numeric` first, i.e. `LANG2$L2score <- as.numeric(as.character(LANG2$L2score))` – akrun Dec 31 '18 at 14:56
  • It is showing some unexpected symbol. Perhaps you have submitted the code with the R console `>` symbol as well – akrun Dec 31 '18 at 17:50
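
The factor conversion suggested in the comments goes through `as.character()` for a reason, which a small sketch can show. The score values below are hypothetical, not taken from the real data.

```r
# Hypothetical score column that was read in as a factor
L2score <- factor(c("2", "7", "5", "10"))

# max(L2score) would fail with: 'max' not meaningful for factors

# as.numeric() on a factor returns the internal level codes, not the scores
codes <- as.numeric(L2score)

# Converting to character first recovers the actual values
scores <- as.numeric(as.character(L2score))

max(scores)
```

The levels are sorted as text ("10", "2", "5", "7"), so the codes bear no relation to the scores; converting directly with `as.numeric()` would silently give wrong maxima rather than an error.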