1

I have a data frame that contains a large amount of duplicate personal-ids by personnel working at different firms.

I would like to use duplicated for the personal-ids for every different quarter within every different firm.

Example of what my data looks like

Basically, I would like R to run duplicated only for the ids within quarter 1 for firm 1, then for quarter 2 for firm 1, then quarter 3 for firm 1 etc. until the codes for firm 1 are done, and then move on to firm 2 where it runs duplicated again for quarter 1 for firm 2 etc.

Example of the desired result

My main issue is: If person A shows up in multiple different quarters or in multiple firms, the first time his/her id shows up within every quarter/firm should NOT be considered a duplicate value.

Any input would be immensely appreciated and I will make sure to be swift with whatever clarifications may be needed. Thank you.

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
Chris Cha
  • 11
  • 2

1 Answers1

0

This is easily done with any package that facilitates convenient grouping, for instance, "data.table" or "dplyr".

Examples:

library(data.table)
as.data.table(mydf)[, duplicates := duplicated(id), .(firm)][]
#     firm quarter id duplicates
#  1:    1       1  A      FALSE
#  2:    1       2  B      FALSE
#  3:    1       3  A       TRUE
#  4:    1       4  C      FALSE
#  5:    1       5  B       TRUE
#  6:    2       1  A      FALSE
#  7:    2       1  B      FALSE
#  8:    2       2  B       TRUE
#  9:    3       1  D      FALSE
# 10:    3       2  E      FALSE
# 11:    3       2  F      FALSE
# 12:    3       3  E       TRUE
# 13:    4       1  A      FALSE
# 14:    4       1  C      FALSE

or

library(tidyverse)
mydf %>% 
  group_by(firm) %>% 
  mutate(duplicates = duplicated(id))
# # A tibble: 14 x 4
# # Groups:   firm [4]
#     firm quarter id    duplicates
#    <dbl>   <dbl> <chr> <lgl>     
#  1  1.00    1.00 A     F         
#  2  1.00    2.00 B     F         
#  3  1.00    3.00 A     T         
#  4  1.00    4.00 C     F         
#  5  1.00    5.00 B     T         
#  6  2.00    1.00 A     F         
#  7  2.00    1.00 B     F         
#  8  2.00    2.00 B     T         
#  9  3.00    1.00 D     F         
# 10  3.00    2.00 E     F         
# 11  3.00    2.00 F     F         
# 12  3.00    3.00 E     T         
# 13  4.00    1.00 A     F         
# 14  4.00    1.00 C     F 

or, in base R:

within(mydf, {
  duplicates = ave(id, firm, FUN = duplicated)
})

Sample data:

mydf <- structure(list(firm = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 
    4), quarter = c(1, 2, 3, 4, 5, 1, 1, 2, 1, 2, 2, 3, 1, 1), id = c("A", 
    "B", "A", "C", "B", "A", "B", "B", "D", "E", "F", "E", "A", "C"
    )), .Names = c("firm", "quarter", "id"), row.names = c(NA, 14L
    ), class = "data.frame")
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485