how delete duplicated values based on a value in another column in r

Question

My data looks like this:

HH_ID   INDUSTRY       FREQUENCY
1002    NURSE           2
1002    DOCTOR          1
1003    NOT APPLICABLE  3
1004    ENGINEER        1
1004    CLERK           1
1004    NURSE           1

In one dataset df1 and in another data set it looks like this

HH_ID   INDUSTRY       AGE
1002    NURSE           26
1002    NURSE           25
1002    DOCTOR          34
1003    NOT APPLICABLE  40
1003    NOT APPLICABLE  28
1003    NOT APPLICABLE  23
1004    ENGINEER        35
1004    CLERK           40
1004    NURSE           24

The other data set with age is called df2 I want a data set that looks like this:

HH_ID    INDUSTRY      FREQUENCY
1002     NURSE          2
1003     NOT APPLICABLE 3
1004     CLERK          1

In other words, I want to create another dataset df3 that gives me the max frequency of industry for each HH_ID and if this is not possible because there is no maximumum value of frequency any industry associated to a HH_ID like in the case of 1004 I it to select the INDUSTRY for a HH_ID on the basis of the age of the HH_ID member based on the other dataset df2 in R. I have tried data.table package but didnt work. Please help

Possible duplicate of [How to select the maximum value in each group in R](http://stackoverflow.com/questions/24558328/how-to-select-the-maximum-value-in-each-group-in-r) — ytk, Jun 23 '16 at 03:42

bgoldst · Answer 1 · 2016-06-23T03:54:54.110

0

Here's a data.table solution that only depends on df2 (defined as a data.table, named dt2), i.e. does not require df1:

dt2[,.SD[,.(FREQUENCY=.N,MAXAGE=max(AGE)),INDUSTRY][order(FREQUENCY,MAXAGE,decreasing=T)[1L]],HH_ID];
##    HH_ID       INDUSTRY FREQUENCY MAXAGE
## 1:  1002          NURSE         2     26
## 2:  1003 NOT APPLICABLE         3     40
## 3:  1004          CLERK         1     40

You can remove the MAXAGE column afterward if you want.

Data

dt2 <- data.table(HH_ID=c(1002L,1002L,1002L,1003L,1003L,1003L,1004L,1004L,1004L),INDUSTRY=c(
'NURSE','NURSE','DOCTOR','NOT APPLICABLE','NOT APPLICABLE','NOT APPLICABLE','ENGINEER',
'CLERK','NURSE'),AGE=c(26L,25L,34L,40L,28L,23L,35L,40L,24L));

edited Jun 23 '16 at 03:54

answered Jun 23 '16 at 03:48

bgoldst

34,190
6
38
64

there are datapoints in data where NURSE age is less but frequency is high – sid satam Jun 23 '16 at 04:48
also i had to create a seperate data set to calculate frequency – sid satam Jun 23 '16 at 04:55

akrun · Accepted Answer · 2016-06-23T05:15:42.243

As it is tagged with dplyr, we can use dplyr methods. Using the second dataset ('df2'), we group by 'HH_ID', and 'INDUSTRY', get the frequency ('n()') and maximum 'AGE', use arrange to order based on variables 'FREQUENCY', 'MAXAGE', grouped by 'HH_ID', get the last observation.

 library(dplyr)
 df2 %>%
    group_by(HH_ID, INDUSTRY) %>% 
    mutate(FREQUENCY = n(), 
           MAXAGE = max(AGE)) %>% 
    arrange(FREQUENCY, MAXAGE) %>%
    group_by(HH_ID) %>% 
    slice(n()) %>%
    select(-AGE)

#  HH_ID       INDUSTRY FREQUENCY MAXAGE
#  <int>          <chr>     <int>  <int>
#1  1002          NURSE         2     26
#2  1003 NOT APPLICABLE         3     40
#3  1004          CLERK         1     40

I want a solution which uses both the datasets. – sid satam Jun 23 '16 at 06:53 — sid satam, Jun 23 '16 at 06:53

how delete duplicated values based on a value in another column in r

2 Answers2