0

For example, I have the following dataset (my real dataset has more than 100000 rows and 70 variables):

Country   Year   Flag
Norway    2018   drop: reason1
Norway    2018   drop: reason2
Sweden    2016   drop: reason3
France    2011   drop: reason2
France    2011   drop: reason3
France    2011   drop: reason4

Firstly, I want to group Flag values by variables Country and Year, so I want to get a table like this:

Country   Year   Flag
Norway    2018   drop: reason1, drop: reason2
Sweden    2016   drop: reason3
France    2011   drop: reason2, drop: reason3, drop: reason4

Secondly, if there are more than one value in the Flag column, I want to leave only 1 with the following logic: if the drop: reason1 is present, then leave it and remove the rest. If there is no drop: reason1, but there is a drop: reason2 and a drop: reason3, then we leave only the drop: reason2.

Finally, my dataset should look like this:

Country   Year   Flag
Norway    2018   drop: reason1
Sweden    2016   drop: reason3
France    2011   drop: reason2

I would like to implement this based on the data.table or base R approach.

I would be very grateful for any help! At least for the first part of the question.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
Hilary
  • 475
  • 3
  • 10

1 Answers1

3

We can order the data by Country and Flag and then select the 1st value of Flag for each Country and Year.

This can be done in base R:

aggregate(Flag~Country+Year, df[with(df, order(Country, Flag)), ], head, 1)

#  Country Year         Flag
#1  France 2011 drop:reason2
#2  Sweden 2016 drop:reason3
#3  Norway 2018 drop:reason1

Or dplyr

library(dplyr)

df %>%
  arrange(Country, Flag) %>%
  group_by(Country, Year) %>%
  summarise(Flag = first(Flag))

and with data.table

library(data.table)
setDT(df)
df[order(Country, Flag), (Flag = first(Flag)), .(Country, Year)]

data

df <- structure(list(Country = structure(c(2L, 2L, 3L, 1L, 1L, 1L),
.Label = c("France","Norway", "Sweden"), class = "factor"), Year = c(2018L, 2018L, 
2016L, 2011L, 2011L, 2011L), Flag = structure(c(1L, 2L, 3L, 2L, 
3L, 4L), .Label = c("drop:reason1", "drop:reason2", "drop:reason3", 
"drop:reason4"), class = "factor")), class = "data.frame", row.names = c(NA, -6L))
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thanks a lot!!! And how can I get the first table from my question (with **all** possible unique variants for the flag value)? Sorry I'm new in R :(( – Hilary Jan 13 '20 at 06:47
  • 1
    @Hilary you could do `aggregate(Flag~Country+Year, df, function(x) toString(unique(x)))` – Ronak Shah Jan 13 '20 at 06:53