
In the following `data.df`, rows 2 and 3 are identical, and row 4 differs from them only in `mean`.

    iso3 dest   code year          uv       mean
1    ALB  AUT 490700 2002  14027.2433 427387.640
2    ALB  BGR 490700 2002   1215.5613  11886.494
3    ALB  BGR 490700 2002   1215.5613  11886.494
4    ALB  BGR 490700 2002   1215.5613  58069.405
5    ALB  BGR 843050 2002    677.9827   4272.176
6    ALB  BGR 851030 2002  31004.0946  32364.379
7    ALB  HRV 392329 2002   1410.0072   6970.930

Is there an easy way to automatically find these identical rows? I found this question, which seems to answer mine, but I do not understand how `duplicated()` works...

What I would like is a "simple" command where I could specify which column values should be identical across rows. Something like: `function(data.df, c(iso3, dest, code, year, uv, mean))` to find the fully identical rows, and `function(data.df, c(iso3, dest, code, year, uv))` to find the "quasi"-identical rows...

The expected result would be something like, in the first case:

2    ALB  BGR 490700 2002   1215.5613  11886.494
3    ALB  BGR 490700 2002   1215.5613  11886.494

and in the second one:

2    ALB  BGR 490700 2002   1215.5613  11886.494
3    ALB  BGR 490700 2002   1215.5613  11886.494
4    ALB  BGR 490700 2002   1215.5613  58069.405

Any idea?

TeYaP
    You should try `dplyr::distinct(data.df, iso3, dest, code, year, uv, .keep_all = TRUE)` – kath Mar 26 '19 at 12:29

4 Answers

3

We could write a function and pass it the columns we want to consider:

get_duplicated_rows <- function(df, cols) {
  df[duplicated(df[cols]) | duplicated(df[cols], fromLast = TRUE), ]
}

get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"))

# iso3 dest   code year     uv  mean
#2  ALB  BGR 490700 2002 1215.6 11886
#3  ALB  BGR 490700 2002 1215.6 11886

get_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv"))
#  iso3 dest   code year     uv  mean
#2  ALB  BGR 490700 2002 1215.6 11886
#3  ALB  BGR 490700 2002 1215.6 11886
#4  ALB  BGR 490700 2002 1215.6 58069
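As background on how this works (a minimal sketch on a plain vector, not part of the data above): `duplicated()` returns a logical vector flagging each element that has already occurred earlier, and `fromLast = TRUE` scans from the other end, so combining both directions flags every copy, including the first.

```r
x <- c("a", "b", "b", "b", "c")

duplicated(x)                    # FALSE FALSE  TRUE  TRUE FALSE
duplicated(x, fromLast = TRUE)   # FALSE  TRUE  TRUE FALSE FALSE

# OR-ing both directions flags every copy, including the first one:
duplicated(x) | duplicated(x, fromLast = TRUE)
# FALSE  TRUE  TRUE  TRUE FALSE
```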
Ronak Shah
  • And if I then want to remove the duplicates, do you know an easy way to do so once they are identified (in order to keep just one of them)? – TeYaP Mar 26 '19 at 12:34
  • 1
    @TeYaP If you want to keep only one of the duplicates, remove the `| duplicated(df[cols], fromLast = TRUE)` part from the function and it will keep only one row. – Ronak Shah Mar 26 '19 at 12:38
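A minimal sketch of that keep-one variant (hypothetical function name, assuming the same `df` as above):

```r
# Keep only the first occurrence of each combination of the chosen columns;
# duplicated() flags later copies, so negating it drops them.
remove_duplicated_rows <- function(df, cols) {
  df[!duplicated(df[cols]), ]
}

# e.g. remove_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv"))
# keeps row 2 and drops rows 3 and 4
```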
1

You can find the quasi-duplicates by checking each column for duplication separately, then keeping the rows whose row sum of duplicate flags exceeds your target value.

toread <- "    iso3 dest   code year          uv       mean
    ALB  AUT 490700 2002  14027.2433 427387.640
    ALB  BGR 490700 2002   1215.5613  11886.494
    ALB  BGR 490700 2002   1215.5613  11886.494
    ALB  BGR 490700 2002   1215.5613  58069.405
    ALB  BGR 843050 2002    677.9827   4272.176
    ALB  BGR 851030 2002  31004.0946  32364.379
    ALB  HRV 392329 2002   1410.0072   6970.930"

df <- read.table(text = toread, header = TRUE)

get_quasi_duplicated_rows <- function(df, cols, cut){
  # One logical column per feature: is this value duplicated in that column?
  result <- matrix(nrow = nrow(df), ncol = length(cols))
  colnames(result) <- cols
  for(col in cols){
    dup <- duplicated(df[col]) | duplicated(df[col], fromLast = TRUE)
    result[ , col] <- dup
  }
  # Keep rows where more than `cut` columns are duplicated
  return(df[which(rowSums(result) > cut), ])
}

get_quasi_duplicated_rows(df, c("iso3", "dest", "code", "year", "uv","mean"), 4)


 iso3 dest   code year       uv     mean
2  ALB  BGR 490700 2002 1215.561 11886.49
3  ALB  BGR 490700 2002 1215.561 11886.49
4  ALB  BGR 490700 2002 1215.561 58069.40
1

Using the `janitor` package (which provides `get_dupes()`) together with `rlang`, we can achieve this.

Solution:

library(dplyr)
library(janitor)
library(rlang)

find_dupes <- function(df, cols){
  df %>% get_dupes(!!!syms(cols))
}

Output:

1st case:

> cols
[1] "iso3" "dest" "code" "year" "uv" 

> find_dupes(df, cols)

# A tibble: 3 x 7
  iso3  dest    code  year    uv dupe_count   mean
  <fct> <fct>  <int> <int> <dbl>      <int>  <dbl>
1 ALB   BGR   490700  2002 1216.          3 11886.
2 ALB   BGR   490700  2002 1216.          3 11886.
3 ALB   BGR   490700  2002 1216.          3 58069.

2nd case:

> cols
[1] "iso3" "dest" "code" "year" "uv"   "mean"

> find_dupes(df,cols)

# A tibble: 2 x 7
  iso3  dest    code  year    uv   mean dupe_count
  <fct> <fct>  <int> <int> <dbl>  <dbl>      <int>
1 ALB   BGR   490700  2002 1216. 11886.          2
2 ALB   BGR   490700  2002 1216. 11886.          2

Note:

The `rlang::syms` function takes strings as input and turns them into symbols. Contrary to `as.name()`, it converts the strings to the native encoding beforehand. This is necessary because symbols silently remove the encoding mark of strings.

To pass a vector of column names to a dplyr function, we use `syms`.

`!!!` is used to unquote-splice the resulting list of symbols into the call.
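A minimal sketch of what `syms()` and `!!!` do (column names chosen for illustration):

```r
library(rlang)

cols <- c("iso3", "dest")
syms(cols)
# a list of symbols (iso3, dest), not strings

# Spliced into a quoting verb, group_by(!!!syms(cols)) behaves the same
# as typing group_by(iso3, dest) by hand.
```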

Rushabh Patel
1

We can use `group_by_all` and `filter` the groups having a frequency count greater than 1:

library(dplyr)
df1 %>%
   group_by_all() %>% 
   filter(n() > 1)
# A tibble: 2 x 6
# Groups:   iso3, dest, code, year, uv, mean [1]
#  iso3  dest    code  year    uv   mean
#  <chr> <chr>  <int> <int> <dbl>  <dbl>
#1 ALB   BGR   490700  2002 1216. 11886.
#2 ALB   BGR   490700  2002 1216. 11886.

If it is a subset of columns, use `group_by_at`:

df1 %>%
     group_by_at(vars(iso3, dest, code, year, uv)) %>%
     filter(n() > 1)
# A tibble: 3 x 6
# Groups:   iso3, dest, code, year, uv [1]
#  iso3  dest    code  year    uv   mean
#  <chr> <chr>  <int> <int> <dbl>  <dbl>
#1 ALB   BGR   490700  2002 1216. 11886.
#2 ALB   BGR   490700  2002 1216. 11886.
#3 ALB   BGR   490700  2002 1216. 58069.
akrun