I have a statistical routine that does not handle exact duplicate rows (ignoring the ID), because they produce zero distances.

So I first detect and remove the duplicates, apply my routine, and then merge the set-aside records back in.

For simplicity, assume I use row names as the ID/key.

I have found the following way to achieve this in base R:

data <- data.frame(x=c(1,1,1,2,2,3),y=c(1,1,1,4,4,3))

# check duplicates and get their ID -- cf. https://stackoverflow.com/questions/12495345/find-indices-of-duplicated-rows
dup1 <- duplicated(data)
dupID <- rownames(data)[dup1 | duplicated(data[nrow(data):1, ])[nrow(data):1]]

# keep only the records that have duplicates, to avoid running the following steps on all rows
datadup <- data[dupID,]

# "hash" row
rowhash <- apply(datadup, 1, paste, collapse="_")

idmaps <- split(rownames(datadup),rowhash)
idmaptable <- do.call("rbind",lapply(idmaps,function(vec)data.frame(mappedid=vec[1],otherids=vec[-1],stringsAsFactors = FALSE)))

This gives me what I want, i.e. the deduplicated data (easy) and a mapping table.

> (data <- data[!dup1,])
  x y
1 1 1
4 2 4
6 3 3
> idmaptable
      mappedid otherids
1_1.1        1        2
1_1.2        1        3
2_4          4        5

Is there a simpler or more efficient method (data.table / dplyr accepted)? Any alternatives to propose?

www
Eric Lecoutre

3 Answers

With data.table...

library(data.table)
setDT(data)

# tag groups of dupes
data[, g := .GRP, by=x:y]

# do whatever analysis
f = function(DT) Reduce(`+`, DT)
resDT = unique(data, by="g")[, res := f(.SD), .SDcols = x:y][]

# "update join" the results back to the main table if needed
data[resDT, on=.(g), res := i.res ]

The OP skipped a central part of the example (usage of the deduped data), so I just made up f.

Frank
  • Thanks! Impressive how concise it is. I am accepting this one, as I intend to rewrite part of my code to use `data.table`. What if I want another way to specify the "by" columns? I will have a global ID column (to be set as key), which I will first have to exclude from the process, as my duplicate mapping obviously has to work without this ID column. – Eric Lecoutre Aug 04 '17 at 10:18
  • @Eric Sure. You can do `cols=setdiff(names(data), "ID")` and then pass the cols like `by=cols` and `.SDcols=cols`. The various options for passing these args are covered in `?data.table`. There are a lot of them. I also have a list in my notes http://franknarf1.github.io/r-tutorial/_book/tables.html#program-tables under "Specifying columns" – Frank Aug 04 '17 at 13:25
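
The suggestion in the comment above could be sketched roughly as follows. This is only an illustration: the `ID` column name and the letter IDs are assumptions, and `f` is the same made-up placeholder analysis as in the answer.

```r
library(data.table)

# Hypothetical data with an explicit ID column (the name "ID" is an assumption)
data <- data.table(ID = letters[1:6],
                   x = c(1, 1, 1, 2, 2, 3),
                   y = c(1, 1, 1, 4, 4, 3))

# Every column except the ID defines what counts as a duplicate
cols <- setdiff(names(data), "ID")

# Tag groups of duplicates, passing the columns as a character vector
data[, g := .GRP, by = cols]

# Placeholder analysis on one row per group (f is made up, as in the answer)
f <- function(DT) Reduce(`+`, DT)
resDT <- unique(data, by = "g")[, res := f(.SD), .SDcols = cols]

# Update join: copy each group's result back to all of its rows
data[resDT, on = .(g), res := i.res]
```

Because `cols` is computed before `g` is added, the group tag itself never ends up among the columns used to detect duplicates.
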
A solution using tidyverse. I usually don't store information in the row names, so I created ID and ID2 to store information. But of course, you can change that based on your needs.

library(tidyverse)

idmaptable <- data %>%
  rowid_to_column() %>%
  group_by(x, y) %>%
  filter(n() > 1) %>%
  unite(ID, x, y) %>%
  mutate(ID2 = 1:n()) %>%
  group_by(ID) %>%
  mutate(ID_type = ifelse(row_number() == 1, "mappedid", "otherids")) %>%
  spread(ID_type, rowid) %>%
  fill(mappedid) %>%
  drop_na(otherids) %>%
  mutate(ID2 = 1:n())

idmaptable
# A tibble: 3 x 4
# Groups:   ID [2]
     ID   ID2 mappedid otherids
  <chr> <int>    <int>    <int>
1   1_1     1        1        2
2   1_1     2        1        3
3   2_4     1        4        5
www
  • Thanks. Nice as an exercise! I will accept the data.table answer, as I intend to use that package in the end. – Eric Lecoutre Aug 04 '17 at 10:11
  • Note that manipulations are tricky, in the sense there are many steps and the logic is not so easy to read/decompose/understand! – Eric Lecoutre Aug 04 '17 at 10:19
  • Thanks for the comments. Tricky or not depends on how the users feel it. In my solution, each step is a function that does one thing and one thing only. If you know what each function represents, you can "read it out loud". To me, sometimes those concise approaches are too "compact". – www Aug 04 '17 at 10:32
Some improvements to your base R solution,

df <- data[duplicated(data)|duplicated(data, fromLast = TRUE),]

do.call(rbind, lapply(
  split(rownames(df), do.call(paste, c(df, sep = '_'))),
  function(i) data.frame(mapped = i[1],
                         others = i[-1],
                         stringsAsFactors = FALSE)
))

Which gives,

      mapped others
1_1.1      1      2
1_1.2      1      3
2_4        4      5

And of course,

unique(data)

  x y
1 1 1
4 2 4
6 3 3
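
For the "merge back" step mentioned in the question, one possible base R sketch (not part of the original answer) uses `match()` on the pasted key to find each row's retained representative. Here `rowSums()` is only a stand-in for the real analysis, and the names `key`, `first`, `idmap` and `res` are illustrative. It assumes the pasted keys are unambiguous, i.e. that no value contains the `_` separator.

```r
data <- data.frame(x = c(1, 1, 1, 2, 2, 3), y = c(1, 1, 1, 4, 4, 3))

# One string key per row; match(key, key) returns the first row with the same key
key <- do.call(paste, c(data, sep = "_"))
first <- match(key, key)

# Map every row name to the row name of its retained representative
idmap <- data.frame(id = rownames(data),
                    mappedid = rownames(data)[first],
                    stringsAsFactors = FALSE)

# Placeholder analysis on the deduplicated rows (rowSums is a stand-in)
dedup <- unique(data)
res <- rowSums(dedup)  # named by the row names of dedup

# Look up each original row's result through its representative
data$res <- unname(res[idmap$mappedid])
```

Unlike the mapping table above, `idmap` has one line per original row, including rows that map to themselves, which makes the final lookup a single indexing step.
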
Sotos