2

This question has been asked but I'm looking for a more complete answer/slightly modified output.

I have a dataset with Lat and Long values in separate columns and want to create a unique ID for each unique combination of Lat and Long.

I'll borrow an example dataset from an older post asking the same question but for which I need a slightly different solution (Add ID column by group).

d <- read.table(text='LAT LONG
13.5330 -15.4180 
13.5330 -15.4180 
13.5330 -15.4180 
13.5330 -15.4180 
13.5330 -15.4170 
13.5330 -15.4170 
13.5330 -15.4170 
13.5340 -14.9350 
13.5340 -14.9350 
13.5340 -15.9170 
13.3670 -14.6190', header=TRUE)

The solution given was:

d <- transform(d, Cluster_ID = as.numeric(interaction(LAT, LONG, drop=TRUE)))

#       LAT    LONG Cluster_ID
# 1  13.533 -15.418          2
# 2  13.533 -15.418          2
# 3  13.533 -15.418          2
# 4  13.533 -15.418          2
# 5  13.533 -15.417          3
# 6  13.533 -15.417          3
# 7  13.533 -15.417          3
# 8  13.534 -14.935          4
# 9  13.534 -14.935          4
# 10 13.534 -15.917          1
# 11 13.367 -14.619          5

But how do you get the interaction command to preserve order so that the first Cluster_ID above would be 1 (full vector for last column would be 1,1,1,1,2,2,2,3,3,4,5 instead of 2,2,2,2,3,3,4,4,1,5)? It's unclear how the new factor order (converted to numeric) is determined.

I have also been trying to find equivalent way of doing this using group_by in dplyr but can't figure out how to output the tibble table as a dataframe (older solutions on SO seem to use depreciated dplyr commands).

Thanks!

user2414840
  • 721
  • 1
  • 7
  • 15

2 Answers2

2

We could use match

transform(d, Cluster_ID = match(paste0(LAT, LONG), unique(paste0(LAT, LONG))))

Or convert the 'LAT', 'LONG' to sequence and then do the interaction

transform(d, Cluster_ID = as.integer(interaction(match(LAT, 
  unique(LAT)),  match(LONG, unique(LONG)), drop=TRUE, lex.order = FALSE)))
akrun
  • 874,273
  • 37
  • 540
  • 662
1

A data.table option using .GRP

> setDT(d)[, Cluster_ID := .GRP, .(LAT, LONG)][]
       LAT    LONG Cluster_ID
 1: 13.533 -15.418          1
 2: 13.533 -15.418          1
 3: 13.533 -15.418          1
 4: 13.533 -15.418          1
 5: 13.533 -15.417          2
 6: 13.533 -15.417          2
 7: 13.533 -15.417          2
 8: 13.534 -14.935          3
 9: 13.534 -14.935          3
10: 13.534 -15.917          4
11: 13.367 -14.619          5

or rleid (thank @akrun's comment)

> setDT(d)[, Cluster_ID := rleid(LAT, LONG)][]
       LAT    LONG Cluster_ID
 1: 13.533 -15.418          1
 2: 13.533 -15.418          1
 3: 13.533 -15.418          1
 4: 13.533 -15.418          1
 5: 13.533 -15.417          2
 6: 13.533 -15.417          2
 7: 13.533 -15.417          2
 8: 13.534 -14.935          3
 9: 13.534 -14.935          3
10: 13.534 -15.917          4
11: 13.367 -14.619          5

Or a base R option using ave + cumsum

transform(
  d,
  Cluster_ID = cumsum(
    ave(1:nrow(d),
      LAT,
      LONG,
      FUN = seq_along
    ) == 1
  )
)
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81