0

I'm looking to create an ID variable so I can group 2 rows together to identify that no pair of 2 rows is the same.

Currently My data frame looks like this

Column a     Column b     Column c

Tom            K             Adam
Greg           R             Adam
Troy           K             Lisa
Daniel         R             Lisa
John           K             Greg
Tom            R             Greg
Lisa           K             Adam
Greg           R             Adam

I want it to include an extra column that has unique ID variables so each pair of rows (according to column c) can be grouped together.

E.g.

Column a     Column b     Column c     Column d

Tom            K             Adam        3226
Greg           R             Adam        3226
Troy           K             Lisa        3227
Daniel         R             Lisa        3227
John           K             Greg        3228
Tom            R             Greg        3228
Tom            K             Adam        3229
Greg           R             Adam        3229

Important to note that Adam appears 4 times in Column c but each pairing of his name creates a new unique ID

Any help would be much appreciated.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • 1
    How do you get unique numbers `3226, 3227... ` etc? Can it be also `1, 2, 3..` ? looks like you need `data.table::rleid(df$columnC)` . – Ronak Shah Jun 21 '19 at 03:22
  • no problem, simply convert to factor afterwards, e.g. `factor(c(1, 2, 3), labels=c(3226, 3227, 3228))`. – jay.sf Jun 21 '19 at 04:01
  • I just made those numbers up as an example. The ID variables could be anything like `1, 2, 3` or `3226, 3227, 3228` so long as they can differentiate between the different pairs of rows. – I.Di Domenico Jun 21 '19 at 05:13
  • That code you provided Ronak works well, thank you! – I.Di Domenico Jun 21 '19 at 05:23

1 Answers1

0

We can check whether the adjacent rows elements of 'Columnc' are equal, get the cumulative sum of logical vector and add 3226

library(dplyr)
df1 %>% 
    mutate(Columnd =  3226 + cumsum(Columnc != lag(Columnc,
           default = first(Columnc))))
#   Columna Columnb Columnc Columnd
#1     Tom       K    Adam    3226
#2    Greg       R    Adam    3226
#3    Troy       K    Lisa    3227
#4  Daniel       R    Lisa    3227
#5    John       K    Greg    3228
#6     Tom       R    Greg    3228
#7    Lisa       K    Adam    3229
#8    Greg       R    Adam    3229

Or with base R using rle

df1$Columnd <-  with(rle(df1$Columnc), rep(seq_along(values), lengths))

data

df1 <- structure(list(Columna = c("Tom", "Greg", "Troy", "Daniel", "John", 
"Tom", "Lisa", "Greg"), Columnb = c("K", "R", "K", "R", "K", 
"R", "K", "R"), Columnc = c("Adam", "Adam", "Lisa", "Lisa", "Greg", 
"Greg", "Adam", "Adam")), class = "data.frame", row.names = c(NA, 
-8L))
akrun
  • 874,273
  • 37
  • 540
  • 662