0

I have a dataframe that contains 3 columns. One of the columns (items) includes long strings (some include special characters such as semicolon ; for simplicity it is written here as Tx) that might repeat. I would like to find all the unique values of the specific column. For each unique value to create a nickname in the following format: OV1, OV2,... OVn. I would like to keep this table and to call it 'transformation table'. In the next step I would like to go to the source table and to recode each value with the value in transformation table.

Here is an example for the source table:

    items  sequenceID        eventID SIZE  
1:   T1      41595370             1    1
2:   T2      41595371             1    1
3:   T3      41595282             1    1
4:   T3      41595282             2    1
5:   T4      41595373             1    1
6:   T5      41595368             1    1
7:   T1      41595379             1    1

The result table would be:

     items     sequenceID      eventID SIZE  
1:   OV1       41595370             1    1
2:   OV2       41595371             1    1
3:   OV3       41595282             1    1
4:   OV3       41595282             2    1
5:   OV4       41595373             1    1
6:   OV5       41595368             1    1
7:   OV1       41595379             1    1

The transformation table would be:

1:   T1    OV1
2:   T2    OV2
3:   T3    OV3
4:   T4    OV4
5:   T5    OV5

The source data is in data6 (dataframe). I used the following orders:

u1 <- unique(data6$items)
u1 <- data.frame(u1)
ov <- c(paste("ov",1:nrow(u1),sep=""))

I would now to replace all the unique items column that resides in u1 with OV values and to create this as transformation table. Afterwards to replace all the values in data6$items.

Thanks a lot for your help.

Jaap
  • 81,064
  • 34
  • 182
  • 193
Avi
  • 2,247
  • 4
  • 30
  • 52

2 Answers2

1

You can create the transformation table with:

u1 <- as.data.frame(unique(data6$items))
names(u1) <- "items"
u1$newitem <- paste0("OV", substr(u1$items,2,4))

this gives:

> u1
  items newitem
1    T1     OV1
2    T2     OV2
3    T3     OV3
4    T4     OV4
5    T5     OV5

You can now replace the relevant items in the source table with:

data6$items = u1$newitem[match(data6$items,u1$items)]

this gives:

> data6
  items sequenceID eventID SIZE
1   OV1   41595370       1    1
2   OV2   41595371       1    1
3   OV3   41595282       1    1
4   OV3   41595282       2    1
5   OV4   41595373       1    1
6   OV5   41595368       1    1
7   OV1   41595379       1    1

If you only want to change the T with OV, you could also just do:

data6$items <- gsub("T", "OV", data6$items)
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • Thanks a lot. But how can I replace the relevant items values in the source table? – Avi Aug 21 '15 at 05:41
  • Thanks. However, the item content is not so simple as I showed here, It doesn't contain only Tx letters it contains long string that might contain even semicolon (;). I would like to change every same value to the same value in format OVx (i.e. OV1, OV2...) so if the same string appears in items in different rows it will get the same OVi. – Avi Aug 21 '15 at 07:11
  • @Avi Could you include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) in question with those long strings and how you want them to be? – Jaap Aug 21 '15 at 07:13
  • Sure here is an example for the item of the first line (should be considered as OV1): BA31C1CC63E5043483FAE25F085E25E5 INSERT and here is an example for the second line (the whole value should be OV2): 9D7F7E3AD871B6393B434427B6B8232C SELECT;314CD91DCA8849C64DCEACBA2E3B65B7 SELECT – Avi Aug 21 '15 at 11:07
  • @Avi please edit your question and include the necessary information there; it's a bit hard from that comment to guess what you want – Jaap Aug 26 '15 at 06:50
1

It looks like your "item" can be seen as a factor variable which needs new labels. If so you can avoid the transformation table and recode the "item" within your source dataframe:

data6$items = factor(data6$items, labels=paste0("OV", 1:length(unique(data6$items))))
MarkusN
  • 3,051
  • 1
  • 18
  • 26
  • Thanks a lot. It seems the command works. However, if I would like to change the OV (new items values) to the following format: OV1, OV2 etc, What do I have to change in the command? and in addition I get the following warning: - why? - Warning message: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, : duplicated levels in factors are deprecated – Avi Aug 21 '15 at 06:58
  • I get the following after running the command: items sequenceID eventID SIZE 1: OV4CC 41595370 1 1 2: OVFCD 41595371 1 1 3: OV6A6 41595282 1 1 4: OV6A6 41595282 2 1 5: OV49A 41595373 1 1 6: OV990 41595368 1 1 – Avi Aug 21 '15 at 07:00
  • I changed the code, now the levels of the newly creared variable are OV1..OVn. I don't get any warning message using your data, maybe you have to define variable as.character() first. – MarkusN Aug 21 '15 at 08:59