How to get a new column in a data frame which has only elements which appear in the set more than once in R

Question

Data:

DB1 <- data.frame(orderItemID  = c(1,2,3,4,5,6,7,8,9,10), 
orderDate = c("1.1.12","1.1.12","1.1.12","1.1.12","1.1.12", "1.1.12","1.1.12","1.1.12","2.1.12","2.1.12"),  
itemID = c(2,3,2,5,12,4,2,3,1,5),  
size = factor(c("l", "s", "xl", "xs","m", "s", "l", "m", "xxs", "xxl")), 
color = factor(c("blue", "black", "blue", "orange", "red", "navy", "red", "purple", "white", "black")),  
customerID = c(33, 15, 1, 33, 14, 55, 33, 78, 94, 23))

Expected output:

selection_order = c("yes","no","no","no","no","no","yes","no","no","no")

In the data set I have items with the same size or the same color, the same ItemID. Every registered user has his unique customerID.

I want to identify when a user orders products (more then one) with the same itemID (in different sizes or colors = for example the user with the customerID = 33 orders the same item (ItemID = 2) in two different colors) and mark it in a new column named like "selection order"(for example) with "Yes" or "No". It should NOT show me a "Yes", when he or she orders an item with an other ID. I just want to get a "yes", when there is an order (at the same day or in the past) with the same ID more then once - regardless from other ID´s (other products).

I've tried a lot already,but nothing works. There are a few thousand different userID's and ItemId's-so I can´t subset for every Id. I tried it with the duplicated function - but it's not leading to a satisfactory solution:

The problem is, that if the same person orders more then one object (customerID is duplicated then) and another person(customerId) orders an item with the same Id (itemId is duplicated then) it gives me a "yes": and it must be a "No" in this case. (in the example the duplicate function will give me an "yes" at orderItemID 4 instead of an "no")

Please make a simple reproducible example: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example - we're not that awesome that we're going to waste time trying to create a data set based on your wordy description when you could supply code. — Spacedman, Oct 28 '14 at 14:38
No example, so just a pointer: the `?ave` function might be helpful or maybe `?table`. Check those out and make your question reproducible with expected output. — talat, Oct 28 '14 at 14:42
Tried my best do put it into a form you can tolerate (I hope) — Jarvis, Oct 28 '14 at 21:50
This is not what I wanted. I did not want to transfer the data to another table or sort the data. The records are to remain in their order. It should only be introduced an additional column that a "right" / "wrong"; "yes" / "no" .. outputs. It should show "right" "yes" ... if the (respective) order is an item that a the same person ordered in several colors and/or sizes (ie the same Item ID from the same person more than once) was ordered — Jarvis, Oct 29 '14 at 18:53
Ok, I've edited my answer, if this isn't what you looking for, then I give up on this — David Arenburg, Oct 29 '14 at 19:23
Thanks till now David - now I (think) get what´s the communication problem: it was my mistake (of course). Now it´s gonna show me "yes" everytime a person orders more than one item (sometimes of the same itemID) but it also shows me when he or she orders an item with an other ID, too. I just want to get an "yes", when there is an order (at the same day or in the past) with the same ID more then once - regardless from other ID´s (other products). I hope you have not lost patience with me already :) — Jarvis, Oct 29 '14 at 20:34
Can you provide what the output should be given the data frame you provided? — n8sty, Oct 30 '14 at 14:06
Output: "Yes","No","No","No","No","No", "Yes", "No","No","No": Only "Yes", when one costumerID orders the same itemID more then once-ALSO "No", when one customerID orders different items (with diffent ItemIDs) — Jarvis, Oct 30 '14 at 22:06
This endless wording doesn't help. Just provide the actual values that `selection_order` column should have for the provided data set — David Arenburg, Oct 31 '14 at 10:37

David Arenburg · Answer 1 · 2014-11-01T19:25:44.350

I think I understand what is your desired output now, try

library(data.table)
setDT(DB1)[, selection_order := .N > 1, by = list(customerID, itemID)]
DB1
#     orderItemID orderDate itemID size  color customerID selection_order
#  1:           1    1.1.12      2    l   blue         33            TRUE
#  2:           2    1.1.12      3    s  black         15           FALSE
#  3:           3    1.1.12      2   xl   blue          1           FALSE
#  4:           4    1.1.12      5   xs orange         33           FALSE
#  5:           5    1.1.12     12    m    red         14           FALSE
#  6:           6    1.1.12      4    s   navy         55           FALSE
#  7:           7    1.1.12      2    l    red         33            TRUE
#  8:           8    1.1.12      3    m purple         78           FALSE
#  9:           9    2.1.12      1  xxs  white         94           FALSE
# 10:          10    2.1.12      5  xxl  black         23           FALSE

In order to convert back to a data.frame, use DB1 <- as.data.frame(DB1) (for older versions) or setDF(DB1) for the lates data.table version.

You can do it (less efficiently) with base R too

transform(DB1, selection_order = ave(itemID, list(customerID, itemID), FUN = function(x) length(x) > 1))

Or using the dplyr package

library(dplyr)
DB1 %>%
  group_by(customerID, itemID) %>%
  mutate(selection_order = n() > 1)

The only way we will understand what you actually want is providing the actual desired output (not in words, but with actual data set) for the provided data (similar to how you provided your data set) instead of keep posting duplicate questions when you keep providing confusing information — David Arenburg, Oct 31 '14 at 10:35
Yes-that´s it! But is there another possibility without turning the data frame into a data.table?(or how can I turn it back into a data frame afterwards again? David, you´re my hero for today :) Thanks a lottle — Jarvis, Oct 31 '14 at 16:18

score 0 · Answer 2 · answered Oct 30 '14 at 14:35

The following code will append a new column selection.order to your data frame if the row represents a duplicate (customerID, itemID) tuple.

# First merge together the table to itself
m<- merge(x=DB1,y=DB1,by=c("customerID","itemID"))

# Now find duplicate instances of orderItemID, note this is assumed to be UNIQUE
m$selection.order<-sapply(m$orderItemID.x,function(X) sum(m$orderItemID.x==X)) > 1
m <- m[,c("orderItemID.x","selection.order")]

# Merge the two together
DB1<- merge(DB1, unique(m), by.x="orderItemID",by.y="orderItemID.x",all.x=TRUE,all.y=FALSE)

score 0 · Answer 3 · answered Oct 30 '14 at 16:01

0

If you just want the subset, as you say in the title, then do this:

DB1[duplicated(DB1[c("itemID", "customerID")]),]

If you want the column, then:

f <- interaction(DB1$itemID, DB1$customerID)
DB1$multiple <- table(f)[f] > 1L

Note that is also easy to get the actual count by simplifying the last line above.

answered Oct 30 '14 at 16:01

Michael Lawrence

1,031
5
6

will not work-shows also yes, when the ID is orderd by another user – Jarvis Oct 30 '14 at 22:07
The problem with this is now, that if the same person orders more then one object (customerID is dublicated then) and another person(customerID) orders an item with the same ID (itemID is dublicated then) it gives me a yes: and it must be a "No" – Jarvis Oct 30 '14 at 23:10

How to get a new column in a data frame which has only elements which appear in the set more than once in R

3 Answers3

Linked

Related