Aggregating by unique identifier and concatenating related values into a string

Question

I have a need that I imagine could be satisfied by aggregate or reshape, but I can't quite figure out.

I have a list of names (brand), and accompanying ID number (id). This data is in long form, so names can have multiple ID's. I'd like to de-dupicate by the name (brand) and concatenate the multiple possible id's into a string separated by a comment.

For example:

brand            id 
RadioShack       2308
Rag & Bone       4466
Ragu             1830
Ragu             4518
Ralph Lauren     1638
Ralph Lauren     2719
Ralph Lauren     2720
Ralph Lauren     2721
Ralph Lauren     2722

should become:

RadioShack       2308
Rag & Bone       4466
Ragu             1830,4518
Ralph Lauren     1638,2719,2720,2721,2722

How would I accomplish this?

In your desired output, "Ragu" should not appear twice, right? — Frank, May 16 '13 at 20:27
For each brand, do you want the result to be a `list` of ids, or a single string pated together? — Ricardo Saporta, May 16 '13 at 21:05

score 76 · Accepted Answer · edited Jul 02 '16 at 16:55

76

Let's call your data.frame DF

> aggregate(id ~ brand, data = DF, c)
         brand                           id
1   RadioShack                         2308
2   Rag & Bone                         4466
3         Ragu                   1830, 4518
4 Ralph Lauren 1638, 2719, 2720, 2721, 2722

Another alternative using aggregate is:

result <- aggregate(id ~ brand, data = DF, paste, collapse = ",")

This produces the same result and now id is not a list anymore. Thanks to @Frank comment. To see the class of each column try:

> sapply(result, class)
      brand          id 
   "factor" "character"

As mentioned by @DavidArenburg in the comments, another alternative is using the toString function:

aggregate(id ~ brand, data = DF, toString)

edited Jul 02 '16 at 16:55

Jaap

81,064
34
182
193

answered May 16 '13 at 20:12

Jilber Urbina

58,147
10
114
138

Very strangely, R isn't letting me output this data.frame...I think because the `id` is a list. How do I export to a CSV? – roody May 16 '13 at 20:18
1

@Jilber the class of your "id" output column should be "character" (not list), I think, since the OP wants to export the data frame. – Frank May 16 '13 at 20:23
2

+1. I like this solution as-is for my own use. I think you might want to change it to `aggregate(id~brand,paste,collapse=",",data=df)` or similar for the OP's question, though. – Frank May 16 '13 at 20:31
1

Worth mentioning `aggregate(id ~ brand, DF, toString)` too probably – David Arenburg Apr 13 '16 at 16:16
What happen if I want do the same but only for specific value?. I mean, in your example only with `id==2308`. – Henry Navarro Jan 24 '18 at 14:57
I used a similar approach, but it adds an extra space after comma. Is there a specific reason for that? – Pravellika Feb 26 '18 at 07:01
Is there a way to aggregate while also only counting unique values? I have tried something like this but it does not work: aggregate(id ~ brand, data = DF, set) – Oamar Kanji Jan 03 '22 at 07:40
This seems to work: aggregate(id ~ brand, data = DF, unique) – Oamar Kanji Jan 03 '22 at 07:50

score 44 · Answer 2 · edited Feb 11 '16 at 16:54

44

A nice clean one line in data.table

library(data.table)
setDT(DF)

TWO OPTIONS:

results as a list

DF[ , .(id = list(id)), by = brand]
          brand                       id
1:   RadioShack                     2308
2:   Rag & Bone                     4466
3:         Ragu                1830,4518
4: Ralph Lauren 1638,2719,2720,2721,2722
>

results as a string

DF[ , .(id = paste(id, collapse=",")), by = brand]
          brand                       id
1:   RadioShack                     2308
2:   Rag & Bone                     4466
3:         Ragu                1830,4518
4: Ralph Lauren 1638,2719,2720,2721,2722

Note

Even though the two results appear the same (that is when you print them, they look identical), they are in fact very different and allow for different functionality.

Namely, using the list option (the first one) allows you to then perform functions on the orignal ids.

The latter will allow you to display the information more easily (including exporting to CSV or excel), but to operate on the id's will require splicing them back.

edited Feb 11 '16 at 16:54

MichaelChirico

33,841
14
113
198

answered May 16 '13 at 21:09

Ricardo Saporta

54,400
17
144
178

Thanks! list(list(id)) works, but why? – LauriK Feb 15 '15 at 19:58
1

@LauriK, `list(list(id))` is the same as the first option in the answer. The only difference being the name in the list being explicitly given or not. The outter `list`, data.table will translate into meaning "column" (as all data.tables and data.frames are really just a list of columns). The inner `list` indicates that the value in each cell will be a list – Ricardo Saporta Feb 16 '15 at 05:29
2

Thanks! I was using `paste(id, sep = ",")` and ending up with unaggregated results. It turns out, I needed to use `paste(id, collapse = ",")` instead. – Mark Egge Sep 14 '18 at 01:38

score 32 · Answer 3 · answered May 22 '15 at 12:39

32

Or using dplyr:

library(dplyr)
DF %>%
  group_by(brand) %>%
  summarise(id = paste(id, collapse = ","))

Where DF is the name of your data.frame.

answered May 22 '15 at 12:39

Sam Firke

21,571
9
87
105

This will make the summarized columns a comma separated string type.. Not a vector... – sriharsha KB Sep 02 '16 at 08:43
2

No, just as in the other answers the result is a data.frame, with a vector called `id` of class `character` containing the strings of `id`, separated by commas as desired in the question. `is.vector(DF$id)` returns `TRUE`. If you have a question about this approach, consider posting it as a new question. – Sam Firke Sep 02 '16 at 10:30
library(dplyr) DFNew<-DF%>% group_by(brand) %>% summarize(id=unique(list(id))) This worked for me – sriharsha KB Sep 06 '16 at 06:07
@SamFirke, could you please advise how I can use your solution for only pasting unique values of `id` while there are duplicates values of `id` corresponding to a `brand`? Thanks! – Alex May 27 '21 at 07:17
2

Try adding `unique`, making it `paste(unique(id), collapse = ",")` - if that doesn't do it, it merits posting a new question. – Sam Firke May 27 '21 at 13:44

Frank · Answer 4 · 2013-05-16T20:26:53.650

Here's the information in base R:

myby <- by(df$id,df$brand,function(x)paste(x,collapse=","))

The formatting of "by" objects is weird. You can take data.frame(id=c(myby)) and the brands will become rownames:

#                                    id
# RadioShack                       2308
# Rag & Bone                       4466
# Ragu                        1830,4518
# Ralph Lauren 1638,2719,2720,2721,2722

Alternately, if you load the data.table package, this will work:

dt <- data.table(df)
dt[,paste(id,collapse=","),by=brand]
#           brand                       V1
# 1:   RadioShack                     2308
# 2:   Rag & Bone                     4466
# 3:         Ragu                1830,4518
# 4: Ralph Lauren 1638,2719,2720,2721,2722

Aggregating by unique identifier and concatenating related values into a string

4 Answers4

TWO OPTIONS:

Note

Linked

Related