Merge frequency table with existing data frame based on ID

Question

What I have:

From my original observations …

video_id  user_id keyword
1         1       foo
2         1       bar
3         1       baz
4         1       yak
1         2       foo
2         2       bar
3         2       blah
4         2       yak
1         3       foo
2         3       bar
3         3       blah
4         3       yak

… I have a table with frequencies (called tab), and it is displayed in the exact format I want, e.g.

video_id  foo bar baz yak blah
1         4   0   0   0   0
2         0   4   0   0   0
3         0   0   2   0   2
4         0   0   0   4   0

I'd like to merge this table with an existing data frame (called data), based on the ID column. So, for example, it contains two other columns as well:

video_id  col1  col2
1         123   412
2         652   633
3         749   144
4         1738  1763

What I need:

I need to merge the frequency table and the existing data frame based on the video ID. Note that it is not necessarily sorted, so I can't just cbind them. This is the result I need:

video_id  col1  col2  foo bar baz yak blah
1         123   412   4   0   0   0   0
2         652   633   0   4   0   0   0
3         749   144   0   0   2   0   2
4         1738  1763  0   0   0   4   0

Now, I know I can get a data frame matrix like this:

as.data.frame.matrix(table(…))

But this matrix is missing the video_id column, which is actually displayed when I just view the table. So, how do I go about getting a data frame that still includes the video_id column—or row names, that is?

I need the video_id column to come first in the data frame, then the original columns, and then the tabular data appended, as seen in the example above.

What I've tried:

I know I can get the table's row names through rownames(table(…)), and I can get the result I want with
```
cbind(data.frame(video_id=rownames(tab)), as.data.frame.matrix(tab))
```
But this doesn't seem clean (enough) to me.
Merging directly with
```
merge(data, as.data.frame.matrix(tab))
```
gives me all the results, but the video_id column is between the tabular data and the original data, so not in the correct order.

Once you convert your table to a data.frame, you can try `merge`? — Roman Luštrik, Dec 23 '12 at 09:51
When I try `merge(data, as.data.frame.matrix(tab))`, I do get all of the results, but the ID column is in the middle – so first I get the tabular data, then the ID column, then the remainder –, and forcing `by="video_id"` in the merge returns `'by' must specify uniquely valid column(s)` as an error. — slhck, Dec 23 '12 at 09:55
You can rearrange the columns. `your.data[, c(2, 1, 3, 4)]` for example. Also, if you want to merge by two not unique columns, you `by.x` and `by.y`. — Roman Luštrik, Dec 23 '12 at 09:57
I see, but thinking about that, my first original approach looks a little cleaner than manually rearranging. I don't quite get why I'd need `by.x` and `by.y` if the column `video_id` is the same (although the table doesn't explicitly contain such a column). — slhck, Dec 23 '12 at 10:06
At this point, I will invite you to post some example data. :) See http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Roman Luštrik, Dec 23 '12 at 11:01
Alright – I added a specific example with values that you should be able to simply read in. — slhck, Dec 23 '12 at 11:15

Roman Luštrik · Accepted Answer · 2012-12-23T12:26:35.023

What I did:

data <- read.table(text = "video_id  col1  col2
1         123   412
2         652   633
3         749   144
4         1738  1763", header = TRUE)

tab <- read.table(text = "video_id  user_id keyword
1         1       foo
2         1       bar
3         1       baz
4         1       yak
1         2       foo
2         2       bar
3         2       blah
4         2       yak
1         3       foo
2         3       bar
3         3       blah
4         3       yak", header = TRUE)

tt <- aggregate(keyword ~ video_id, data = tab, FUN = table)
mrg <- merge(data, tt)
mrg

  video_id col1 col2 keyword.bar keyword.baz keyword.blah keyword.foo keyword.yak
1        1  123  412           0           0            0           3           0
2        2  652  633           3           0            0           0           0
3        3  749  144           0           1            2           0           0
4        4 1738 1763           0           0            0           0           3

Now we have "weird" column names. If you can't live with that, here's a "hack".

tmp <- data.frame(mrg$keyword)
mrg$keyword <- NULL
mrg <- cbind(mrg, tmp)

  video_id col1 col2 bar baz blah foo yak
1        1  123  412   0   0    0   3   0
2        2  652  633   3   0    0   0   0
3        3  749  144   0   1    2   0   0
4        4 1738 1763   0   0    0   0   3

EDIT

Here's another way:

tt2 <- tapply(X = tab$keyword, INDEX = tab$video_id, FUN = table, simplify = FALSE)
video_ajdi <- factor(names(tt2))
tt2 <- cbind(video_id = video_ajdi, do.call("rbind", tt2))
mrg <- merge(data, tt2)
mrg

  video_id col1 col2 bar baz blah foo yak
1        1  123  412   0   0    0   3   0
2        2  652  633   3   0    0   0   0
3        3  749  144   0   1    2   0   0
4        4 1738 1763   0   0    0   0   3

EDIT2

And another one:

tt <- aggregate(keyword ~ video_id, data = tab, FUN = table)
tt3 <- do.call("cbind", tt)
merge(data, tt3)

Sven Hohenstein · Answer 2 · 2012-12-23T12:27:31.880

Here's an approach including cbind and match:

The data:

df1 <- read.table(text="video_id  user_id keyword
1         1       foo
2         1       bar
3         1       baz
4         1       yak
1         2       foo
2         2       bar
3         2       blah
4         2       yak
1         3       foo
2         3       bar
3         3       blah
4         3       yak", header = TRUE)

tab <- table(df1[c("video_id", "keyword")])

        keyword
video_id bar baz blah foo yak
       1   0   0    0   3   0
       2   3   0    0   0   0
       3   0   1    2   0   0
       4   0   0    0   0   3


df2 <- read.table(text="video_id  col1  col2
1         123   412
2         652   633
3         749   144
4         1738  1763", header = TRUE)

Here's the solution:

cbind(df2, tab[match(df2$video_id, rownames(tab)), ])

The result:

  video_id col1 col2 bar baz blah foo yak
1        1  123  412   0   0    0   3   0
2        2  652  633   3   0    0   0   0
3        3  749  144   0   1    2   0   0
4        4 1738 1763   0   0    0   0   3

Merge frequency table with existing data frame based on ID

What I have:

What I need:

What I've tried:

2 Answers2