1

What I have:

From my original observations …

video_id  user_id keyword
1         1       foo
2         1       bar
3         1       baz
4         1       yak
1         2       foo
2         2       bar
3         2       blah
4         2       yak
1         3       foo
2         3       bar
3         3       blah
4         3       yak

… I have a table with frequencies (called tab), and it is displayed in the exact format I want, e.g.

video_id  foo bar baz yak blah
1         4   0   0   0   0
2         0   4   0   0   0
3         0   0   2   0   2
4         0   0   0   4   0

I'd like to merge this table with an existing data frame (called data), based on the ID column. So, for example, it contains two other columns as well:

video_id  col1  col2
1         123   412
2         652   633
3         749   144
4         1738  1763

What I need:

I need to merge the frequency table and the existing data frame based on the video ID. Note that it is not necessarily sorted, so I can't just cbind them. This is the result I need:

video_id  col1  col2  foo bar baz yak blah
1         123   412   4   0   0   0   0
2         652   633   0   4   0   0   0
3         749   144   0   0   2   0   2
4         1738  1763  0   0   0   4   0

Now, I know I can get a data frame matrix like this:

as.data.frame.matrix(table(…))

But this matrix is missing the video_id column, which is actually displayed when I just view the table. So, how do I go about getting a data frame that still includes the video_id column—or row names, that is?

I need the video_id column to come first in the data frame, then the original columns, and then the tabular data appended, as seen in the example above.

What I've tried:

  • I know I can get the table's row names through rownames(table(…)), and I can get the result I want with

    cbind(data.frame(video_id=rownames(tab)), as.data.frame.matrix(tab))
    

    But this doesn't seem clean (enough) to me.

  • Merging directly with

    merge(data, as.data.frame.matrix(tab))
    

    gives me all the results, but the video_id column is between the tabular data and the original data, so not in the correct order.

slhck
  • 36,575
  • 28
  • 148
  • 201
  • Once you convert your table to a data.frame, you can try `merge`? – Roman Luštrik Dec 23 '12 at 09:51
  • When I try `merge(data, as.data.frame.matrix(tab))`, I do get all of the results, but the ID column is in the middle – so first I get the tabular data, then the ID column, then the remainder –, and forcing `by="video_id"` in the merge returns `'by' must specify uniquely valid column(s)` as an error. – slhck Dec 23 '12 at 09:55
  • You can rearrange the columns. `your.data[, c(2, 1, 3, 4)]` for example. Also, if you want to merge by two not unique columns, you `by.x` and `by.y`. – Roman Luštrik Dec 23 '12 at 09:57
  • I see, but thinking about that, my first original approach looks a little cleaner than manually rearranging. I don't quite get why I'd need `by.x` and `by.y` if the column `video_id` is the same (although the table doesn't explicitly contain such a column). – slhck Dec 23 '12 at 10:06
  • At this point, I will invite you to post some example data. :) See http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Roman Luštrik Dec 23 '12 at 11:01
  • Alright – I added a specific example with values that you should be able to simply read in. – slhck Dec 23 '12 at 11:15

2 Answers2

1

What I did:

data <- read.table(text = "video_id  col1  col2
1         123   412
2         652   633
3         749   144
4         1738  1763", header = TRUE)

tab <- read.table(text = "video_id  user_id keyword
1         1       foo
2         1       bar
3         1       baz
4         1       yak
1         2       foo
2         2       bar
3         2       blah
4         2       yak
1         3       foo
2         3       bar
3         3       blah
4         3       yak", header = TRUE)

tt <- aggregate(keyword ~ video_id, data = tab, FUN = table)
mrg <- merge(data, tt)
mrg

  video_id col1 col2 keyword.bar keyword.baz keyword.blah keyword.foo keyword.yak
1        1  123  412           0           0            0           3           0
2        2  652  633           3           0            0           0           0
3        3  749  144           0           1            2           0           0
4        4 1738 1763           0           0            0           0           3

Now we have "weird" column names. If you can't live with that, here's a "hack".

tmp <- data.frame(mrg$keyword)
mrg$keyword <- NULL
mrg <- cbind(mrg, tmp)

  video_id col1 col2 bar baz blah foo yak
1        1  123  412   0   0    0   3   0
2        2  652  633   3   0    0   0   0
3        3  749  144   0   1    2   0   0
4        4 1738 1763   0   0    0   0   3

EDIT

Here's another way:

tt2 <- tapply(X = tab$keyword, INDEX = tab$video_id, FUN = table, simplify = FALSE)
video_ajdi <- factor(names(tt2))
tt2 <- cbind(video_id = video_ajdi, do.call("rbind", tt2))
mrg <- merge(data, tt2)
mrg

  video_id col1 col2 bar baz blah foo yak
1        1  123  412   0   0    0   3   0
2        2  652  633   3   0    0   0   0
3        3  749  144   0   1    2   0   0
4        4 1738 1763   0   0    0   0   3

EDIT2

And another one:

tt <- aggregate(keyword ~ video_id, data = tab, FUN = table)
tt3 <- do.call("cbind", tt)
merge(data, tt3)
Roman Luštrik
  • 69,533
  • 24
  • 154
  • 197
1

Here's an approach including cbind and match:

The data:

df1 <- read.table(text="video_id  user_id keyword
1         1       foo
2         1       bar
3         1       baz
4         1       yak
1         2       foo
2         2       bar
3         2       blah
4         2       yak
1         3       foo
2         3       bar
3         3       blah
4         3       yak", header = TRUE)

tab <- table(df1[c("video_id", "keyword")])

        keyword
video_id bar baz blah foo yak
       1   0   0    0   3   0
       2   3   0    0   0   0
       3   0   1    2   0   0
       4   0   0    0   0   3


df2 <- read.table(text="video_id  col1  col2
1         123   412
2         652   633
3         749   144
4         1738  1763", header = TRUE)

Here's the solution:

cbind(df2, tab[match(df2$video_id, rownames(tab)), ])

The result:

  video_id col1 col2 bar baz blah foo yak
1        1  123  412   0   0    0   3   0
2        2  652  633   3   0    0   0   0
3        3  749  144   0   1    2   0   0
4        4 1738 1763   0   0    0   0   3
Sven Hohenstein
  • 80,497
  • 17
  • 145
  • 168