Combine multiple observations in R

Question

I have a flat file of tweets and would like to aggregate their properties by user.

e.g.

user1, hashtag1, hashtag2 
user1, hashtag3, hashtag4 
user2, hashtag5, hashtag6 
user2, hashtag7, hashtag8

Which I would like to transform into:

user1, hashtag1, hashtag2, hashtag3, hashtag4
user2, hashtag5, hashtag6, hashtag7, hashtag8

Is there an elegant way to do this?

Added 'code' formatting to show how the "files" were entered. – IRTFM, May 23 '12 at 12:42
What happens when there is a different number of hashtags per user? Or are there always going to be four hashtags per person? Does the ordering within columns matter? – Chase, May 23 '12 at 13:23

Josh O'Brien · Accepted Answer · 2012-05-23T13:43:54.520

3

Unless the number of hashtags per user will always be the same, I'd aggregate the results into a list. Each element of the list will be a (possibly variable-length) vector of one user's hashtags.

# Read in your example data
df <- read.table(text="user1, hashtag1, hashtag2 
user1, hashtag3, hashtag4 
user2, hashtag5, hashtag6 
user2, hashtag7, hashtag8", sep=",", header=FALSE, stringsAsFactors=FALSE)


lapply(split(df[-1], df[1]), function(X) unname(unlist(X)))
# $user1
# [1] " hashtag1"  " hashtag3"  " hashtag2 " " hashtag4 "
# 
# $user2
# [1] " hashtag5"  " hashtag7"  " hashtag6 " " hashtag8"
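
The output above keeps the stray spaces around each hashtag because the fields are read exactly as typed. A minimal sketch of one way to avoid that, assuming the same example data: pass strip.white = TRUE to read.table() so the whitespace is dropped at read time. The name tags is only illustrative.

```r
# Same example data, read with strip.white = TRUE so padding around fields is dropped
df <- read.table(text = "user1, hashtag1, hashtag2
user1, hashtag3, hashtag4
user2, hashtag5, hashtag6
user2, hashtag7, hashtag8",
  sep = ",", header = FALSE, strip.white = TRUE, stringsAsFactors = FALSE)

# One character vector of hashtags per user, as in the answer above
tags <- lapply(split(df[-1], df[1]), function(X) unname(unlist(X)))
tags$user1
```

For data that has already been read in, trimws() applied to each character column should give the same result.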

edited May 23 '12 at 13:43

answered May 23 '12 at 13:36


I would take this approach as well. It is unlikely that twitter users are all going to have the same number of hashtags. The list is the data structure OP is after, not data frames. – jthetzel May 23 '12 at 13:48
Thanks for the quick feedback! You're right about the differing number of tags per user. One question: lapply produces a 'list' and converting it to a data frame produces the error, "arguments imply differing number of rows". Any ideas on how to tackle this? Sorry, I'm quite the novice. – Mike Jensen May 23 '12 at 15:19
@MikeJensen -- You'll do best to leave your data in a list. A data.frame is really designed to hold tabular data, in which each column is a variable, and each row is an observation or individual. Your data don't really fit that pattern, and the error message you report is kind of trying to tell you that! – Josh O'Brien May 23 '12 at 15:27
Thanks for the help everyone! I wanted to aggregate these tags to look at clusters of tags for individuals. The matrix() command does a messy job of coercing the results to a df, but exported to a text editor, it is more or less feasible to clean it up. – Mike Jensen May 24 '12 at 11:52
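
If a rectangular object is still needed despite the advice above, two options come up often. This is a rough sketch rather than part of the original answer: the tags list below is made-up data with unequal lengths, and the names long and padded are illustrative.

```r
# Hypothetical list of hashtags with a different number of tags per user
tags <- list(user1 = c("hashtag1", "hashtag2", "hashtag3"),
             user2 = "hashtag5")

# Option 1: "long" data frame, one row per (user, hashtag) pair
long <- data.frame(user = rep(names(tags), lengths(tags)),
                   tag  = unlist(tags, use.names = FALSE),
                   stringsAsFactors = FALSE)

# Option 2: pad each vector with NA to a common length, then bind into a matrix
n <- max(lengths(tags))
padded <- do.call(rbind, lapply(tags, function(v) { length(v) <- n; v }))
padded
```

The long form is usually easier to work with for counting or clustering tags. The padded matrix can be passed to as.data.frame() if a wide table is really required.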

Ari B. Friedman · Answer 2 · 2012-05-23T13:31:11.710

You're looking for a reshape. Either the reshape command (which has painful syntax, but basically you want to go from "long" to "wide" with "user" as your id variable) or the reshape2 package with melt followed by dcast will do what you want.
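
As a rough illustration of the long-to-wide idea using only base R (no reshape2), here is a sketch that assumes the example data from the question. The column names user, tag1, tag2 and the helper column n are my own choices, not something the question specifies.

```r
df <- read.table(text = "user1, hashtag1, hashtag2
user1, hashtag3, hashtag4
user2, hashtag5, hashtag6
user2, hashtag7, hashtag8",
  sep = ",", header = FALSE, strip.white = TRUE, stringsAsFactors = FALSE,
  col.names = c("user", "tag1", "tag2"))

# Step 1: stack the two tag columns into "long" form, one row per (user, hashtag)
long <- data.frame(user = rep(df$user, times = 2),
                   tag  = c(df$tag1, df$tag2),
                   stringsAsFactors = FALSE)

# Step 2: number the hashtags within each user; this plays the role of the time variable
long$n <- ave(seq_along(long$user), long$user, FUN = seq_along)

# Step 3: go from long to wide with "user" as the id variable
wide <- reshape(long, idvar = "user", timevar = "n", direction = "wide")
wide
```

Users with fewer hashtags than the maximum should end up with NA in the trailing columns, since reshape() fills missing id/time combinations with NA.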

Alternatively, since it seems the number of hashtags might vary, you could do it using plyr:

> colnames(x) <- c("user","tag1","tag2")
> 
> library(plyr)
> extract.hashtags <- function(x) {
+   x <- subset(x,select=c(-user))
+   mat <- as.matrix(x)
+   dim(mat) <- c(1,length(mat))
+   as.data.frame(mat)
+ }
> ddply(x, .(user), extract.hashtags )
   user       V1       V2       V3       V4
1 user1 hashtag1 hashtag3 hashtag2 hashtag4
2 user2 hashtag5 hashtag7 hashtag6 hashtag8

score 1 · Answer 3 · edited May 23 '17 at 10:09

One way is to use the aggregate() function. From ?aggregate:

Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form

First, read in your data (you should do this in your question in the future to provide a reproducible example, see: How to make a great R reproducible example?):

txt <- "user1, hashtag1, hashtag2 
user1, hashtag3, hashtag4 
user2, hashtag5, hashtag6 
user2, hashtag7, hashtag8"

x <- read.delim(file = textConnection(txt), header = FALSE, sep = ",",
        strip.white = TRUE, stringsAsFactors = FALSE)

Then, use aggregate() to split the data into subsets and convert each subset to a 1-dimensional array:

aggregate(x[-1], by = x[1], function(z)
        {
            dim(z) <- c(length(z)) # Change dimensions of z to 1-dimensional array
            z
        })
#      V1     V2.1     V2.2     V3.1     V3.2
# 1 user1 hashtag1 hashtag3 hashtag2 hashtag4
# 2 user2 hashtag5 hashtag7 hashtag6 hashtag8

Edit

This approach only works if all users have the same number of hashtags, which seems unlikely. @Josh O'Brien's answer is the better approach.
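
If aggregate() is still preferred, one possible workaround is to put the data in long form first and collapse each user's hashtags into a single string, which does not depend on every user having the same number of tags. This is a sketch under that assumption; the names long and res are illustrative.

```r
x <- read.table(text = "user1, hashtag1, hashtag2
user1, hashtag3, hashtag4
user2, hashtag5, hashtag6
user2, hashtag7, hashtag8",
  sep = ",", header = FALSE, strip.white = TRUE, stringsAsFactors = FALSE)

# Long form, built row by row so hashtags keep their original order
long <- data.frame(user = rep(x$V1, each = ncol(x) - 1),
                   tag  = as.vector(t(as.matrix(x[-1]))),
                   stringsAsFactors = FALSE)

# One row per user; collapse = ", " is passed through to paste()
res <- aggregate(tag ~ user, data = long, FUN = paste, collapse = ", ")
res
```

The result has one comma-separated string per user, which is close to the layout asked for in the question, although the individual hashtags are no longer separate columns.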
