
I have a flat file of tweets and would like to aggregate their properties by user.

e.g.

user1, hashtag1, hashtag2 
user1, hashtag3, hashtag4 
user2, hashtag5, hashtag6 
user2, hashtag7, hashtag8

Which I would like to transform into:

user1, hashtag1, hashtag2, hashtag3, hashtag4
user2, hashtag5, hashtag6, hashtag7, hashtag8 

Is there an elegant way to do this?

IRTFM
Mike Jensen
  • Added 'code' formatting to show how the "files" were entered. – IRTFM May 23 '12 at 12:42
  • What happens when there are a different number of hashtags per user? Or are there always going to be four hashtags per person? Does the ordering within columns matter? – Chase May 23 '12 at 13:23

3 Answers

3

Unless the number of hashtags per user will always be the same, I'd aggregate the results into a list. Each element of the list will be a (possibly variable-length) vector of one user's hashtags.

# Read in your example data
df <- read.table(text="user1, hashtag1, hashtag2 
user1, hashtag3, hashtag4 
user2, hashtag5, hashtag6 
user2, hashtag7, hashtag8", sep=",", header=FALSE, stringsAsFactors=FALSE)


lapply(split(df[-1], df[1]), function(X) unname(unlist(X)))
# $user1
# [1] " hashtag1"  " hashtag3"  " hashtag2 " " hashtag4 "
# 
# $user2
# [1] " hashtag5"  " hashtag7"  " hashtag6 " " hashtag8" 
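
The output above still carries the stray spaces from the input and lists each user's tags column by column rather than in their original row order. A minimal sketch of one way to address both, assuming the whitespace is not meaningful and row order should be kept (the name `tags_by_user` is only illustrative):

```r
# Same example data, with surrounding whitespace stripped while reading
df <- read.table(text = "user1, hashtag1, hashtag2
user1, hashtag3, hashtag4
user2, hashtag5, hashtag6
user2, hashtag7, hashtag8", sep = ",", header = FALSE,
                 strip.white = TRUE, stringsAsFactors = FALSE)

# Transpose each user's block so that flattening walks the rows in order
tags_by_user <- lapply(split(df[-1], df[[1]]),
                       function(X) as.vector(t(as.matrix(X))))

tags_by_user$user1
# [1] "hashtag1" "hashtag2" "hashtag3" "hashtag4"
```

The result is still a list with one character vector per user, so it copes with users who have different numbers of hashtags.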
Josh O'Brien
  • I would take this approach as well. It is unlikely that twitter users are all going to have the same number of hashtags. The list is the data structure OP is after, not data frames. – jthetzel May 23 '12 at 13:48
  • Thanks for the quick feedback! You're right about the differing number of tags per user. One question: lapply produces a 'list' and converting it to a data frame produces the error, "arguments imply differing number of rows". Any ideas on how to tackle this? Sorry, I'm quite the novice. – Mike Jensen May 23 '12 at 15:19
  • @MikeJensen -- You'll do best to leave your data in a list. A data.frame is really designed to hold tabular data, in which each column is a variable, and each row is an observation or individual. Your data don't really fit that pattern, and the error message you report is kind of trying to tell you that! – Josh O'Brien May 23 '12 at 15:27
  • Thanks for the help everyone! I wanted to aggregate these tags to look at clusters of tags for individuals. The matrix() command does a messy job of coercing the results to a df, but exported to a text editor, it is more or less feasible to clean it up. – Mike Jensen May 24 '12 at 11:52
1

You're looking for a reshape. Either the reshape command (which has painful syntax, but basically you want to go from "long" to "wide" with "user" as your id variable) or the reshape2 package with melt followed by dcast will do what you want.
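
As a rough sketch of the base `reshape()` route, assuming the data sit in a data frame `x` with columns `user`, `tag1` and `tag2` (the helper column `row` is introduced here only for illustration and is not part of the original data):

```r
# Example data in "long" form: one row per tweet
x <- data.frame(user = c("user1", "user1", "user2", "user2"),
                tag1 = c("hashtag1", "hashtag3", "hashtag5", "hashtag7"),
                tag2 = c("hashtag2", "hashtag4", "hashtag6", "hashtag8"),
                stringsAsFactors = FALSE)

# Number each user's rows 1, 2, ...; reshape() uses this as the "time" variable
x$row <- ave(seq_len(nrow(x)), x$user, FUN = seq_along)

# Long to wide: one row per user, one set of tag columns per original row
wide <- reshape(x, idvar = "user", timevar = "row", direction = "wide")
wide
```

Users with fewer rows than the maximum end up with `NA` in the trailing columns.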

Alternatively, since it seems the number of hashtags might vary, you could do it using plyr:

> # x is the example data read into a three-column data frame (whitespace stripped)
> colnames(x) <- c("user","tag1","tag2")
> 
> library(plyr)
> extract.hashtags <- function(x) {
+   x <- subset(x,select=c(-user))
+   mat <- as.matrix(x)
+   dim(mat) <- c(1,length(mat))
+   as.data.frame(mat)
+ }
> ddply(x, .(user), extract.hashtags )
   user       V1       V2       V3       V4
1 user1 hashtag1 hashtag3 hashtag2 hashtag4
2 user2 hashtag5 hashtag7 hashtag6 hashtag8
Ari B. Friedman
1

One way is to use the aggregate() function. From ?aggregate:

Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form

First, read in your data (you should do this in your question in the future to provide a reproducible example, see: How to make a great R reproducible example?):

txt <- "user1, hashtag1, hashtag2 
user1, hashtag3, hashtag4 
user2, hashtag5, hashtag6 
user2, hashtag7, hashtag8"

x <- read.delim(file = textConnection(txt), header = FALSE, sep = ",", 
        strip.white = TRUE, stringsAsFactors = FALSE)

Then, use aggregate() to split the data into subsets and convert each subset to a 1-dimensional array:

aggregate(x[-1], by = x[1], function(z)
        {
            dim(z) <- c(length(z)) # Change dimensions of z to 1-dimensional array
            z
        })
#      V1     V2.1     V2.2     V3.1     V3.2
# 1 user1 hashtag1 hashtag3 hashtag2 hashtag4
# 2 user2 hashtag5 hashtag7 hashtag6 hashtag8

Edit

This approach only works if all users have the same number of hashtags, which seems unlikely. @Josh O'Brien's answer is the better approach.
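
If a rectangular result is still wanted when the counts differ, one possible workaround is to pad the shorter vectors with `NA` before binding them together. This is only a sketch: the ragged `tags` list below is made-up illustration data, and `n_max`, `padded` and `wide` are hypothetical names.

```r
# Hypothetical ragged input: user2 has one more hashtag than user1
tags <- list(user1 = c("hashtag1", "hashtag2"),
             user2 = c("hashtag5", "hashtag6", "hashtag7"))

# Pad every vector with NA up to the length of the longest one
n_max <- max(lengths(tags))
padded <- lapply(tags, function(v) c(v, rep(NA_character_, n_max - length(v))))

# Stack into a matrix (one row per user), then convert to a data frame
wide <- data.frame(user = names(padded), do.call(rbind, padded),
                   stringsAsFactors = FALSE)
wide
```

The padded columns have no meaning of their own (the third column is not a "third variable"), so the list form is usually easier to work with for analysis.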

jthetzel