Given a data frame representing messages like this:

df <- structure(list(message.id = c(123L, 456L), user.id = c(999L, 888L), 
      message.date = structure(c(1310950467, 1311119810), class = c("POSIXct", 
      "POSIXt"), tzone = "")), .Names = c("message.id", "user.id", 
      "message.date"), row.names = c(NA, -2L), class = "data.frame")

head(df)
message.id   user.id    message.date         
123         999       2011-07-17 17:54:27
456         888       2011-07-19 16:56:50

How would you plot the daily average number of messages per user, assuming that some users have a lot of messages and others very few (e.g. a Pareto distribution)?

Thanks.

amh
  • How are you calculating average? Is it a function of max(message.date) - min(message.date) or is it just # of days where a message occurs (ie length(unique(message.date)))? – screechOwl Apr 22 '12 at 15:05
  • @amh: Is your question just "How can I plot the daily average number of messages per user?" (if it is, then Tyler's and Sacha's answer concisely demonstrate this)? Or, is your question soliciting opinions about how to visually represent this data? I'm unsure why the statement "... assuming that some users would have a lot of messages and others very few (e.g. pareto distribution)" is relevant information if your goal is simply to plot daily user averages. – Jubbles Apr 22 '12 at 16:50
  • @screechOwl: good question, the average should be based on max(message.date) - min(message.date) – amh Apr 22 '12 at 19:18
  • See also: http://stackoverflow.com/questions/10007877/calculating-hourly-averages-from-a-multi-year-timeseries – Paul Hiemstra Apr 22 '12 at 21:48

3 Answers

3

Your example is too small to really work with, so I simulated a larger data frame with the same structure:

set.seed(1)
start <- strptime("2012-01-01 00:00:00",format="%Y-%m-%d %H:%M:%S")
end <- strptime("2012-03-01 00:00:00",format="%Y-%m-%d %H:%M:%S")

df <- data.frame(
  message.id = 1:1000,
  user.id = sample(1:10,1000,TRUE,prob=1:10),
  message.date = seq(start,end,length=1000))

First, we need to extract the dates as Date objects (instead of POSIXt):

df$date <- as.Date(df$message.date)

Then I think we can use plyr to compute the average number of messages per user per day as follows:

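library("plyr")
# For each user, count messages on every calendar day in the overall range
# (days without messages count as 0), then average those daily counts
df2 <- ddply(df, .(user.id), summarize,
  AvPerDay = mean(sapply(seq(min(df$date), max(df$date), by="day"),
                         function(x) sum(date==x))))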

The new data frame df2 gives me:

   user.id  AvPerDay
1        1 0.3278689
2        2 0.6229508
3        3 0.9836066
4        4 1.1311475
5        5 1.3442623
6        6 1.8524590
7        7 1.8032787
8        8 2.8032787
9        9 2.5081967
10      10 3.0163934
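
Since the mean of the daily counts over every calendar day equals the total number of messages divided by the number of days, the same kind of figure can probably be had without plyr. Below is a minimal base R sketch under a few assumptions of mine: the data are regenerated with an explicit `tz = "UTC"` so that `as.Date()` is unambiguous, the names `n.days` and `av.per.day` are made up for illustration, and the exact numbers may differ from the table above depending on the R version's random sampler.

```r
# Rebuild simulated data so the sketch is self-contained
# (tz = "UTC" is an assumption, not part of the original simulation)
set.seed(1)
start <- as.POSIXct("2012-01-01 00:00:00", tz = "UTC")
end   <- as.POSIXct("2012-03-01 00:00:00", tz = "UTC")
df <- data.frame(
  message.id   = 1:1000,
  user.id      = sample(1:10, 1000, TRUE, prob = 1:10),
  message.date = seq(start, end, length.out = 1000))
df$date <- as.Date(df$message.date)

# Number of calendar days covered, counting both end points
n.days <- as.numeric(max(df$date) - min(df$date)) + 1

# Total messages per user divided by the number of days
counts <- table(df$user.id)
av.per.day <- data.frame(
  user.id  = as.integer(names(counts)),
  AvPerDay = as.numeric(counts) / n.days)
av.per.day
```

The `+ 1` counts both the first and the last day, which matches the `seq(..., by="day")` used above; drop it if you would rather divide by the raw difference `max(date) - min(date)`.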

To plot it you could just make a barplot:

barplot(df2$AvPerDay, names.arg = df2$user.id)
Sacha Epskamp
  • Thanks. It seems that this approach calculates the mean based on the number of days when that user posted something. However, the average should be based on max(message.date) - min(message.date). – amh Apr 22 '12 at 19:19
1

Sacha's is better but I had just finished when I saw his answer. Here's a possible base approach:

#Make my own data
set.seed(15)
df <- data.frame(messageid= sample(1:1000, 1000), user.id = 
    rep(901:925, each=40), message.date = sample(seq(Sys.time(), 
    length.out = 10000, by = "hours"), 1000, replace=TRUE))

#Make a date column (format() gives the calendar date in the local time zone)
df$date <- format(df$message.date, "%Y-%m-%d")

#split on user id
pidLIST <- split(df, df[, 'user.id'])
#sum and get an average by date
df2 <- data.frame(user.id=as.factor(names(pidLIST)), 
    aveMESS = sapply(seq_along(pidLIST), 
    function(i) mean(aggregate(user.id~date, pidLIST[[i]], length)[, 2])))

plot(df2)

As you can tell, I don't often work with dates.
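
One thing to watch: aggregating only the rows that exist averages over the days on which a user actually posted, not over the whole date span. Here is a small sketch of the difference on hypothetical toy data (the `toy` data frame and the names `active.mean` and `span.mean` are made up for illustration):

```r
# Hypothetical toy data: two users over a four-day span
toy <- data.frame(
  user.id = c("A", "A", "A", "B", "B"),
  date    = as.Date(c("2012-01-01", "2012-01-01", "2012-01-03",
                      "2012-01-02", "2012-01-04")))

# Every calendar day in the observed span, including days without messages
all.days <- seq(min(toy$date), max(toy$date), by = "day")

# Users x days table of counts; the factor levels force zero-count days in
counts <- table(toy$user.id,
                factor(as.character(toy$date), levels = as.character(all.days)))

# Mean over active days only (what aggregating the observed rows gives)
active.mean <- apply(counts, 1, function(x) mean(x[x > 0]))
# Mean over all days in the span
span.mean <- rowMeans(counts)

data.frame(user.id = rownames(counts), active.mean, span.mean)
```

User A posts three messages on two of the four days, so the active-day mean is 1.5 while the span mean is 0.75; for user B the two values are 1 and 0.5.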

PS: A minimal reproducible example is most helpful when it's large enough to work with. Both Sacha and I had to create our own data sets.

Tyler Rinker
0

Taking a different approach, I tried this plot: a boxplot for each day showing the distribution of user-message counts, and a line connecting the mean number of messages per user. Here's the target plot:

[Figure: distribution and mean of user messages per day]

I start by generating data using the method by @Sacha Epskamp. I generate a large dataset in order to have something for the intended plot.

library("ggplot2")
library("lubridate")


# This code is from Sacha Epskamp
# http://stackoverflow.com/a/10269840/1290420

# Generate a data set
set.seed(1)
start <- strptime("2012-01-05 00:00:00",
                  format="%Y-%m-%d %H:%M:%S")
end <- strptime("2012-03-05 00:00:00",
                format="%Y-%m-%d %H:%M:%S")

df <- data.frame(message.id = 1:10000,
                 user.id = sample(1:30,10000,
                                 TRUE,
                                 prob=1:30),
                 message.date = seq(start,
                                   end,
                                   length=10000)
                 )

Then I struggle to wrangle the dataframe into a shape suitable for the plot. I am sure that plyr gurus would be able to vastly improve this.

# Clean up the data frame and add a column 
# with combined day-user
df$day <- yday(df$message.date)
df <- df[ df$day!=65, c(2,4) ]
df$day.user <- paste(df$day, df$user.id, sep="-")

# Copy into new data frame with counts for each
# day-user combination
df2 <- aggregate(df, 
                 by=list(df$day, 
                         df$day.user), 
                 FUN="length"
                 )
df2 <- df2[,c(1,2,3)]
names(df2) <- c("day", "user", "count")
df2$user <- gsub(".+-(.+)", "\\1", df2$user)
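
For what it's worth, the counting step can probably be shortened with the formula interface of `aggregate()`. This is only a sketch under some assumptions of mine: the data are regenerated with `tz = "UTC"`, the day of year comes from base R's `POSIXlt` instead of lubridate, and `user` ends up as an integer column rather than a character one.

```r
# Regenerate the simulated data so the sketch is self-contained
# (tz = "UTC" is an assumption made here for reproducibility)
set.seed(1)
start <- as.POSIXct("2012-01-05 00:00:00", tz = "UTC")
end   <- as.POSIXct("2012-03-05 00:00:00", tz = "UTC")
df <- data.frame(message.id   = 1:10000,
                 user.id      = sample(1:30, 10000, TRUE, prob = 1:30),
                 message.date = seq(start, end, length.out = 10000))

# Day of year without lubridate: POSIXlt's yday is zero-based, so add 1
df$day <- as.POSIXlt(df$message.date)$yday + 1
df <- df[df$day != 65, ]   # drop the final day, which holds a single message

# One row per observed day-user pair, with the number of messages as count
df2 <- aggregate(message.id ~ day + user.id, data = df, FUN = length)
names(df2) <- c("day", "user", "count")
head(df2)
```

As in the version above, day-user pairs with no messages do not appear at all, so the boxplots describe only the users who were active on a given day.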

Then drawing the plot is the easy part:

p <- ggplot(df2,
            aes(x=day,
                y=count))
p <- p + geom_boxplot(aes(group=day), colour="grey80")
# Note: ggplot2 >= 3.3.0 uses `fun`; older versions call this argument `fun.y`
p <- p + stat_summary(fun=mean, 
                      colour="steelblue", 
                      geom="line",
                      size=1)
p <- p + stat_summary(fun=mean, 
                      colour="red", 
                      geom="point",
                      size=3)
p
daedalus