
Stuck again and hoping the more clued up can provide a pointer ;o)

I have a dataset: 3,270 rows of datePublished (2013-04-01 to 2014-03-31) and domain (coindesk, forbes, mashable, nytimes, reuters, techcrunch, thenextweb and theverge). A copy of it is here.

> df <- read.csv("dplyr_summary_example.csv")
> head(df)
  datePublished  domain
1 2013-04-01     coindesk
2 2013-04-01     coindesk
3 2013-04-13     coindesk
4 2013-04-15     coindesk
5 2013-04-15     coindesk

Basically df has one row per published story, recording the date and the domain that published it.

What I want to do is create a new dataframe that looks a bit like this (numbers made up for the example)...

datePublished  coindeskStories  forbesStories... thevergeStories totalStories
2013-04-01     2                1                1               4 
2013-04-13     1                1                0               2
2013-04-15     2                0                1               3

So for every date in df I would like a column with the number of stories per domain, plus a final column of totals (the total of totals is the easy bit).

I've been looking at dplyr and it certainly looks like it can do the job but so far I've not managed to do this in one step.

For each domain and then joining things is fairly straightforward:

daily        <- group_by(df,datePublished) # group stories by date

cnt.nytimes  <- filter(daily, domain=="nytimes")  # filter just the nytimes ones
cnt.nytimes  <- summarise(cnt.nytimes,nytimesStories=n()) # give table of stories by date

cnt.mashable <- filter(daily, domain=="mashable")
cnt.mashable <- summarise(cnt.mashable,mashableStories=n())

df.Stories   <- full_join(cnt.nytimes,cnt.mashable,by="datePublished") # join cnt. dataframes by datePublished
df.Stories   <- arrange(df.Stories,datePublished) #sort by datePublished

df.Stories$totalStories <- rowSums(df.Stories[2:3], na.rm=TRUE) # add a totals column (rowSums is the idiomatic apply(..., 1, sum))

BUT doing this for every domain and then joining everything seems a bit inefficient.
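For reference, the repetition can at least be folded into a loop: the sketch below (untested against the real CSV, so toy data stands in for it) runs the same filter/count step once per domain with lapply, then collapses the per-domain tables with Reduce and repeated full joins.

```r
library(dplyr)

# Toy stand-in for the real CSV
df <- data.frame(
  datePublished = c("2013-04-01", "2013-04-01", "2013-04-13",
                    "2013-04-15", "2013-04-15"),
  domain        = c("coindesk", "forbes", "coindesk",
                    "coindesk", "theverge"),
  stringsAsFactors = FALSE
)

# One count table per domain, with columns named like nytimesStories
counts <- lapply(unique(df$domain), function(d) {
  cnt <- df %>% filter(domain == d) %>% count(datePublished)
  names(cnt)[2] <- paste0(d, "Stories")
  cnt
})

# Fold the per-domain tables together with repeated full joins
df.Stories <- Reduce(function(x, y) full_join(x, y, by = "datePublished"), counts)
df.Stories <- arrange(df.Stories, datePublished)
df.Stories$totalStories <- rowSums(df.Stories[-1], na.rm = TRUE)
```

It still does one pass per domain, though, which is what prompts the question below.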

Can anyone suggest a simpler route?

BarneyC
  • David - more than likely parts of it are duplicate, as with anything R there seems to be a number of methods to solve a problem. Far more helpful than merely marking as duplicate (especially when effort has obviously been made) would be to also post a link through to those answers. – BarneyC May 06 '15 at 12:25

3 Answers


What about reshape2::dcast?

require(reshape2)
res <- dcast(df, datePublished ~ domain, value.var = "domain", fun.aggregate = length)

result:

> head(res)
  datePublished coindesk forbes mashable nytimes reuters techcrunch thenextweb theverge
1    2013-04-01        2      2        0       0       0          1          0        2
2    2013-04-02        0      1        1       0       0          0          0        0
3    2013-04-03        0      3        1       0       0          2          0        0
4    2013-04-04        0      0        0       0       0          1          1        1
5    2013-04-05        0      1        0       0       0          1          1        1
6    2013-04-07        0      1        0       1       0          1          0        0

Comment: if you want datePublished as a Date rather than a factor, run

df$datePublished <- as.Date(as.character(df$datePublished))

right after read.csv
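And if you also want the totalStories column from the question, a rowSums over the count columns finishes the job. A self-contained sketch (toy data in place of the CSV, which isn't reproduced here):

```r
require(reshape2)

# Toy stand-in for the CSV
df <- data.frame(
  datePublished = c("2013-04-01", "2013-04-01", "2013-04-13"),
  domain        = c("coindesk", "forbes", "coindesk")
)

# One count column per domain, zero-filled where a domain published nothing
res <- dcast(df, datePublished ~ domain, value.var = "domain", fun.aggregate = length)
res$totalStories <- rowSums(res[-1])  # sum every domain column per date
```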

Rentrop

To change to wide format you need to use tidyr in addition to dplyr. Something like

library(dplyr)
library(tidyr)

df %>% 
    group_by(datePublished, domain) %>%
    summarise(nstories = n()) %>%
    spread(domain, nstories)
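Two small additions get all the way to the asked-for output: fill = 0 turns the NAs for missing date/domain pairs into zeros, and rowSums adds the totals column. A sketch on toy data (the CSV isn't attached):

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the CSV
df <- data.frame(
  datePublished = c("2013-04-01", "2013-04-01", "2013-04-13"),
  domain        = c("coindesk", "forbes", "coindesk"),
  stringsAsFactors = FALSE
)

wide <- df %>%
    group_by(datePublished, domain) %>%
    summarise(nstories = n()) %>%
    spread(domain, nstories, fill = 0) %>%  # zeros instead of NA for missing pairs
    ungroup()
wide$totalStories <- rowSums(wide[-1])      # total across all domain columns
```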
konvas

Why not just use ?aggregate and ?summary?

I could not download your data. However this might help you:

set.seed(12)
n <- 10000
date <- sample(1:100, n, replace=TRUE)
type <- sample(letters[1:5], n, replace=TRUE)
sample <- data.frame(date=date, type=type)

aggregate(type ~ date, data=sample, FUN=summary)
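In the same base-R spirit, table() gets the wide count layout in a single call; a sketch (as.data.frame.matrix keeps the cross-tabulation as ordinary columns, and rowSums adds the total):

```r
set.seed(12)
n <- 10000
date <- sample(1:100, n, replace = TRUE)
type <- sample(letters[1:5], n, replace = TRUE)
dat <- data.frame(date = date, type = type)

# Cross-tabulate date by type, then add a row-wise total
wide <- as.data.frame.matrix(table(dat$date, dat$type))
wide$total <- rowSums(wide)
```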
user1965813