Stuck again and hoping the more clued up can provide a pointer ;o)
I have a dataset of 3,270 rows, each with a datePublished (2013-04-01 to 2014-03-31) and a domain (coindesk, forbes, mashable, nytimes, reuters, techcrunch, thenextweb & theverge). A copy of it is here.
> df <- read.csv("dplyr_summary_example.csv")
> head(df)
datePublished domain
1 2013-04-01 coindesk
2 2013-04-01 coindesk
3 2013-04-13 coindesk
4 2013-04-15 coindesk
5 2013-04-15 coindesk
Basically df has a row, with its date/domain pair, for every story that was published.
What I want to do is create a new data frame that looks a bit like this (numbers made up for the example)...
datePublished coindeskStories forbesStories... thevergeStories totalStories
2013-04-01 2 1 1 4
2013-04-13 1 1 0 2
2013-04-15 2 0 1 3
So for every date in df I would like a column with the total stories for each domain, plus a final total-of-totals column (the total of totals is the easy bit).
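If it helps, here's that target shape built by hand from the made-up numbers above (skipping the domains hidden behind the "...", and the column names are just the ones I've invented for this post):

wanted <- data.frame(
  datePublished   = as.Date(c("2013-04-01", "2013-04-13", "2013-04-15")),
  coindeskStories = c(2, 1, 2),
  forbesStories   = c(1, 1, 0),
  thevergeStories = c(1, 0, 1),
  totalStories    = c(4, 2, 3)
)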
I've been looking at dplyr, and it certainly looks like it can do the job, but so far I've not managed to do this in one step.
Doing it for each domain and then joining things is fairly straightforward:
library(dplyr)

daily <- group_by(df, datePublished)                          # group stories by date
cnt.nytimes <- filter(daily, domain == "nytimes")             # keep just the nytimes rows
cnt.nytimes <- summarise(cnt.nytimes, nytimesStories = n())   # table of stories per date
cnt.mashable <- filter(daily, domain == "mashable")
cnt.mashable <- summarise(cnt.mashable, mashableStories = n())
df.Stories <- full_join(cnt.nytimes, cnt.mashable, by = "datePublished") # join the cnt. data frames by datePublished
df.Stories <- arrange(df.Stories, datePublished)              # sort by datePublished
df.Stories$totalStories <- apply(df.Stories[c(2:3)], 1, sum, na.rm = TRUE) # add a totals column
But doing this for every domain and then joining them all up seems a bit inefficient.
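The closest I've got to cutting the repetition is wrapping that filter/summarise pattern in a little helper (just a rough sketch, the helper name is my own invention) and lapply-ing it over the domains, but that still leaves me chaining full_join() calls at the end:

# rough sketch of a hypothetical helper -- same filter/summarise pattern as above, just wrapped up
count_domain <- function(data, dom) {
  cnt <- filter(data, domain == dom)
  cnt <- group_by(cnt, datePublished)
  cnt <- summarise(cnt, stories = n())
  names(cnt)[2] <- paste0(dom, "Stories")   # e.g. "nytimesStories"
  cnt
}

cnt.list <- lapply(unique(as.character(df$domain)), count_domain, data = df)
# ...but I still end up stitching these together with repeated full_join() calls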
Can anyone suggest a simpler route?