This is a follow-up question to Pivoting a CSV file using R.

In that question, I wanted to split a single column (type) into several columns based on the values in another column (repository_name). The following input data was used.

                type          created_at repository_name
1        IssuesEvent 2012-03-11 06:48:31       bootstrap
2        IssuesEvent 2012-03-11 06:48:31       bootstrap
3  IssueCommentEvent 2012-03-11 07:03:57       bootstrap
4  IssueCommentEvent 2012-03-11 07:03:57       bootstrap
5  IssueCommentEvent 2012-03-11 07:03:57       bootstrap
6        IssuesEvent 2012-03-11 07:03:58       bootstrap
7         WatchEvent 2012-03-11 07:18:45        hogan.js
8         WatchEvent 2012-03-11 07:18:45        hogan.js
9         WatchEvent 2012-03-11 07:18:45        hogan.js
10 IssueCommentEvent 2012-03-11 07:03:57       bootstrap

The full CSV file is available at https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/all_events.csv.

Here is a dput() of the first 30 rows of the CSV:

structure(list(type = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 2L, 
2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 
1L, 4L, 4L, 4L, 2L, 2L, 2L), .Label = c("ForkEvent", "IssueCommentEvent", 
"IssuesEvent", "WatchEvent"), class = "factor"), created_at = structure(c(1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 
6L, 7L, 7L, 7L, 8L, 8L, 8L, 9L, 9L, 9L, 10L, 10L, 10L), .Label = c("2012-03-11 06:48:31", 
"2012-03-11 06:52:50", "2012-03-11 07:03:57", "2012-03-11 07:03:58", 
"2012-03-11 07:15:44", "2012-03-11 07:18:45", "2012-03-11 07:19:01", 
"2012-03-11 07:23:56", "2012-03-11 07:32:43", "2012-03-11 07:38:52"
), class = "factor"), repository_name = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 
1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 1L, 1L, 1L), .Label = c("bootstrap", 
"hogan.js", "twemproxy"), class = "factor")), .Names = c("type", 
"created_at", "repository_name"), class = "data.frame", row.names = c(NA, 
-30L))

That question was answered well by @flodel, who proposed this code:

data.split <- split(events.raw$type, events.raw$repository_name)
data.split

list.to.df <- function(arg.list) {
  max.len  <- max(sapply(arg.list, length))
  arg.list <- lapply(arg.list, `length<-`, max.len)
  as.data.frame(arg.list)
}

df.out <- list.to.df(data.split)
df.out

However, I would now like to arrange the events (type) so that each repository (repository_name) gets one column per month (extracted from the "created_at" column), like this:

    bootstrap_2012_03   bootstrap_2012_04    hogan.js_2012_03
1    IssuesEvent          PushEvent          PushEvent
2    IssuesEvent          IssuesEvent        IssuesEvent
3    IssueCommentEvent    WatchEvent         IssuesEvent

Some other assumptions are:

  • Timestamps are used only for ordering and do not need to be synchronized across a row
  • Even if "IssuesEvent" is repeated 10 times, I need to retain every occurrence, since I will be doing sequence analysis using the R package TraMineR
  • Columns can be of unequal length
  • There is no relationship between the columns for different repos ("repository_name")
  • Data for different months of the same repository are completely independent

How can I accomplish this in R?

histelheim
  • When you asked your earlier question, it was also suggested to [provide a reproducible example of your data](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Your data in this format isn't very easy to even copy and paste into R without some extra work by potential respondents. – A5C1D2H2I1M1N2O1R2T1 Aug 26 '12 at 16:21
  • Forgot that. The file I'm working with can be found here: https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/all_events.csv – histelheim Aug 26 '12 at 16:24
  • I suggest you use `dput()` to paste your sample data into the question. – Andrie Aug 26 '12 at 16:26

1 Answer

Instead of splitting by the repository_name column, first create a new column that combines repository_name and the month:

events.raw$month      <- format(as.Date(events.raw$created_at), "%Y_%m")
events.raw$repo.month <- paste(events.raw$repository_name,
                               events.raw$month, sep = "_")

head(events.raw)
#          type          created_at repository_name   month        repo.month
# 1 IssuesEvent 2012-03-11 06:48:31       bootstrap 2012_03 bootstrap_2012_03
# 2 IssuesEvent 2012-03-11 06:48:31       bootstrap 2012_03 bootstrap_2012_03
# 3 IssuesEvent 2012-03-11 06:48:31       bootstrap 2012_03 bootstrap_2012_03
# 4 IssuesEvent 2012-03-11 06:52:50       bootstrap 2012_03 bootstrap_2012_03
# 5 IssuesEvent 2012-03-11 06:52:50       bootstrap 2012_03 bootstrap_2012_03
# 6 IssuesEvent 2012-03-11 06:52:50       bootstrap 2012_03 bootstrap_2012_03
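
The sample above only covers March 2012, so every key ends in 2012_03. As a quick check of how the combined key behaves across months, here is a minimal sketch on a made-up data frame (the `toy` object and its rows are hypothetical, not taken from the CSV):

```r
# Hypothetical toy data (not from the original CSV) spanning two months.
toy <- data.frame(
  type            = c("IssuesEvent", "PushEvent", "WatchEvent", "PushEvent"),
  created_at      = c("2012-03-11 06:48:31", "2012-04-02 10:00:00",
                      "2012-04-05 12:30:00", "2012-03-20 09:15:00"),
  repository_name = c("bootstrap", "bootstrap", "bootstrap", "hogan.js"),
  stringsAsFactors = FALSE
)

# as.Date() reads the leading "YYYY-MM-DD" part and ignores the time of day.
toy$month      <- format(as.Date(toy$created_at), "%Y_%m")
toy$repo.month <- paste(toy$repository_name, toy$month, sep = "_")

# One key per repository/month pair.
table(toy$repo.month)
```

Here the two April rows for bootstrap should share a key that differs from the March one, so they would end up in their own column after splitting.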

Then use the same method I suggested last time:

data.split <- split(events.raw$type, events.raw$repo.month)

list.to.df <- function(arg.list) {
  max.len  <- max(sapply(arg.list, length))
  arg.list <- lapply(arg.list, `length<-`, max.len)
  as.data.frame(arg.list)
}

df.out <- list.to.df(data.split)
head(df.out)
#    bootstrap_2012_03 hogan.js_2012_03 twemproxy_2012_03
# 1        IssuesEvent       WatchEvent        WatchEvent
# 2        IssuesEvent       WatchEvent        WatchEvent
# 3        IssuesEvent       WatchEvent        WatchEvent
# 4        IssuesEvent             <NA>              <NA>
# 5        IssuesEvent             <NA>              <NA>
# 6        IssuesEvent             <NA>              <NA>
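
Note that split() keeps rows in the order they appear in the data frame, so the code above relies on events.raw already being in chronological order. If that is not guaranteed for your file, one possible approach is to sort by created_at first. The sketch below does this on a deliberately unsorted, made-up data frame (`events.toy` and its rows are hypothetical, and the timestamp format and UTC time zone are assumptions):

```r
# Hypothetical, deliberately unsorted toy data (not from the original CSV).
events.toy <- data.frame(
  type            = c("WatchEvent", "IssuesEvent", "PushEvent", "ForkEvent"),
  created_at      = c("2012-03-11 07:18:45", "2012-03-11 06:48:31",
                      "2012-04-01 08:00:00", "2012-03-11 07:23:56"),
  repository_name = c("bootstrap", "bootstrap", "bootstrap", "hogan.js"),
  stringsAsFactors = FALSE
)

# Sort chronologically first; split() keeps the row order within each group.
# The format string and time zone are assumptions about the timestamps.
stamps     <- as.POSIXct(events.toy$created_at,
                         format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
events.toy <- events.toy[order(stamps), ]

events.toy$repo.month <- paste(events.toy$repository_name,
                               format(as.Date(events.toy$created_at), "%Y_%m"),
                               sep = "_")

# Pad every group with NA up to the longest one, then bind as columns.
groups  <- split(events.toy$type, events.toy$repo.month)
max.len <- max(lengths(groups))
df.toy  <- as.data.frame(lapply(groups, `length<-`, max.len),
                         stringsAsFactors = FALSE)
df.toy
```

Repeated events are kept as they are, and shorter columns are padded with NA at the end.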
flodel