Pivoting a CSV file using R

Question

I have a file that looks like this:

                 type          created_at repository_name
1         IssuesEvent 2012-03-11 06:48:31       bootstrap
2         IssuesEvent 2012-03-11 06:48:31       bootstrap
3         IssuesEvent 2012-03-11 06:48:31       bootstrap
4         IssuesEvent 2012-03-11 06:52:50       bootstrap
5         IssuesEvent 2012-03-11 06:52:50       bootstrap
6         IssuesEvent 2012-03-11 06:52:50       bootstrap
7   IssueCommentEvent 2012-03-11 07:03:57       bootstrap
8   IssueCommentEvent 2012-03-11 07:03:57       bootstrap
9   IssueCommentEvent 2012-03-11 07:03:57       bootstrap
10        IssuesEvent 2012-03-11 07:03:58       bootstrap
11        IssuesEvent 2012-03-11 07:03:58       bootstrap
12        IssuesEvent 2012-03-11 07:03:58       bootstrap
13         WatchEvent 2012-03-11 07:15:44       bootstrap
14         WatchEvent 2012-03-11 07:15:44       bootstrap
15         WatchEvent 2012-03-11 07:15:44       bootstrap
16         WatchEvent 2012-03-11 07:18:45        hogan.js
17         WatchEvent 2012-03-11 07:18:45        hogan.js
18         WatchEvent 2012-03-11 07:18:45        hogan.js

The dataset that I'm working with can be accessed on https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/twitter_events_mini.csv.

I want to create a table that has a column for each entry in the "repository_name" column (e.g. bootstrap, hogan.js). In that column I need to have the data from the "type" column that corresponds to that entry (i.e. only rows from the current "type" column that also have the value "bootstrap" in the current "repository_name" column should fall under the new "bootstrap" column). Hence:

Timestamps are just for ordering and do not need to be synchronized across rows (in fact they can be deleted, as the data is already sorted by timestamp).
Even if "IssuesEvent" is repeated 10x, I need to retain all of these, since I will be doing sequence analysis using the R package TraMineR.
Columns can be of unequal length.
There is no relationship between the columns for different repos ("repository_name").

In other words, I would want a table that looks something like this:

     bootstrap            hogan.js
1    IssuesEvent          PushEvent
2    IssuesEvent          IssuesEvent
3    IssueCommentEvent    WatchEvent

How can I accomplish this in R?

Some of my failed attempts using the reshape package can be found on https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/reshaping_bigqueries.R.

If you give us a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) by pasting the results of `dput(head(yourDataHere))`, you'll be much more likely to get an answer. Also, in your reorganized data, is there any relationship between the two columns? i.e. do "IssuesEvent" and "PushEvent" in the first row relate to the same thing? Or do you simply want to create two columns, where the first occurrence of each shows up on the first row, the second occurrence on the second row, etc.? — Chase, Aug 08 '12 at 21:58
I'm no expert in `plyr` or `melt`/`recast`, but I'd start mucking with those tools. Just one side note: `unique(mydata$repository_name)` will give you the column names (and number of columns) for your new data frame. — Carl Witthoft, Aug 08 '12 at 22:28
Thanks! I updated my question to clarify the requirements for the output I'm seeking. I had problems pasting a usable reproducible, so I pasted a link to my dataset instead: https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/twitter_events_mini.csv — histelheim, Aug 10 '12 at 00:54
The CSV I'm using can be found here: https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/twitter_events_mini.csv The query I'm using to generate this CSV from BigQuery can be found in this post: http://stackoverflow.com/questions/11875929/how-to-get-several-columns-from-bigquery/11893118#11893118 — histelheim, Aug 10 '12 at 01:42

score 5 · Answer 1 · answered Aug 08 '12 at 22:29

I just joined stackoverflow; hopefully my answer is somewhat useful.

By table, I assume you mean that you want a data frame. However, it seems unlikely that columns would be of equal length, and it looks like rows wouldn't have much meaning anyway. Maybe a list would be better?

Here's a messy solution:

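# One entry per distinct repository
names <- unique(olddataframe$repository_name)
# lapply() always returns a list, so sequences of unequal length are fine
results <- lapply(seq_along(names), function(j){
    # Collect the event types for repository j, in row order
    sapply(which(olddataframe$repository_name == names[j]), function(i){
        olddataframe$type[i]
    })
})
names(results) <- names
results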

mengeln

Thanks. I tried running your code, but when I do I get the following error message: Error: unexpected ')' in: " olddataframe$type[i] )" > }) Error: unexpected '}' in " }" > names(results) <- names Error in names(results) <- names : object 'results' not found > results Error: object 'results' not found – histelheim Aug 10 '12 at 01:01

flodel · Accepted Answer · 2012-08-10T01:31:06.323

Your sample data:

data <- structure(list(type = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("IssueCommentEvent", 
"IssuesEvent", "WatchEvent"), class = "factor"), created_at = structure(c(1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 
6L), .Label = c("2012-03-11 06:48:31", "2012-03-11 06:52:50", 
"2012-03-11 07:03:57", "2012-03-11 07:03:58", "2012-03-11 07:15:44", 
"2012-03-11 07:18:45"), class = "factor"), repository_name = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L), .Label = c("bootstrap", "hogan.js"), class = "factor")), .Names = c("type", 
"created_at", "repository_name"), class = "data.frame", row.names = c(NA, 
-18L))

I gather from your expected output that you want only one type when it shows up multiple times for the same created_at value; in other words, you want to remove duplicates:

data <- unique(data)

Then, to extract all type entries per repository_name in the order they appear, you can simply use:

data.split <- split(data$type, data$repository_name)
data.split
# $bootstrap
# [1] IssuesEvent       IssuesEvent       IssueCommentEvent
# [4] IssuesEvent       WatchEvent       
# Levels: IssueCommentEvent IssuesEvent WatchEvent
# 
# $hogan.js
# [1] WatchEvent
# Levels: IssueCommentEvent IssuesEvent WatchEvent

It returns a list, which is the R data structure of choice for a collection of vectors with different lengths.
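
If every repeated event has to be kept (for example, for sequence analysis), the `unique()` step can simply be skipped: `split()` retains duplicates and preserves the row order within each group. A minimal sketch, assuming a small made-up data frame (`events` and its values are illustrative, not the real dataset):

```r
# Hypothetical toy data; stringsAsFactors = FALSE keeps plain character vectors
events <- data.frame(
  type = c("IssuesEvent", "IssuesEvent", "WatchEvent", "PushEvent", "IssuesEvent"),
  repository_name = c("bootstrap", "bootstrap", "hogan.js", "bootstrap", "hogan.js"),
  stringsAsFactors = FALSE
)

# No unique() call: duplicates are retained, in the order the rows appear
events.split <- split(events$type, events$repository_name)
events.split
```

Here the two consecutive "IssuesEvent" rows for bootstrap both end up in `events.split$bootstrap`.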

Edit: Now that you have provided an example of your output data, it has become more apparent that your expected output is indeed a data.frame. You can convert the list above into a data.frame padded with NAs using the following function:

list.to.df <- function(arg.list) {
   max.len  <- max(sapply(arg.list, length))
   arg.list <- lapply(arg.list, `length<-`, max.len)
   as.data.frame(arg.list)
}

df.out <- list.to.df(data.split)
df.out
#           bootstrap   hogan.js
# 1       IssuesEvent WatchEvent
# 2       IssuesEvent       <NA>
# 3 IssueCommentEvent       <NA>
# 4       IssuesEvent       <NA>
# 5        WatchEvent       <NA>

You can then save that to a file using

write.csv(df.out, file = "out.csv", quote = FALSE, na = "", row.names = FALSE)

to get the exact same output format as the one you published on GitHub.
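
The padding in `list.to.df` relies on the replacement function `length<-`: assigning a length longer than the vector extends it, and the new slots are filled with `NA`. A small sketch with a made-up vector (`x` and `y` are illustrative names only):

```r
x <- c("IssuesEvent", "WatchEvent")   # hypothetical two-element vector
length(x) <- 4                        # extend; new slots are filled with NA
x

# Equivalent functional form, which is what lapply() calls in list.to.df
y <- `length<-`(c("IssuesEvent", "WatchEvent"), 4)
identical(x, y)
```

Calling the function by its backquoted name is what makes it possible to pass it straight to `lapply()` with the target length as an extra argument.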

Thanks! However, I want all instances of "type", even if it is just a list of 10x "PushEvent". This is so because I will be using this for sequence analysis, and repeating instances of type is perfectly interesting in itself. — histelheim, Aug 10 '12 at 01:00
No problem, you can ignore the `data <- unique(data)` step. I have also edited my answer so the output matches the one you published recently. — flodel, Aug 10 '12 at 01:25

A5C1D2H2I1M1N2O1R2T1 · Answer 3 · 2012-08-09T05:23:10.437

Using @flodel's data object, you can also try aggregate(), but with many event types, this would quickly become unreadable:

aggregate(list(Type = unique(data)$type), 
          list(Repository = unique(data)$repository_name), 
          function(x) paste0(x))
#   Repository                                                                 Type
# 1  bootstrap IssuesEvent, IssuesEvent, IssueCommentEvent, IssuesEvent, WatchEvent
# 2   hogan.js                                                           WatchEvent

You can also try reshape() and do some trickery with t() (transpose), as below.

temp = unique(data)
temp = reshape(temp, direction = "wide", 
               idvar="repository_name", timevar="created_at")
# If you want to keep the times, remove `row.names=NULL` below
temp1 = data.frame(t(temp[-1]), row.names=NULL)
names(temp1) = t(temp[1])
temp1
#           bootstrap   hogan.js
# 1       IssuesEvent       <NA>
# 2       IssuesEvent       <NA>
# 3 IssueCommentEvent       <NA>
# 4       IssuesEvent       <NA>
# 5        WatchEvent       <NA>
# 6              <NA> WatchEvent

But, I find that all of those NAs are obnoxious; I would say that @flodel's answer is the most direct and probably the most useful in the long run (that is, not knowing exactly what you want to do once you get the data in this form).

Update (more trickery)

(Actually, this is a "SO is perfect for procrastination" moment)

My final (terribly inefficient) answer is as follows.

Proceed as above, but drop the date/time stuff, and convert from factors to characters.

# Using @flodel's data
temp1 = unique(data)[-2]
# Remove the factors
temp1[sapply(temp1, is.factor)] = lapply(temp1[sapply(temp1, is.factor)], 
                                         as.character)
# Split and unlist your data
temp2 = split(temp1[-c(2:3)], temp1$repository_name)
temp3 = sapply(temp2, as.vector)

rbind() and cbind() will "recycle" objects of different lengths to make them the same length, but we don't want that. So, we need to force R to believe that the lengths are the same. So, find out the max length. While we're at it, extract a cleaned up version of the names in the temp3 object.

# What is the max number of rows we need?
LEN = max(sapply(temp3, length))
# What are the names we want for our columns?
# Anchor the pattern so only a literal ".type" suffix is stripped
NAMES = sub("\\.type$", "", names(temp3))

Now, extract the items from temp3 into your workspace, and make sure they are both the same length.

# Use assign to unlist the vectors to the workspace
for (i in 1:length(temp3)) assign(NAMES[i], temp3[[i]])
# Make sure they have the same lengths
length(hogan.js) = LEN
length(bootstrap) = LEN

Finally, use cbind() to put your data together.

# Use cbind to put these together
data.frame(cbind(bootstrap, hogan.js))
#           bootstrap   hogan.js
# 1       IssuesEvent WatchEvent
# 2       IssuesEvent       <NA>
# 3 IssueCommentEvent       <NA>
# 4       IssuesEvent       <NA>
# 5        WatchEvent       <NA>
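
The `assign()` step above hard-codes the repository names, so it only works for these two repositories. A sketch of the same padding idea that should work for any number of repositories, assuming a named list of character vectors shaped like `temp3` (the `seqs` list here is made up for illustration, and `lengths()` needs R 3.2.0 or later):

```r
# Hypothetical named list of event sequences with unequal lengths
seqs <- list(
  bootstrap = c("IssuesEvent", "IssuesEvent", "WatchEvent"),
  hogan.js  = c("WatchEvent")
)

LEN <- max(lengths(seqs))              # length of the longest sequence
padded <- lapply(seqs, function(v) {   # pad each vector with NA up to LEN
  length(v) <- LEN
  v
})

# check.names = FALSE keeps the column names exactly as given
out <- data.frame(padded, check.names = FALSE, stringsAsFactors = FALSE)
out
```

Because nothing is written to the workspace, there is no risk of overwriting existing objects that happen to share a repository's name.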

Thanks! This looks awesome, but I have to run and will have to review this tomorrow. =( — histelheim, Aug 10 '12 at 01:03
