2

My first time here so I hope I don't break anything... I have a list of lists:

Browse[2]> head(str(mylist))
List of 33
 $ : chr [1:33] "0001" "space" "28" "night_club" ...
 $ : chr [1:33] "0002" "concert" "28" "night_club" ...
 $ : chr [1:31] "0003" "night_club" "24" "martial_arts" ...
 $ : chr [1:31] "0004" "stage" "24" "basketball" ...
 $ : chr [1:43] "0005" "night_club" "16" "concert" ...
 $ : chr [1:43] "0006" "night_club" "16" "concert" ...
 $ : chr [1:39] "0007" "night_club" "22" "concert" ...
 $ : chr [1:39] "0008" "night_club" "22" "concert" ...
 $ : chr [1:31] "0009" "night_club" "46" "martial_arts" ...
 $ : chr [1:31] "0010" "night_club" "46" "martial_arts" ...
 $ : chr [1:41] "0011" "night_club" "17" "martial_arts" ...
 $ : chr [1:41] "0012" "night_club" "17" "martial_arts" ...
 $ : chr [1:29] "0013" "concert" "23" "night_club" ...
 $ : chr [1:29] "0014" "concert" "23" "night_club" ...
 $ : chr [1:25] "0015" "night_club" "26" "concert" ...
 $ : chr [1:31] "0016" "night_club" "42" "concert" ...
 $ : chr [1:31] "0017" "night_club" "42" "concert" ...
 $ : chr [1:31] "0018" "night_club" "25" "wrestling" ...
 $ : chr [1:31] "0019" "night_club" "25" "wrestling" ...
 $ : chr [1:33] "0020" "night_club" "46" "wrestling" ...
 $ : chr [1:33] "0021" "night_club" "46" "wrestling" ...
 $ : chr [1:41] "0022" "concert" "21" "stage" ...
 $ : chr [1:41] "0023" "concert" "21" "stage" ...
 $ : chr [1:55] "0024" "basketball" "8" "concert" ...
 $ : chr [1:55] "0025" "basketball" "8" "concert" ...
 $ : chr [1:37] "0026" "bald_person" "26" "martial_arts" ...
 $ : chr [1:37] "0027" "bald_person" "26" "martial_arts" ...
 $ : chr [1:37] "0028" "night_club" "32" "business_meeting" ...
 $ : chr [1:37] "0029" "night_club" "32" "business_meeting" ...
 $ : chr [1:15] "0030" "night_club" "59" "stage" ...
 $ : chr [1:37] "0031" "stage" "12" "night_club" ...
 $ : chr [1:37] "0032" "stage" "12" "night_club" ...
 $ : chr [1:33] "0033" "night_club" "23" "portrait" ...

I want to turn this list into a wide format data frame where the first column would be each of every inner list first element (i.e. "0001", "0002" etc.) and there would be all possible columns with categories exist in the file: "space", "night_club", "concert", "marital_arts", "wrestling" etc. meaning that I would a very wide data frame that each row will begin with some id (0001,0002,0003 ...) and the columns names would be again all categories in the file: "space", "night_club", "concert", "marital_arts", "wrestling" etc. and for each row, where the category exists for that id, it would populate the value next to the category from the list ("space" -> 28 from the first line for example).

I was trying to construct a normalized data frame with loops and then convert it to a wide format, but as data scales it would be a bad idea:

for (file in files){# iterate over files in folder

    mylist <- strsplit(readLines(file), ":")
    #close(mylist)
    for (elem in mylist){
      dataframe <- data.frame(frameid = numeric(), category = character(), nrow = length(unlist(elem)))
      frameid <- rep.int(elem[[1]], length(elem)-1) 
      categories <- elem[-1:-1]
      dataframe$frameid <- frameid
      dataframe$category <- categories
    }
  }

Reproducible input output example: dput of input:

 list(c("0001", "space", "28", "night_club", "25"), c("0002", 
"concert", "28", "night_club", "26"), c("0003", "night_club", 
"24", "martial_arts", "27"), c("0004", "stage", "24", "basketball", 
"30"))

output:

Dataframe
frameid, cat_space, cat_night_club, cat_concert, cat_martial_arts, cat_stage, cat_basketball
0001, 28, 25, 0, 0, 0, 0
0002, 0, 26, 28, 0, 0, 0
0003, 0, 24, 0, 27, 0, 0
0004, 0, 0, 0, 0, 24, 30
Neuril
  • 183
  • 1
  • 6
  • [See here for some info on how to construct a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610) – Jaap Nov 23 '15 at 10:09
  • I have updated the question per your request with the input dput object – Neuril Nov 23 '15 at 11:08
  • For a generic discussion on converting lists to data frames, I would suggest that you have a look at [this article](http://www.r-bloggers.com/converting-a-list-to-a-data-frame/) as well as [this solution](http://jason.bryer.org/TriMatch/docs/as.data.frame.list.html). It's a fairly common task and widely discussed across SO. The whole philosophy boils down to assembling your list elements to in a `data.frame` one by one. – Konrad Nov 23 '15 at 11:10

1 Answers1

1

Here's a possibility. I've created the answer as a function, and commented what is happening at each stage. The basic idea is to:

  1. Create a column of just the first items from each list element.
  2. Create a two-column matrix of the rest of the items. This assumes that the data are nicely paired.
  3. Create a data.frame of these two elements put together.
  4. Use xtabs to convert the output to a wide format. Note that if there are duplicated combinations of "ID" and "var", the values would be added together because of the use of xtabs.

Here's the function:

myFun <- function(inList) {
  ## Extract the first value in each list element
  ID <- vapply(inList, `[`, character(1L), 1)
  ## Convert the remaining elements into a two column matrix, first
  ##   column as variable, second column as value. Bind all list
  ##   elements together to a single 2-column mantrix.
  varval <- do.call(rbind, lapply(inList, function(x) {
    matrix(x[-1], ncol = 2, byrow = TRUE, dimnames = list(NULL, c("var", "val")))
  }))
  ## Create a data.frame where ID is repeated to the same number of rows
  ##   as the matrices found in varval.
  temp <- data.frame(ID = rep(ID, (lengths(inList)-1)/2), varval)
  ## Convert the val columns to numeric
  temp$val <- as.numeric(as.character(temp$val))
  ## Use xtabs to go from a "long" form to a "wide" form
  xtabs(val ~ ID + var, temp)
}

Here it is applied to your sample data (assuming your data is called "L"):

myFun(L)
#       var
# ID     basketball concert martial_arts night_club space stage
#   0001          0       0            0         25    28     0
#   0002          0      28            0         26     0     0
#   0003          0       0           27         24     0     0
#   0004         30       0            0          0     0    24
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • thanks. Looks good. I will try it on the entire set and watch the performance. – Neuril Nov 23 '15 at 11:26
  • one thing you have not mentioned is how to convert the normalized data frame to a wide format data frame, but I guess that's not hard – Neuril Nov 23 '15 at 11:29
  • @Neuril??? Isn't the output wide? (See point # 4). Or are you referring to the fact that it's a matrix rather than a `data.frame`? Try `as.data.frame.matrix(myFun(L))` if that's the concern. – A5C1D2H2I1M1N2O1R2T1 Nov 23 '15 at 11:37