0

I'm relatively new to R, so I don't think as clearly in vector-space as more-experienced users do. I have a data frame that's formatted like so:

              metric timestamp               value           tag1     tag2   tag3 tag4 tag5 tag6 tag7 tag8 tag9 tag10
1 dummy.random.unif 1367848802  0.9936670064926147 host=localhost blah=foo   NA   NA   NA   NA   NA   NA   NA    NA
2 dummy.random.unif 1367848822 0.19621700048446655 host=localhost blah=bar   NA   NA   NA   NA   NA   NA   NA    NA
3 dummy.linear      1367848842                97.6 shmoo=whatever NA         NA   NA   NA   NA   NA   NA   NA    NA
4 dummy.random.unif 1367848862  0.3171229958534241 host=localhost blah=foo   NA   NA   NA   NA   NA   NA   NA    NA
5 dummy.linear      1367848882                97.7 shmoo=whatever NA         NA   NA   NA   NA   NA   NA   NA    NA
6 dummy.random.unif 1367848902  0.2197140008211136 host=localhost blah=foo   NA   NA   NA   NA   NA   NA   NA    NA

As you can see, the columns tag1:tag10 contain key-value pairs. But not always the same keys, and not always the same number of keys. I want to convert this data frame to something more like this, which is more convenient for consumption:

              metric timestamp               value   tag.host tag.blah tag.shmoo 
1 dummy.random.unif 1367848802  0.9936670064926147  localhost      foo        NA
2 dummy.random.unif 1367848822 0.19621700048446655  localhost      bar        NA
3 dummy.linear      1367848842                97.6         NA       NA  whatever
4 dummy.random.unif 1367848862  0.3171229958534241  localhost      foo        NA
5 dummy.linear      1367848882                97.7         NA       NA  whatever
6 dummy.random.unif 1367848902  0.2197140008211136  localhost      foo  whatever

Now I know I could do this procedurally, but it would be clunky, and I've heard that the correct way to use R is to think about operations on entire vectors (rather than looping over them). I've spent a few hours trying to figure out the right permutation of do.call, daply, strsplit, and so on, but I'm not getting anywhere.

What is a clean, R-ish way to solve this problem?

danslimmon
  • 213
  • 1
  • 2
  • 6
  • Please read this on how to make a good reproducible example and update your question: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Dason May 06 '13 at 22:04

1 Answers1

3

If DF is the input data frame then convert it from wide to long form using melt, remove NA rows, split the tags into tag.name and tag.value and reshape it back to wide as indicated:

library(reshape2)
m <- na.omit(melt(DF, id = 1:3, value.name = "tag"))
m.tr <- transform(m, tag.name = sub("=.*", "", tag), 
                     tag.value = sub(".*=", "", tag))
dcast(metric + timestamp + metric + value ~ tag.name, data=m.tr, 
      value.var="tag.value")

Swap metric and timestamp in the last line if its desired to have the output in timestamp order. The result of the last line is:

             metric  timestamp     value blah      host    shmoo
1      dummy.linear 1367848842 97.600000 <NA>      <NA> whatever
2      dummy.linear 1367848882 97.700000 <NA>      <NA> whatever
3 dummy.random.unif 1367848802  0.993667  foo localhost     <NA>
4 dummy.random.unif 1367848822  0.196217  bar localhost     <NA>
5 dummy.random.unif 1367848862  0.317123  foo localhost     <NA>
6 dummy.random.unif 1367848902  0.219714  foo localhost     <NA>
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Holy crap. I hope one day if I work hard enough I will be as good at this mode of thinking as you are. Thanks. – danslimmon May 06 '13 at 22:33