1

I would like to know how to rearrange the source data (table) in order to output the desired table using R or SQL, which are displayed below.

Since looping is very slow in R, and my dataset is quite large... it's not preferred to have too much looping in the script. The efficiency is important.

Source data table:

Date    | Country | ID | Fruit  | Favorite | Money
20120101  US        1    Apple     Book      100
20120101  US        2    Orange    Knife     150
20120101  US        3    Banana    Watch     80
20120101  US        4    Melon     Water     90
20120102  US        1    Apple     Phone     120
20120102  US        2    Apple     Knife     130
20120102  US        3    Banana    Watch     100           
.....     ......    ..   .....     ......    ......

Output table:

Date    | Country | Field   | ID 1 | ID 2  | ID 3  | ID 4
20120101  US        Fruit     Apple  Orange  Banana  Melon
20120101  US        Favorite  Book   Knife   Watch   Water
20120101  US        Money     100    150     80      90
20120102  US        Fruit     Apple  Apple   Banana  N.A.
....      ....      ....      ....   ....    ....    ....
Community
  • 1
  • 1
C.T.
  • 95
  • 1
  • 3
  • What have you tried so far? Please show us your code. Look at [this excellent post](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for tips on how to make your code reproducible. – SlowLearner Jul 18 '13 at 09:14
  • The code I tried is put here. I used loops twice to produce it. https://gist.github.com/anonymous/bfa8b229622555ba8a77 – C.T. Jul 18 '13 at 09:54

1 Answers1

0

Here is an approach in R, using your sample data:

x <- cbind(mydf[, c("Date", "Country", "ID")], 
           stack(mydf[, c("Fruit", "Favorite", "Money")]))
reshape(x, direction = "wide", idvar = c("Date", "Country", "ind"), timevar="ID")
#        Date Country      ind values.1 values.2 values.3 values.4
# 1  20120101      US    Fruit    Apple   Orange   Banana    Melon
# 5  20120102      US    Fruit    Apple    Apple   Banana     <NA>
# 8  20120101      US Favorite     Book    Knife    Watch    Water
# 12 20120102      US Favorite    Phone    Knife    Watch     <NA>
# 15 20120101      US    Money      100      150       80       90
# 19 20120102      US    Money      120      130      100     <NA>

To round up with other options, here's a melt + dcast approach (which can be taken from "data.table" or "reshape2") and a "dplyr" + "tidyr" approach.

library(data.table)
dcast(
  suppressWarnings(
    melt(as.data.table(mydf), c("Date", "Country", "ID"))),
    ... ~ ID, value.var = "value")
#        Date Country variable     1      2      3     4
# 1: 20120101      US    Fruit Apple Orange Banana Melon
# 2: 20120101      US Favorite  Book  Knife  Watch Water
# 3: 20120101      US    Money   100    150     80    90
# 4: 20120102      US    Fruit Apple  Apple Banana    NA
# 5: 20120102      US Favorite Phone  Knife  Watch    NA
# 6: 20120102      US    Money   120    130    100    NA  

library(dplyr)
library(tidyr)
mydf %>%
  gather(variable, value, Fruit:Money) %>%
  spread(ID, value)
#       Date Country variable     1      2      3     4
# 1 20120101      US    Fruit Apple Orange Banana Melon
# 2 20120101      US Favorite  Book  Knife  Watch Water
# 3 20120101      US    Money   100    150     80    90
# 4 20120102      US    Fruit Apple  Apple Banana  <NA>
# 5 20120102      US Favorite Phone  Knife  Watch  <NA>
# 6 20120102      US    Money   120    130    100  <NA>

In this answer, mydf is defined as:

mydf <- structure(
  list(Date = c(20120101L, 20120101L, 20120101L, 
                20120101L, 20120102L, 20120102L, 20120102L), 
       Country = c("US", "US", "US", "US", "US", "US", "US"), 
       ID = c(1L, 2L, 3L, 4L, 1L, 2L, 3L),
       Fruit = c("Apple", "Orange", "Banana", "Melon", 
                 "Apple", "Apple", "Banana"), 
       Favorite = c("Book", "Knife", "Watch", "Water", 
                    "Phone", "Knife", "Watch"), 
       Money = c(100L, 150L, 80L, 90L, 120L, 130L, 100L)), 
  .Names = c("Date", "Country", "ID", 
             "Fruit", "Favorite", "Money"), 
  class = "data.frame", row.names = c(NA, -7L))
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • It looks the same as the desired one, thanks! Does it also work if the country field has multiple values, say "US","KR","HK"etc? – C.T. Jul 18 '13 at 09:58
  • @C.T., it should. Why don't you try it on a small sample of data first. – A5C1D2H2I1M1N2O1R2T1 Jul 18 '13 at 10:00
  • Yes it works for multiple values of Country. For huge data frame, it takes time to finish the process. – C.T. Jul 18 '13 at 10:35
  • @C.T. just FYI, I find the target structure that you are trying to achieve hard to work with. A long format is generally much more user friendly. – A5C1D2H2I1M1N2O1R2T1 Jul 18 '13 at 11:07
  • Would you like to elaborate more on "long format" please? I agree that this structure is hard to achieve even using concise scripts. – C.T. Jul 19 '13 at 01:30
  • @C.T., I would recommend that you read through Hadley Wickham's [tidy data paper](http://vita.had.co.nz/papers/tidy-data.pdf). There's also [a presentation](http://vita.had.co.nz/papers/tidy-data-pres.pdf). My point is not about achieving the wide format, but what to do once you have that wide format. The "x" that I created in my answer above would be an example of the "long" format. From there, it is very easy to subset, plot, restructure, your data and so on. – A5C1D2H2I1M1N2O1R2T1 Jul 19 '13 at 01:45
  • I see. I will try to restructure the data as you recommended, thanks a lot! – C.T. Jul 19 '13 at 06:28