1

I get data in a key-value format with an unique identifier. key are variable, and their value is stored in value. ID identifies the moment when they were collected. As such:

data <- data.table(ID=c(1, 1, 2, 1, 3, 1, 1, 3, 2), key=c("A", "A","A","B", "B", "B", "B", "C", "C"), value=c(6,7,3,5,8,11,7,8,7))

 #   ID key  value
 #1  1   A     6
 #2  1   A     7
 #3  2   A     3
 #4  1   B     5
 #5  3   B     8
 #6  1   B     11
 #7  1   C     7
 #8  3   C     8
 #9  2   C     7

I want to be able to use the following functions:

  • Visualization:ggplot::ggplot,
  • Filtering/Subsetting:dplyr::filter,dplyr::select
  • Grouping:dplyr::group_by, Summary: dplyr::summarise
  • Statistical tests:stats::aggregate, stats::t.test,stats::shapiro.test, stats::fligner.test,stats::pairwise.t.test,stats::kruskal.test,nlme::lme,car::leveneTest,ez:ezANOVA

I have two mains issues:

  1. dplyr related functions assumes that you have a wide format. Thus, how can I filter / group_by ,summarise? Concretely, if I want to have the mean of A group_by B only when C > 10?
  2. In terms of performance, I quickly have few millions base::nrow, could it have an impact?

Are there other risks to this key-value long format approach?

Should I reshape every time I need to use some of the above functions? Should I store everything in one wide dataframe with lists (see below)?

Here is an example of what I thought could work

Let's say I have a long dataframe that has multiple values (temp) for the same key (year). I want to be able to represent it as wide data even in the case where they are multiple occurrences for the same values. How can I best achieve that in R?

It looks like:

# Create long dataset
library(data.table)
country_long <- data.table(country=c("Sweden", "Sweden", "Norway", "Sweden", "Denmark","Sweden", "Sweden", "Denmark", "Norway"), year=c("Y1994", "Y1994","Y1994","Y1995", "Y1995", "Y1995", "Y1996", "Y1996", "Y1996"), temp=c(6,7,3,5,8,11,7,8,7))
country_long

   #country year    temp
 #1  Sweden Y1994       6
 #2  Sweden Y1994       7
 #3  Norway Y1994       3
 #4  Sweden Y1995       5
 #5 Denmark Y1995       8
 #6  Sweden Y1995      11
 #7  Sweden Y1996       7
 #8 Denmark Y1996       8
 #9  Norway Y1996       7

I would like to make it wide so that when there is more than one temp, it is put into a list(). People usually use reshape2 or tidyr but I am not sure how it works when you have more than one value per key. It would look like:

country_wide <-data.table(country = c("Sweden", "Denmark", "Norway"),Y1994=list(c(6,7),"",3),Y1995=list(c(5,11),8,""),Y1996=c(7,8,7))

country_wide 
#country      Y1994    Y1995       Y1996
#1 Sweden     6,7       5,11         7
#2 Denmark                 8         8
#3 Norway     3                      7

Finally, from there how can spread one column with multiple values? Something like tidyr:gather() could work but I am not sure how this could be applied to this situation.

country_wide_expended <-data.table(country = c("Sweden","Sweden","Denmark", "Norway"),k=c(1,2,1,1), Y1994=list(6,7,"",3),Y1995=list(c(5,11),c(5,11),8,""),Y1996=c(7,7,8,7))
country_wide_expended 
#country       k  Y1994    Y1995    Y1996
#1 Sweden      1     6       5,11      7
#1 Sweden      2     7       5,11      7
#2 Denmark     1              8        8
#3 Norway      1     3                 7

I can then use this dataframe to analyze (lm, lme, t-test...) and visualize (ggplot) Y1994.

(Adapted from https://sejdemyr.github.io/r-tutorials/basics/wide-and-long/)

Wistar
  • 3,770
  • 4
  • 45
  • 70
  • Use set.seed to make sure we're looking at the same example data. Also because you're using expand.grid, the phenomenon you describe won't show up -- there is exactly one obs per country-year combo – Frank Nov 06 '17 at 19:51
  • @Frank, I edited the code. There are two occurrences for `Sweden` in `Y1994` and `Y1995`. – Wistar Nov 06 '17 at 19:57
  • 2
    What is your desired result? And long format is the preferred structure of most data analyses and in most data stores like normalization in databases. Usually reshaping/pivoting is for report needs. How can nested lists help in `lm()`, `t.test()` procedures? Almost every case I know nested lists are never a good idea. – Parfait Nov 06 '17 at 19:58
  • Ok, thanks. Since you're using data.table, maybe `dcast(country_long, country ~ year, fun = list)`, and it should work the same with reshape2. I agree with Parfait that this format is not useful for most analysis in R, just for browsing or perhaps publishing. – Frank Nov 06 '17 at 19:59
  • @Parfait You are right, that is the root of the issue. I can thing of two things that make me hesitate to stick to the long. 1) Performance. This is a simplified example but have a data.frame with >30 columns and few thousand lines can quickly add up in a long key-value format. 2) Filtering. I use `dplyr::filter()` which I believe require a wide format. Subsetting does not appear to be easy in a long format. If you have some suggestions on how to deal with these issue, I would be really grateful. – Wistar Nov 06 '17 at 20:39
  • @Frank That's right. However, even the numeric column becomes list. How can I then "`gather`" the lists (see last table)? – Wistar Nov 06 '17 at 20:40
  • 1
    Your question is an [XY Problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). You are asking help for a Y solution but not the X problem. Why do you need to use embedded lists in a dataframe? What analysis requires this format? Tell us the real reason and we can help. – Parfait Nov 06 '17 at 20:44
  • @Parfait I edited the question to explained the root problem before going to a possible solution. – Wistar Nov 06 '17 at 21:29
  • *dplyr related functions assumes that you have a wide format.*...not sure where you read this but all those functions you post requires long format, and your key/value pair can work with such methods.. – Parfait Nov 06 '17 at 21:48
  • @Parfait I might be missing something. For instance, how would you go "to have the mean of A group_by B only when C > 10?" – Wistar Nov 06 '17 at 21:51
  • Ahhhh! I see what's going on. You have a melted dataset where *key* are columns. Yes, you need to reshape, un-melt or dcast, but NOT for *any* nested lists and then do your aggregation/processing. See [here](https://stackoverflow.com/q/24880835/1422451). – Parfait Nov 06 '17 at 22:04
  • @Parfait. That's right! Thank you for the link and your input, it is truly useful. If you think of any other resources on how people deal with such melted dataset I would certainly appreciate looking at it. – Wistar Nov 06 '17 at 22:23

1 Answers1

0

To Make a dataframe wide you can use t() function. Here-

> data <- data.table(ID=c(1, 1, 2, 1, 3, 1, 1, 3, 2), key1=c("A", 
"A","A","B", "B", "B", "B", "C", "C"), value=c(6,7,3,5,8,11,7,8,7))
> data
 ID key1 value
1:  1    A     6
2:  1    A     7
3:  2    A     3
4:  1    B     5
5:  3    B     8
6:  1    B    11
7:  1    B     7
8:  3    C     8
9:  2    C     7
> t(data)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
ID    "1"  "1"  "2"  "1"  "3"  "1"  "1"  "3"  "2" 
key1  "A"  "A"  "A"  "B"  "B"  "B"  "B"  "C"  "C" 
value " 6" " 7" " 3" " 5" " 8" "11" " 7" " 8" " 7"

Regarding rest of your problem ,it is little bit difficult to understand.Please explain it clearly . Like-

1.What kind of analysis you want to perform. 2.In which formate you want to see your data for analysis.

Answer of these questions will help me to assist you further.

Thanks

mehakVT
  • 167
  • 1
  • 8