I get data in a key-value format with an unique identifier. key
are variable, and their value is stored in value
. ID
identifies the moment when they were collected. As such:
data <- data.table(ID=c(1, 1, 2, 1, 3, 1, 1, 3, 2), key=c("A", "A","A","B", "B", "B", "B", "C", "C"), value=c(6,7,3,5,8,11,7,8,7))
# ID key value
#1 1 A 6
#2 1 A 7
#3 2 A 3
#4 1 B 5
#5 3 B 8
#6 1 B 11
#7 1 C 7
#8 3 C 8
#9 2 C 7
I want to be able to use the following functions:
- Visualization:
ggplot::ggplot
, - Filtering/Subsetting:
dplyr::filter
,dplyr::select
- Grouping:
dplyr::group_by
, Summary:dplyr::summarise
- Statistical tests:
stats::aggregate
,stats::t.test
,stats::shapiro.test
,stats::fligner.test
,stats::pairwise.t.test
,stats::kruskal.test
,nlme::lme
,car::leveneTest
,ez:ezANOVA
I have two mains issues:
dplyr
related functions assumes that you have a wide format. Thus, how can Ifilter
/group_by
,summarise
? Concretely, if I want to have themean
ofA
group_by
B
only whenC > 10
?- In terms of performance, I quickly have few millions
base::nrow
, could it have an impact?
Are there other risks to this key-value long format approach?
Should I reshape every time I need to use some of the above functions? Should I store everything in one wide dataframe with lists (see below)?
Here is an example of what I thought could work
Let's say I have a long dataframe that has multiple values (temp
) for the same key (year
). I want to be able to represent it as wide data even in the case where they are multiple occurrences for the same values. How can I best achieve that in R?
It looks like:
# Create long dataset
library(data.table)
country_long <- data.table(country=c("Sweden", "Sweden", "Norway", "Sweden", "Denmark","Sweden", "Sweden", "Denmark", "Norway"), year=c("Y1994", "Y1994","Y1994","Y1995", "Y1995", "Y1995", "Y1996", "Y1996", "Y1996"), temp=c(6,7,3,5,8,11,7,8,7))
country_long
#country year temp
#1 Sweden Y1994 6
#2 Sweden Y1994 7
#3 Norway Y1994 3
#4 Sweden Y1995 5
#5 Denmark Y1995 8
#6 Sweden Y1995 11
#7 Sweden Y1996 7
#8 Denmark Y1996 8
#9 Norway Y1996 7
I would like to make it wide so that when there is more than one temp
, it is put into a list()
. People usually use reshape2
or tidyr
but I am not sure how it works when you have more than one value per key. It would look like:
country_wide <-data.table(country = c("Sweden", "Denmark", "Norway"),Y1994=list(c(6,7),"",3),Y1995=list(c(5,11),8,""),Y1996=c(7,8,7))
country_wide
#country Y1994 Y1995 Y1996
#1 Sweden 6,7 5,11 7
#2 Denmark 8 8
#3 Norway 3 7
Finally, from there how can spread one column with multiple values? Something like tidyr:gather()
could work but I am not sure how this could be applied to this situation.
country_wide_expended <-data.table(country = c("Sweden","Sweden","Denmark", "Norway"),k=c(1,2,1,1), Y1994=list(6,7,"",3),Y1995=list(c(5,11),c(5,11),8,""),Y1996=c(7,7,8,7))
country_wide_expended
#country k Y1994 Y1995 Y1996
#1 Sweden 1 6 5,11 7
#1 Sweden 2 7 5,11 7
#2 Denmark 1 8 8
#3 Norway 1 3 7
I can then use this dataframe to analyze (lm, lme, t-test...) and visualize (ggplot) Y1994
.
(Adapted from https://sejdemyr.github.io/r-tutorials/basics/wide-and-long/)