I am studying state changes for multiple subject sessions. My original dataset was formatted long, with a separate row for each state for each session. Something like:
Session StateCount State
1 0 B
1 1 C
1 2 B
1 3 B
1 4 A
… … …
56 26 A
56 27 B
Using the tidyR
spread
function in R...
d0_Spread <- spread(d0, key = "StateCount", value = "State")
...I was able to convert the data to a wide format (1 row per session, 1 column per state in the session), which was needed for some sequence-based calculations:
Session 0 1 2 3 4 … [Nth State of Longest Session]
1 B C B B A … NA
… … … … … … … …
56 A C B C C … NA
I now want to run the same calculations on two different types of subsequences within a given session (rather than across an entire session). However, I'm having difficulty wrangling the required data into wide format.
Below is the dput()
output for a representative sample of the new long-format dataset:
structure(list(Session = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 56L, 56L, 56L, 56L, 56L, 56L, 56L, 56L, 56L
), StateCount = c(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
11L, 12L, 13L, 0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L), State = structure(c(2L,
3L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 3L,
1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A", "B", "C"), class = "factor"),
State_AType = structure(c(5L, 6L, 5L, 1L, 5L, 1L, 5L, 5L,
2L, 5L, 1L, 5L, 6L, 4L, 1L, 5L, 6L, 3L, 5L, 2L, 5L, 1L, 5L
), .Label = c("A", "A_L", "A_S", "A4", "B", "C"), class = "factor"),
State_AType_629Num = structure(c(9L, 10L, 9L, 3L, 9L, 4L,
9L, 9L, 1L, 9L, 5L, 9L, 10L, 6L, 7L, 9L, 10L, 2L, 9L, 1L,
9L, 8L, 9L), .Label = c("A_L", "A_S", "A1", "A2", "A3", "A4",
"A628", "A629", "B", "C"), class = "factor"), StateDuration = c(1L,
1L, 37L, 147L, 32L, 42L, 24L, 2L, 8L, 1L, 17L, 8L, 8L, 2L,
297L, 1L, 11L, 73L, 27L, 28L, 46L, 14L, 127L)), class = "data.frame", row.names = c(NA,
-23L))
It, like the first dataset, has a row for each state within each session. It contains the same data columns as my first dataset (e.g., Session
, StateCount
, State
), but also a few additional columns:
- In addition to
State
(which codes each state as either "A", "B", or "C"), there isState_AType
, which differentiates between a normal "A" (of which there are 629) and two alternative A types ("A_S" and "A_L"), a distinction that matters below. - The
State_AType_629Num
is a repeat column, but it provides a unique identifier for the 629 normal "A" states (e.g., A1, A2...A629), in case helpful for the below. - The
State_Duration
column captures the time duration of the state (needed for the below).
Using this dataset, I now need to create the following wide formats for the intended subsequence analyses:
1) First state to each normal "A" state - a wide format in which each row represents a sequence from the first state of a session to each "A" state of the session. There would be 629 such rows (one for each normal "A" only; that is, not for "A_S" or "A_L" states). Each column value would be a state within the sequence. Complicating things a bit further, I'd also like to calculate the total duration of the sequence. This means the StateDuration
variable would also need to be accounted for, so that each row of the output notes the sum of StateDuration
for all of the states in that sequence. For the sample data provided above, the desired data frame would be structured as follows, producing the listed values:
SeqEndpnt 0 1 2 3 4 5 … [Nth State of Longest Sequence] SumDuration
A1 B C B NA NA NA … NA 39
A2 B C B A B A … NA 218
A3 B C B A B A … NA 295
A4 B C B A B A … NA 328
… … … … … … … … … …
A628 NA NA NA NA NA NA … NA NA
A629 A B C A_S B A_L … NA 483
2) Previous "A" or "A_L" state to next "A" state - a wide format in which each row represents a sequence from the previous "A" or "A_L" (whichever was more recent) to the next normal "A" (importantly, an "A_L" state could begin a sequence but cannot terminate it). As above, there would be 629 such rows, with the columns again representing a state in the sequence. The first sequence within a given session would likely be blank, as there would be no "A" previous to the first "A"; the exception would be if there were an "A_L" that preceded the first "A". For the sample data provided above, the desired data frame would be structured as follows, producing the listed values:
SeqEndpnt 0 1 2 3 4 5 … [Nth State of Longest Sequence] SumDuration
A1 NA NA NA NA NA NA … NA NA
A2 B NA NA NA NA NA … NA 32
A3 B B A_L B NA NA … NA 35
A4 B C NA NA NA NA … NA 16
… … … … … … … … … …
A628 NA NA NA NA NA NA … NA NA
A629 B C A_S B A_L B … NA 186
I assume the solution this time will require a combination of the spread
function used earlier with some for
loops. However, I'm not especially experienced with the latter, particularly in the context of iteratively adding values to a data frame. Alternatively, I thought reshape2
, group_by
, or mutate
may assist with achieving the desired sets, but am not familiar.
Any help would be extremely appreciated!