Data wrangling: Convert long format of sequences to wide format of specific subsequences

Question

I am studying state changes for multiple subject sessions. My original dataset was formatted long, with a separate row for each state for each session. Something like:

Session StateCount  State
1       0           B
1       1           C
1       2           B
1       3           B
1       4           A
…       …           …
56      26          A
56      27          B

Using the tidyR spread function in R...

d0_Spread <- spread(d0, key = "StateCount", value = "State")

...I was able to convert the data to a wide format (1 row per session, 1 column per state in the session), which was needed for some sequence-based calculations:

Session 0   1   2   3   4   …   [Nth State of Longest Session]
1       B   C   B   B   A   …                NA
…       …   …   …   …   …   …                …
56      A   C   B   C   C   …                NA

I now want to run the same calculations on two different types of subsequences within a given session (rather than across an entire session). However, I'm having difficulty wrangling the required data into wide format.

Below is the dput() output for a representative sample of the new long-format dataset:

structure(list(Session = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 56L, 56L, 56L, 56L, 56L, 56L, 56L, 56L, 56L
), StateCount = c(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
11L, 12L, 13L, 0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L), State = structure(c(2L, 
3L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 3L, 
1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A", "B", "C"), class = "factor"), 
    State_AType = structure(c(5L, 6L, 5L, 1L, 5L, 1L, 5L, 5L, 
    2L, 5L, 1L, 5L, 6L, 4L, 1L, 5L, 6L, 3L, 5L, 2L, 5L, 1L, 5L
    ), .Label = c("A", "A_L", "A_S", "A4", "B", "C"), class = "factor"), 
    State_AType_629Num = structure(c(9L, 10L, 9L, 3L, 9L, 4L, 
    9L, 9L, 1L, 9L, 5L, 9L, 10L, 6L, 7L, 9L, 10L, 2L, 9L, 1L, 
    9L, 8L, 9L), .Label = c("A_L", "A_S", "A1", "A2", "A3", "A4", 
    "A628", "A629", "B", "C"), class = "factor"), StateDuration = c(1L, 
    1L, 37L, 147L, 32L, 42L, 24L, 2L, 8L, 1L, 17L, 8L, 8L, 2L, 
    297L, 1L, 11L, 73L, 27L, 28L, 46L, 14L, 127L)), class = "data.frame", row.names = c(NA, 
-23L))

It, like the first dataset, has a row for each state within each session. It contains the same data columns as my first dataset (e.g., Session, StateCount, State), but also a few additional columns:

In addition to State (which codes each state as either "A", "B", or "C"), there is State_AType, which differentiates between a normal "A" (of which there are 629) and two alternative A types ("A_S" and "A_L"), a distinction that matters below.
The State_AType_629Num is a repeat column, but it provides a unique identifier for the 629 normal "A" states (e.g., A1, A2...A629), in case helpful for the below.
The State_Duration column captures the time duration of the state (needed for the below).

Using this dataset, I now need to create the following wide formats for the intended subsequence analyses:

1) First state to each normal "A" state - a wide format in which each row represents a sequence from the first state of a session to each "A" state of the session. There would be 629 such rows (one for each normal "A" only; that is, not for "A_S" or "A_L" states). Each column value would be a state within the sequence. Complicating things a bit further, I'd also like to calculate the total duration of the sequence. This means the StateDuration variable would also need to be accounted for, so that each row of the output notes the sum of StateDuration for all of the states in that sequence. For the sample data provided above, the desired data frame would be structured as follows, producing the listed values:

SeqEndpnt      0    1   2   3   4   5   …   [Nth State of Longest Sequence] SumDuration
A1             B    C   B   NA  NA  NA  …                NA                     39
A2             B    C   B   A   B   A   …                NA                     218
A3             B    C   B   A   B   A   …                NA                     295
A4             B    C   B   A   B   A   …                NA                     328

…              …    …   …   …   …   …   …                …                      …
A628           NA   NA  NA  NA  NA  NA  …                NA                     NA
A629           A    B   C   A_S B   A_L …                NA                     483

2) Previous "A" or "A_L" state to next "A" state - a wide format in which each row represents a sequence from the previous "A" or "A_L" (whichever was more recent) to the next normal "A" (importantly, an "A_L" state could begin a sequence but cannot terminate it). As above, there would be 629 such rows, with the columns again representing a state in the sequence. The first sequence within a given session would likely be blank, as there would be no "A" previous to the first "A"; the exception would be if there were an "A_L" that preceded the first "A". For the sample data provided above, the desired data frame would be structured as follows, producing the listed values:

SeqEndpnt       0   1   2   3   4   5   …   [Nth State of Longest Sequence] SumDuration
A1              NA  NA  NA  NA  NA  NA  …                NA                     NA
A2              B   NA  NA  NA  NA  NA  …                NA                     32
A3              B   B   A_L B   NA  NA  …                NA                     35
A4              B   C   NA  NA  NA  NA  …                NA                     16
…               …   …   …   …   …   …   …                …                      …
A628            NA  NA  NA  NA  NA  NA  …                NA                     NA
A629            B   C   A_S B   A_L B   …                NA                     186

I assume the solution this time will require a combination of the spread function used earlier with some for loops. However, I'm not especially experienced with the latter, particularly in the context of iteratively adding values to a data frame. Alternatively, I thought reshape2, group_by, or mutate may assist with achieving the desired sets, but am not familiar.

Any help would be extremely appreciated!

See [mcve] and [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). There's a lot of text here, but not a whole lot of code or data (very preferable for data to be included in the post, instead of downloaded from a third-party site that we may or may not trust). — camille, Jul 23 '19 at 16:14
Actually, long format is the preferred format in nearly any analytical method. Wide is usually for reporting. Please explain the *subsequence analyses*. Couldn't a regular filter/subset work? — Parfait, Jul 23 '19 at 16:33
Thanks, @camille . I'm wasn't sure what additional code to include as I don't yet have an approach (see last paragraph). As for data, I've added examples of subsets of these rather long data frames (thousands of rows in long format, 500+ columns in wide format). As such, `dput()` didn't seem like the way to go, but if there's another suggested means of sharing more, please let me know. — jjcii, Jul 23 '19 at 17:48
Thanks, @Parfait . I've updated the post to show the desired wide format. This formatting, to my knowledge, is what is required for use with the [TraMineR package](https://cran.r-project.org/web/packages/TraMineR/index.html) for analyzing the sequences. — jjcii, Jul 23 '19 at 17:54
You can call `dput` on a representative sample of data, as suggested in the second link I posted — camille, Jul 23 '19 at 18:45
Thanks, @camille I have replaced the in-text subset with the `dput()` output of a fictitious sample. This sample includes the different edge cases that I noted need to be covered by solutions for the two desired sequencings. Additionally, the in-text tables at the end of the post (one for each sequencing) now correspond with the new sample - that is, working solutions employing the `dput()` output should produce data frames with the same values presented here. Finally, I have removed links to the original data files, so that all needed information is provided within the post. Thanks again. — jjcii, Jul 23 '19 at 21:09
@camille Thanks again. With the example now in place, any thoughts on an approach? — jjcii, Jul 29 '19 at 14:22

Data wrangling: Convert long format of sequences to wide format of specific subsequences

0 Answers0