I inherited a dataset from a colleague that I'd like to analyze in R in long-form (currently wide-format data). Here's a subset of the data:
structure(list(Phase = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 4L,
4L), .Label = c("1_baseline", "2_first_10mins", "3_second_10mins",
"4_final_10mins"), class = "factor"), MaleID = c(1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), Weight_g = c(48.91, 84.81, 48.91, 84.81,
48.91, 84.81, 48.91, 84.81), FemaleID = c(1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L), Female_side = structure(c(2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L), .Label = c("Left", "Right"), class = "factor"), Date = structure(c(2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("7/11/2016 10:53", "7/11/2016 9:26"
), class = "factor"), Trx = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "G1", class = "factor"), Center_All_TotalDistMoved = c(2385,
3284.86, 1612.03, 2656.36, 1674.68, 1174.75, 1543.54, 629.412
), Nose_Left_CumDur = c(112.64, 248.32, 64.24, 91.12, 82.56,
22.8, 71.88, 16.48), Nose_Left_FreqZoneEntry = c(260L, 602L,
60L, 208L, 77L, 16L, 139L, 10L), Nose_Left_LatencyFirstEntry = c(4.6,
0.08, 1571.84, 1560, 2168.16, 2161.96, 2776.28, 2790.72), Nose_Right_CumDur = c(92.4,
40.48, 194.88, 426.96, 188.76, 500.44, 93.92, 559.56), Nose_Right_FreqZoneEntry = c(174L,
100L, 257L, 229L, 277L, 159L, 182L, 54L), Nose_Right_LatencyFirstEntry = c(0.12,
117.44, 1565.2, 1660.72, 2160, 2165.4, 2790.2, 2760.76), Nose_Dead_CumDur = c(327.68,
112.88, 176.84, 25.16, 143.44, 45.32, 314.48, 12.16), Nose_Dead_FreqZoneEntry = c(536L,
722L, 370L, 142L, 437L, 148L, 455L, 52L), Nose_Dead_LatencyFirstEntry = c(0,
0, 1562.92, 1684.04, 2160.6, 2160, 2760, 2760), Center_Left_CumDur = c(73.2,
260.96, 37.24, 102.68, 46.16, 20.32, 32.92, 17.52), Center_Left_FreqZoneEntry = c(23L,
74L, 9L, 8L, 13L, 4L, 14L, 1L), Center_Left_LatencyFirstEntry = c(108.16,
0, 1792.32, 1560, 2170.76, 2162.8, 2804.04, 2792.4), Center_Right_CumDur = c(63.16,
25.44, 154, 415.56, 131.16, 475.4, 64.28, 541.68), Center_Right_FreqZoneEntry = c(10L,
8L, 26L, 26L, 26L, 22L, 29L, 13L), Center_Right_LatencyFirstEntry = c(121.64,
139.04, 1673.12, 1661.84, 2160, 2166.08, 2790.72, 2760), Center_Dead_CumDur = c(441.64,
101.76, 243.56, 62.76, 198, 87.88, 430.28, 32.4), Center_Dead_FreqZoneEntry = c(34L,
72L, 38L, 23L, 37L, 26L, 46L, 8L), Center_Dead_LatencyFirstEntry = c(0,
20.28, 1560, 1706.4, 2160.96, 2160, 2760, 2770.84), Center_All_MeanVelocity = c(3.98271,
5.4766, 2.68744, 4.42786, 2.79113, 1.95818, 2.57738, 1.05056)), class = "data.frame", row.names = c(NA,
-8L))
Within each of columns (specifically, columns 8 - 27), there's 3 factors/measurements that I want to extract:
- Point (variables = Center; Nose);
- Zone (variables = Left; Right; Dead);
- Measurement (several variables: TotalDistMoved, CumDur, FreqZoneEntry, LatencyFirstEntry, MeanVelocity).
Based on two similar posts here (one, two) I tried to use the following code on a subset of my data (just one column first, then to scale up), but I'm very confused by regex, which I believe is part of my problem (not sure how to parse 3 columns/values based on "_" as separator):
df %>%
pivot_longer(Center_All_TotalDistMoved,
names_to = c("Point", ".value"),
names_pattern = "([^_]+)_(.*)")
This does an ok job, and parses out the "Point" factor:
# A tibble: 151 x 9
Phase MaleID Weight_g FemaleID Female_side Date Trx Point All_TotalDistMoved
<fct> <int> <fct> <int> <fct> <fct> <fct> <chr> <dbl>
1 1_baseline 1 48.91 1 Right 7/11/2016 9:26 G1 Center 2385
2 1_baseline 2 84.81 2 Right 7/11/2016 10:53 G1 Center 3285.
3 1_baseline 3 VERIFY_NUMBER 3 Right 7/12/2016 9:50 VEH Center 2397.
4 1_baseline 4 69.15 1 Left 7/12/2016 12:32 VEH Center 4011.
5 1_baseline 5 86.15 2 Left 7/13/2016 9:52 G1 Center 2746.
6 1_baseline 6 60.17 3 Left 7/13/2016 11:30 G1 Center 1994.
7 1_baseline 7 46.82 1 Left 7/14/2016 10:10 VEH Center 2910.
8 1_baseline 8 35.52 2 Left 7/14/2016 11:59 VEH Center 1892.
9 1_baseline 9 29.92 3 Right 7/15/2016 9:54 G1 Center 4823.
10 1_baseline 10 28.96 1 Left 7/15/2016 11:32 G1 Center 2119.
However, I can't scale the above code up within the one column without an error due to my regex ignorance. Further, I'm unsure how to do this same operation across the rest of the columns, and still have the data matched by MaleID
and Phase
. Ideally, this is what the dataframe should look like:
MaleID Weight_g Treatment Phase FemaleID FemaleLocation Date Point Zone TotalDistMoved CumDur FreqZoneEntry LatencyFirstEntry MeanVelocity
1 1 48.91 G1 Baseline 1 Right 7/11/2016 9:26 Center All 2385 NA NA NA 3.98271
2 1 48.91 G1 Baseline 1 Right 7/11/2016 9:26 Nose Left NA 112.64 260 4.60 NA
3 1 48.91 G1 Baseline 1 Right 7/11/2016 9:26 Nose Right NA 92.40 174 0.12 NA
4 1 48.91 G1 Baseline 1 Right 7/11/2016 9:26 Nose Dead NA 327.68 536 0.00 NA
5 1 48.91 G1 Baseline 1 Right 7/11/2016 9:26 Center Left NA 73.20 23 108.16 NA
6 1 48.91 G1 Baseline 1 Right 7/11/2016 9:26 Center Right NA 63.16 10 121.64 NA
Thank you!