I have a dataset in the following (tidy) format:
SAMPLE, MARKER, ALLELE, LENGTH, PEAK
BRIS01, B100, allele 1, NA, 126.95
BRIS01, B100, allele 2, 160, 159.72
BRIS01, B100, allele 3, 162, 162.01
BRIS02, B100, allele 1, 152, 151.4
BRIS02, B100, allele 2, NA, NA
BRIS02, B100, allele 3, NA, NA
In total, each sample has an entry for 14 markers and each marker has entries for 5 alleles, even if the entry is just 'NA'. I'm not sure how many samples there are.
I have spent all day trying to restructure it into the following format, so that for each sample all the allele values are next to each other for each marker, but to no avail:
MARKER 1 MARKER 2 MARKER 3
SAMPLE 1, NA, 126.95, 160, 159.72, 162, 162.01, LENGTH, PEAK, LENGTH, PEAK
SAMPLE 2, 152, 151.4, NA, NA, NA, NA, LENGTH, PEAK, LENGTH, PEAK
If the format looks a bit messy, hopefully this might help: In each row, there should be 141 columns; the first column should contain the sample name, then from there on the allele length and peak size for 5 alleles of each marker. For example, Sample, marker 1 length 1, marker 1 peak 1, marker 1 length 2, marker 1 peak 2, marker 2 length 1, marker 2 peak 2 etc. It's a little counter-intuitive but imagine that there are column headers for each marker and then sub-columns for the size and peaks of each allele.
I have tried using dpylr, tidy data, melt, cast, dcast, reshape, reshape2, transpose... but I not very good with R and have not had any luck. Using length and peak as subcolumns is probably not very good/tidy data in practise but it's what my boss has requested to interpret the data. Any feedback appreciated!
Thanks!
EDIT: I ran the following code as suggested:
ultra_wide <-
wide %>%
group_by(SAMPLE, MARKER) %>%
gather(key = "VARS", value = "VALS", c(LENGTH, PEAK)) %>%
spread(MARKER, VALS) %>%
summarize(MARKER1 = paste(c(B100), collapse = ", "),
MARKER2 = paste(c(B132), collapse = ", "),
MARKER3 = paste(c(BL13), collapse = ", "),
MARKER4 = paste(c(BT06), collapse = ", "),
MARKER5 = paste(c(BT09), collapse = ", "),
MARKER6 = paste(c(BT30), collapse = ", "),
MARKER7 = paste(c(BTMS0044), collapse = ", "),
MARKER8 = paste(c(BTMS0067), collapse = ", "),
MARKER9 = paste(c(BTMS0106), collapse = ", "),
MARKER10 = paste(c(B116), collapse = ", "),
MARKER11 = paste(c(B118), collapse = ", "),
MARKER12 = paste(c(B119), collapse = ", "),
MARKER13 = paste(c(BT20), collapse = ", "),
MARKER14 = paste(c(BTMS0114), collapse = ", "))
however, the command didn't do anything as the following error occurred:
Error: Duplicate identifiers for rows (76, 77, 78, 79, 80), (30671, 30672, 30673, 30674, 30675), (81, 82, 83, 84, 85), (30676, 30677, 30678, 30679, 30680)
which continued for several more lines after that.