0

I have a dataset in the following (tidy) format:

SAMPLE, MARKER, ALLELE, LENGTH, PEAK
BRIS01, B100, allele 1, NA, 126.95
BRIS01, B100, allele 2, 160, 159.72
BRIS01, B100, allele 3, 162, 162.01
BRIS02, B100, allele 1, 152, 151.4
BRIS02, B100, allele 2, NA, NA
BRIS02, B100, allele 3, NA, NA

In total, each sample has an entry for 14 markers and each marker has entries for 5 alleles, even if the entry is just 'NA'. I'm not sure how many samples there are.

I have spent all day trying to restructure it into the following format, so that for each sample all the allele values are next to each other for each marker, but to no avail:

                MARKER 1                              MARKER 2      MARKER 3
      SAMPLE 1, NA, 126.95, 160, 159.72, 162, 162.01, LENGTH, PEAK, LENGTH, PEAK
      SAMPLE 2, 152, 151.4, NA, NA, NA, NA,           LENGTH, PEAK, LENGTH, PEAK

If the format looks a bit messy, hopefully this might help: In each row, there should be 141 columns; the first column should contain the sample name, then from there on the allele length and peak size for 5 alleles of each marker. For example, Sample, marker 1 length 1, marker 1 peak 1, marker 1 length 2, marker 1 peak 2, marker 2 length 1, marker 2 peak 2 etc. It's a little counter-intuitive but imagine that there are column headers for each marker and then sub-columns for the size and peaks of each allele.

I have tried using dpylr, tidy data, melt, cast, dcast, reshape, reshape2, transpose... but I not very good with R and have not had any luck. Using length and peak as subcolumns is probably not very good/tidy data in practise but it's what my boss has requested to interpret the data. Any feedback appreciated!

Thanks!

EDIT: I ran the following code as suggested:

ultra_wide <-
  wide %>%
  group_by(SAMPLE, MARKER) %>%
  gather(key = "VARS", value = "VALS", c(LENGTH, PEAK)) %>%
  spread(MARKER, VALS) %>%
  summarize(MARKER1 = paste(c(B100), collapse = ", "), 
            MARKER2 = paste(c(B132), collapse = ", "),
            MARKER3 = paste(c(BL13), collapse = ", "),
            MARKER4 = paste(c(BT06), collapse = ", "),
            MARKER5 = paste(c(BT09), collapse = ", "),
            MARKER6 = paste(c(BT30), collapse = ", "),
            MARKER7 = paste(c(BTMS0044), collapse = ", "),
            MARKER8 = paste(c(BTMS0067), collapse = ", "),
            MARKER9 = paste(c(BTMS0106), collapse = ", "),
            MARKER10 = paste(c(B116), collapse = ", "),
            MARKER11 = paste(c(B118), collapse = ", "),
            MARKER12 = paste(c(B119), collapse = ", "),
            MARKER13 = paste(c(BT20), collapse = ", "),
            MARKER14 = paste(c(BTMS0114), collapse = ", "))

however, the command didn't do anything as the following error occurred:

Error: Duplicate identifiers for rows (76, 77, 78, 79, 80), (30671, 30672, 30673, 30674, 30675), (81, 82, 83, 84, 85), (30676, 30677, 30678, 30679, 30680)

which continued for several more lines after that.

  • 1
    You meant the values, of LENGTH, PEAK, right? – akrun Apr 17 '19 at 14:44
  • yeah sorry, let me edit that. – yaminatori Apr 17 '19 at 14:48
  • The edited expected output for MARKER2, MARKER3 is not making sense – akrun Apr 17 '19 at 14:51
  • Let me understand. Do you want to transpose the alleles by marker, right? May you give us a simple example of output? Seams RNA-Seq data, do you want to treat by marker, by alleles or both? – Aureliano Guedes Apr 17 '19 at 14:58
  • It's MicroSat data. For each marker, I'd like the allele length and peak size for all 5 alleles next to each other. Then, instead of underneath, the data for the next marker should be to the right. Does that make sense? So for each marker, 10 columns in total. Each row should have 141 columns - the sample ID, followed by the data described above. – yaminatori Apr 17 '19 at 15:21

1 Answers1

0

Data Entry

First of all, please submit the code that recreates your dataframe so that it is easy for the next person to just copy-and-paste the code and see the dataframe for themselves. In here I am just attempting to recreate the dataframe as to your specifications, specifically the part where you mention that there are five alleles per marker.

# Vectors for dataframe

library(tidyverse)

SAMPLE <- c(rep("BRIS01", 5), rep("BRIS02", 5))
MARKER <- c(rep("B100", 5), rep("B200", 5))
ALLELE <- rep(paste("allele",1:5), times = 2)
LENGTH <- c(NA, 160, 162, 152, NA, NA, 160:163)
PEAK <- c(126.95,   159.72, 162.01, 151.4,  NA, NA, 150:153)

marker_data <- data.frame(SAMPLE, MARKER, ALLELE, LENGTH, PEAK, stringsAsFactors = FALSE)

marker_data
#>    SAMPLE MARKER   ALLELE LENGTH   PEAK
#> 1  BRIS01   B100 allele 1     NA 126.95
#> 2  BRIS01   B100 allele 2    160 159.72
#> 3  BRIS01   B100 allele 3    162 162.01
#> 4  BRIS01   B100 allele 4    152 151.40
#> 5  BRIS01   B100 allele 5     NA     NA
#> 6  BRIS02   B200 allele 1     NA     NA
#> 7  BRIS02   B200 allele 2    160 150.00
#> 8  BRIS02   B200 allele 3    161 151.00
#> 9  BRIS02   B200 allele 4    162 152.00
#> 10 BRIS02   B200 allele 5    163 153.00

Note that in data.frame I am passing the option stringsAsFactors = FALSE since dealing with factor variables tends to be very tricky.

"Spreading" Your Data

As for your output, I am outputting what you've shown in tabular form as your desired outcome. Without more data, it is hard to get the 141 columns that you want per row. The key to getting your answer is to "spread" the MARKER column after "gathering" (or "melting" as it more commonly referred) the columns that have 'values', namely the LENGTH and PEAK columns. Before spreading; however, you should create a column with unique values in case the spreading encounters identical rows. Finally, you have to summarize to get one row for each sample, although you would want loop through the MARKER1-MARKER14 columns for more optimum/efficient code. Anyways I hope this helps.

marker_m <- 
  marker_data %>% 
  group_by(SAMPLE, MARKER) %>%
  gather(VARS, VALS, c(LENGTH, PEAK)) %>%
  mutate(i = row_number()) %>%
  spread(MARKER, VALS) %>% 
  summarize(MARKER1 = paste(c(B100), collapse = ", "), MARKER2 = paste(c(B200), collapse = ", "))

marker_m
#> # A tibble: 2 x 3
#>   SAMPLE MARKER1                                                  MARKER2 
#>   <chr>  <chr>                                                    <chr>   
#> 1 BRIS01 NA, 126.95, 160, 159.72, 162, 162.01, 152, 151.4, NA, NA NA, NA,~
#> 2 BRIS02 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA                   NA, NA,~
Diego
  • 392
  • 3
  • 16
  • Unfortunately there is no code to produce the data frame, it is simply read in by an excel file. It looks very similar to the marker_data that you produced, except that there are 14 markers detected in each sample (the same markers in each sample rather than marker 1 -> sample 1, marker 2 -> sample 2). Thanks for the help with the data spreading, I'll see what I can make of it. – yaminatori Apr 23 '19 at 08:03
  • I see why you are getting that problem. I recommend you read on the following respose: https://stackoverflow.com/a/45898919/5425826. Simply put, your data cannot be spread because the MARKER column cannot 'uniquely` identify the row, which is why you are seeing the 'Duplicate Identifiers for rows` error message. Can you post your question with the result of dput(wide)? – Diego Apr 23 '19 at 14:12
  • the output of dput(wide) in the console is truncated as it's too large, but the last few lines read: "299.66", "299.82", "301.52", "302.36", "302.37", "302.38", "302.4", "304.88"), class = "factor")), row.names = c(NA, 30595L), class = "data.frame") when I open the data frame with view() it just shows me the same table as wide. – yaminatori Apr 24 '19 at 13:12
  • OK. Well I'm going to modify my response and include `mutate(i = row_number())` right before spreading to make sure we have unique rows. But it's hard for another person to know what happens if you don't post a reproducible example. – Diego Apr 24 '19 at 16:36