2

From my instrumentation, I receive two different .tsv files containing my data. The first file contains, among other things, the name of the sample, its position in a 12x8 grid, and its output data. The second file contains average data from replicate sets based off the first text file. I've re-created an example of the two files in these data frames -- I actually read them using the read.table() function.

#re-creation of first .tsv file
Data <- data.frame(Name = c("100", "100", "200", "250", "1E5", "1E5", "Negative", "Negative"),
                   Pos = c("A3", "A4", "B3", "B4", "C3", "C4", "D3", "D4"),
                   Output = c("20.00", "20.10", "21.67", "23.24", "21.97", "22.03", "38.99", "38.99"))

Data
      Name Pos Output
1      100  A3  20.00
2      100  A4  20.10
3      200  B3  21.67
4      250  B4  23.24
5      1E5  C3  21.97
6      1E5  C4  22.03
7 Negative  D3  38.99
8 Negative  D4  38.99

#re-creation of second .tsv file
Replicates <- data.frame(Replicates = c("A3, A4", "C3, C4", "D3, D4"),
                         Mean.Cq = c(20.05, 22.00, 38.99)
                         STD.Cq = c(0.05, 0.03, 0.00))

Replicates

  Replicates Mean.Cq STD.Cq
1     A3, A4 20.05   0.05
2     C3, C4 22.00   0.03
3     D3, D4 38.99   0.00

This is what I'm trying to create:

#Rename values in Replicates$Name with value in Data$Name if replicate is present; append with non-replicate data 

          Name  Mean.Cq STD.Cq
1          100  20.05   0.05
2          1E5  22.00   NA
3     Negative  38.99   NA
4          200  21.67   0.03
5          250  23.24   0.00

I can do this manually by creating a dataframe using stringr and rbind.fill from slices of the "Data" dataframe such that I keep the first instance of each name of the replicates, then remove the $Replicates column from the "Replicates" dataframe and replace it with the $Name column of the newly-created sliced dataframe. I can then append the rows of non-replicate samples to the "Replicates" dataframe. However, not all of my files have the exact same pattern of replicates, or number of samples.

I have been trying in vein to mimic this example such that I can do this process for each file set regardless of the order or number of replicates, instead of going through each one and cleaning by hand. How do I transform this manual process into a for loop to keep from having to make a bunch of sliced dataframes?

The first part of my problem I think has been the ability to detect only part of the Replicates$Replicates pattern in Data$Name , not just the individual characters For example detect either A3 OR A4 from Replicates$Replicates[1] in Data$Name, then replace the value of Replicates$Replicates with the value of the first match found in Data$Name. I'm stuck at this step.

> str_replace(Replicates$Replicates, (str_detect(Data$Name, "[Replicates$Replicates]")))
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Any insight would be super helpful! Sill new to programming, bioinformatics, and data science and I'm trying to figure it out as I go on my data.

EDIT

Thank you, @Skaqqs for helping to answer the question. I made edits from his answer to fit this in the tidyverse, which I have been finding a bit easier to adapt to than base R. Splitting the replicates into two columns, then sorting and joining did the trick (and that's where I was getting stuck).

require(tidyverse)

Samples <- tibble(Name = c("100", "100", "200", "250", "1E5", "1E5", 
                            "Negative", "Negative"),
                  Pos = c("A3", "A4", "B3", "B4", "C3", "C4", "D3", "D4"),
                  Output = c("20.00", "20.10", "21.67", "23.24", "21.97", 
                           "22.03", "38.99", "38.99")) 

Replicates <- tibble(Replicates = c("A3, A4", "C3, C4", "D3, D4"),
                     Mean.Cq = c(20.05, 22.00, 38.99),
                     STD.Cq = c(0.05, 0.03, 0.00))

Samples %>%
        .[str_order(.$Pos, numeric = TRUE),]    

Replicates %>%  
        mutate("R1" = gsub(x = Replicates, pattern = "^(.*),.*", replacement = "\\1")) %>%
        mutate("R2" = gsub(x = Replicates, pattern = ".*,\\s(.*)", replacement = "\\1")) %>%
        pivot_longer(cols = c("R1", "R2"), names_to ="Well Pairs", values_to = "Wells") %>% 
        select("Mean.Cq", "STD.Cq", "Wells") %>% 
        relocate("Wells", 1) %>%
        right_join(Samples, by = c("Wells"="Pos")) %>%
        .[str_order(.$Wells, numeric = TRUE),] %>% 
        select("Name", "Mean.Cq", "STD.Cq") %>% 
        distinct(Name, .keep_all = TRUE)    
        
    # A tibble: 5 x 3
      Name     Mean.Cq STD.Cq
      <chr>      <dbl>  <dbl>
    1 100         20.0   0.05
    2 200         NA    NA   
    3 250         NA    NA   
    4 1E5         22     0.03
    5 Negative    39.0   0 
allisonrs
  • 102
  • 8

1 Answers1

2

This sounds like a join/merge question to me. My suggestion is to split Replicates$Replicates into two fields and essentially treat their data separately too. Then after joining your two Replicates tables with Data, use unique() to drop duplicates in your summary table.

library(dplyr)

# Split `Replicates$Replicates` into two fields
# This assumes your `Replicates` field is has two values, seperated by a comma and whitespace 
Replicates$R1 <- gsub(x = Replicates$Replicates, pattern = "^(.*),.*", replacement = "\\1")
Replicates$R2 <- gsub(x = Replicates$Replicates, pattern = ".*,\\s(.*)", replacement = "\\1")

# Inner-join `Data` and `Replicates` by `R1` and `R2`
df <- merge(Data, Replicates,  by.x = "Pos", by.y = "R1", all.x = FALSE)
df2 <- merge(Data, Replicates, by.x = "Pos", by.y = "R2", all.x = FALSE)
df3 <- dplyr::bind_rows(df, df2)

unique(df3[,c("Name", "Mean.Cq", "STD.Cq")])
#>       Name Mean.Cq STD.Cq
#> 1      100   20.05   0.05
#> 2      1E5   22.00   0.03
#> 3 Negative   38.99   0.00
Skaqqs
  • 4,010
  • 1
  • 7
  • 21
  • Awesome, got this to work with some adaptations for use in the tidyverse. The gsub was key, though I'm still confused about how exactly it works. In the context of gsub(old, new, string): old = (x=Replicates$Replicates), new = pattern, and string = replacement. -- How exactly does this pattern work? Why does the pattern for Replicates$R1 need the ^ and $R2 need \\s? -- How does "\\1" in replacement work? Does it take the pattern that is in parentheses? Thank you again! – allisonrs Sep 17 '21 at 23:58
  • 1
    Great, glad my answer helped! `gsub()` replaces `pattern` with `replacement` and `pattern` can be a regular expression, like I used in my answer. There are tons of regex resources online including the surprisingly entertaining regexcrossword.com, if you want to dive deeper. `^` is the start of the string and `\\s` is whitespace. There are probably a few patterns that get the right values from Replicates$Replicates; this one just happened to work for your sample data and is specific/flexible enough. Yep, "\\1" corresponds to whatever is matched in the first parenthesis in `pattern`. – Skaqqs Sep 18 '21 at 00:10
  • 2
    If you feel your question was answered and to help people find the answer in the future, would you mind accepting my answer, or posting/accepting your own with your tidyverse adaptations. Thanks! (Happy to answer any more questions too.) – Skaqqs Sep 18 '21 at 00:13