I posted last week about how to replace values in one dataframe with values from another dataframe when conditions are met, which gave me my desired result on a single set of data (I have since solved that). Now I am trying to create a loop that can be executed for all the files I have in a folder.

Briefly, each set of data has a matching pair of .tsv files: one with the raw data, and one with the means of the replicates (specified in the instrument software prior to exporting the .tsv files). An example pairing would look like "072621Liver1.tsv" (the raw data file) and "072621Liver1_replicates.tsv" (the means data). The previous example I posted describes how I created a single tibble from the two files.
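The pairing described above can be sketched in base R by deriving each partner name from the raw file name. The file names below are hypothetical examples that follow the naming scheme in the question, and the `pairs` data frame is only an illustration, not part of the original script.

```r
# Hypothetical file names following the naming scheme described above.
all_tsv <- c("072621Liver1.tsv", "072621Liver1_replicates.tsv",
             "072621Liver2.tsv", "072621Liver2_replicates.tsv")

# Replicate files end in "_replicates.tsv"; everything else is raw data.
is_replicate <- grepl("_replicates\\.tsv$", all_tsv)
raw_files <- all_tsv[!is_replicate]

# Derive the expected partner name from each raw file name.
pairs <- data.frame(
  raw = raw_files,
  replicates = sub("\\.tsv$", "_replicates.tsv", raw_files),
  stringsAsFactors = FALSE
)

# Flag raw files whose partner is missing from the listing.
pairs$has_partner <- pairs$replicates %in% all_tsv
print(pairs)
```

Building the pairs from the names, rather than from two separately sorted listings, means a missing or extra file shows up as `has_partner == FALSE` instead of silently shifting the pairing.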

Now, I am struggling to batch process all the paired files in my data set. I posted my best solution to this so far, but I'm still getting error messages like the one below.

(Error: Cannot open file for writing: 'C:\Users\asmit\Desktop\pratice_files\072621Liver1.tsv')

If I don't get that message and the code runs, the .csv files I am trying to write are not created: they don't show up in the folder after the script executes. I know something is off, but I can't put my finger on what exactly. My best attempt so far is posted below. Any help getting this to actually run would be really appreciated! I feel like I'm close to the answer but can't quite get there.
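A "Cannot open file for writing" error on an input file can occur when the computed output name ends up identical to the input name, so the script tries to write over the file it is reading. Whether that is what happens here is an assumption, and `cleaned_name()` is a hypothetical helper that does not appear in the original script. A minimal guard might look like this:

```r
# Build the output name for a cleaned file and refuse to reuse the input path.
# `cleaned_name()` is a hypothetical helper, not part of the original script.
cleaned_name <- function(x) {
  # Replace a trailing ".tsv" (literal dot, end of string) with "_cleaned.csv".
  new_file <- sub("\\.tsv$", "_cleaned.csv", x)
  if (identical(new_file, x)) {
    stop("Output name equals input name for: ", x)
  }
  new_file
}

cleaned_name("072621Liver1.tsv")
```

Stopping early with an explicit message makes a pattern that fails to match visible before any file is touched.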

###import .tsv of PCR results, formatting, natural sort. Export cleaned file as .csv.
###import .tsv of Replicates file, split $Samples column in two and re-combine in Replicates.
###Export modified Replicates tibble to new .csv file.

##environment setup; change folder accordingly. Install tidyverse if needed.
#setwd("C:/Users/asmit/Desktop/pratice_files/pratice/")
##install.packages("tidyverse")

library(tidyverse)

### lists of all sets of files (lists are the same length)
singlet_files <- list.files(path = ".", pattern = "[^replicates]\\.tsv")
singlet_cleaned <- list.files(path = ".", pattern = "[_cleaned]\\.csv")
matching_pair_files <- list.files(path = ".", pattern = "[replicates]\\.tsv")


tibble_singlet <- function(x) { ###function to create tibble from singlet files
  cleanup_tibble <- as_tibble(read_tsv(x, col_names = TRUE, skip = 1))
}

singlet_cleanup <- function(x) { ##function to clean singlet files
  new_file <- str_replace(x, "[.*].tsv", "_cleaned.csv")
  tibble_singlet(x) %>%
    select("Pos", "Name", "Cp", "Concentration") %>%
    .[str_order(.$Pos, numeric = TRUE),] %>%
    write_csv(file = new_file)
}

lapply(singlet_files, singlet_cleanup) ## <- run (singlet_cleanup) on files in singlet_files. 
                                       ##I get the error code here. If I skip over this part 
                                       ##and only run the second half (below) it works,     
                                       ##but I don't get any output from it.



cleaned_tibble <- function(y) { ##function to read cleaned .csv files as tibble
  Pos_tibble <- as_tibble(read_csv(y, col_names = TRUE)) 
}

match <- function(m){ ##function to make tibble of replicate file
  match_tibble <- as_tibble(read_tsv(m, col_names = TRUE, skip = 1))
}

merged <- function(m,y){ ##function to merge match tibble with specific column of cleaned_tibble tibble
  organ <- regmatches(m, regexpr("(Liver|Lung|Kidney|Spleen)", m))
  output_file <- gsub(".*replicates.tsv", ".*final.csv", m)
  match(m) %>%
    mutate("R1" = gsub(x = .$Samples, pattern = "^(.*),.*", replacement = "\\1")) %>%
    mutate("R2" = gsub(x = .$Samples, pattern = ".*,\\s(.*)", replacement = "\\1")) %>%
    pivot_longer(cols = c("R1", "R2"), names_to ="Well Pairs", values_to = "Wells") %>%
    select("MeanCp", "STD Cp", "Mean conc", "STD conc", "Wells") %>%
    relocate("Wells", 1) %>%
    right_join((cleaned_tibble(y)), by = c("Wells"="Pos")) %>%
    .[str_order(.$Wells, numeric = TRUE),] %>%
    select("Name", "MeanCp", "STD Cp", "Mean conc", "STD conc") %>%
    distinct(Name, .keep_all = TRUE) %>%
    add_column(Organ = organ) %>%
    write_csv(file = output_file)
}

map2(m=matching_pair_files, y=singlet_cleaned, ~merged(m,y)) ##I feel like this isn't correct, 
                                                             ##but don't know how to fix it to 
                                                             ##actually process correctly 

EDIT: Breaking the code up into two parts. Corrected the attempted regex and the error message.

First part (which now works thanks to @scrameri)

###import .tsv of PCR results, formatting, natural sort. Export cleaned file as .csv.
###import .tsv of Replicates file, split $Samples column in two and re-combine in Replicates.
###Export modified Replicates tibble to new .csv file.

##environment setup; change folder accordingly. Install tidyverse if needed.
#setwd("C:/Users/asmit/Desktop/pratice_files/pratice/")
##install.packages("tidyverse")

library(tidyverse)

### lists of all sets of files (lists are the same length)
singlet_files <- list.files(path = ".", pattern = "[^replicates]\\.tsv")
singlet_cleaned <- list.files(path = ".", pattern = "[_cleaned]\\.csv")
matching_pair_files <- list.files(path = ".", pattern = "[replicates]\\.tsv")


tibble_singlet <- function(x) { ###function to create tibble from singlet files
  cleanup_tibble <- as_tibble(read_tsv(x, col_names = TRUE, skip = 1))
}

singlet_cleanup <- function(x) { ##function to clean singlet files
  new_file <- str_replace(x, "(.*).tsv", "\\1_cleaned.csv")
  tibble_singlet(x) %>%
    select("Pos", "Name", "Cp", "Concentration") %>%
    .[str_order(.$Pos, numeric = TRUE),] %>%
    write_csv(file = new_file)
}

lapply(singlet_files, singlet_cleanup) ##run (singlet_cleanup) on files in singlet_files
#> list()

Second part

cleaned_tibble <- function(y) { ##function to read cleaned .csv files as tibble
  Pos_tibble <- as_tibble(read_csv(y, col_names = TRUE)) 
}

match <- function(m){ ##function to make tibble of replicate file
  match_tibble <- as_tibble(read_tsv(m, col_names = TRUE, skip = 1))
}

merged <- function(m,y){ ##function to merge match tibble with specific column of cleaned_tibble tibble
  organ <- regmatches(m, regexpr("(Liver|Lung|Kidney|Spleen)", m))
  output_file <- str_replace(m, "(.*)_replicates.tsv", "\\1_final.csv")
  match(m) %>%
    mutate("R1" = gsub(x = .$Samples, pattern = "^(.*),.*", replacement = "\\1")) %>%
    mutate("R2" = gsub(x = .$Samples, pattern = ".*,\\s(.*)", replacement = "\\1")) %>%
    pivot_longer(cols = c("R1", "R2"), names_to ="Well Pairs", values_to = "Wells") %>%
    select("MeanCp", "STD Cp", "Mean conc", "STD conc", "Wells") %>%
    relocate("Wells", 1) %>%
    right_join((cleaned_tibble(y)), by = c("Wells"="Pos")) %>%
    .[str_order(.$Wells, numeric = TRUE),] %>%
    select("Name", "MeanCp", "STD Cp", "Mean conc", "STD conc") %>%
    distinct(Name, .keep_all = TRUE) %>%
    add_column(Organ = organ) %>%
    write_csv(file = output_file)
}

map2(m=matching_pair_files, y=singlet_cleaned, merged(m,y))
#> Error in map2(m = matching_pair_files, y = singlet_cleaned, merged(m, : could not find function "map2"

Created on 2021-09-22 by the reprex package (v2.0.1)


allisonrs
  • Hi, did you check your regular expression in `singlet_cleanup()` with an example `x`? If I run `str_replace("my.file.tsv", "[.*].tsv", "_cleaned.csv")`, the output is `"my.file.tsv"`, so you'd overwrite your original file! Perhaps you want to use regex groups like this: `str_replace("my.file.tsv", "(.*).tsv", "\\1_cleaned.csv")`, which gives `"my.file_cleaned.csv"`? – scrameri Sep 21 '21 at 21:17
  • @scrameri thanks for that regex tip! The example `x` worked once I changed it. However, when running `lapply(singlet_files, singlet_cleanup)` so it goes through all the values in my list in `singlet_files`, I get the error message. Possible that my list is not formatted correctly? It looks like this: `[1] "072621Liver1.tsv" "072621Liver2.tsv"` so it may only be pulling one of the files instead of looping through all of them. – allisonrs Sep 21 '21 at 21:38
  • Hi, I'm really not sure what's going on with that error. But where is the closing ', and why is there a "." at the end of your error message? `'C:\Users\asmit\Desktop\pratice_files\072621Liver1.tsv.)` Perhaps you could copy your code up to and including the `lapply` call into memory (Cmd+C or Ctrl+C), and then run reprex::reprex() and post the result? – scrameri Sep 21 '21 at 21:44
  • RE error message: Didn't copy and paste it correctly. It has been fixed. I also posted the corrected code incorporating your regex formatting. The first part is now working, but the second part isn't and I feel that it has something to do with the way I am calling in (map2, don't think it is correct though). Thanks for helping with this!! – allisonrs Sep 22 '21 at 14:16
  • AH I figured it out. I needed to use mapply. Will post an answer to my question. Thank you so much for putting me on the right track. Regex stuff helped a lot for sure. – allisonrs Sep 22 '21 at 15:14
  • ok great! Maybe check that `tidyverse` (and specifically `purrr`) packages are loaded for the last line with `map2` (which is from `purrr`), although I can't explain why the package isn't apparently loaded (you did `library(tidyverse)` at the beginning). – scrameri Sep 22 '21 at 20:45

1 Answer

To get the second part to run, I needed to use mapply, as in this example.
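As a minimal illustration of the call shape, here is mapply walking two equal-length vectors in parallel. `describe_pair()` is a hypothetical stand-in for `merged()` that only reports which files it received, and the file names are made-up examples.

```r
# Hypothetical stand-in for merged(): it only reports which files it received.
describe_pair <- function(m, y) {
  paste(m, "<->", y)
}

replicate_files <- c("072621Liver1_replicates.tsv", "072621Liver2_replicates.tsv")
cleaned_files   <- c("072621Liver1_cleaned.csv", "072621Liver2_cleaned.csv")

# mapply() recycles shorter arguments, so check the lengths up front.
stopifnot(length(replicate_files) == length(cleaned_files))

# The function comes first, then the vectors to walk through in parallel.
result <- mapply(describe_pair, replicate_files, cleaned_files,
                 SIMPLIFY = FALSE, USE.NAMES = FALSE)
print(result)
```

With purrr loaded, the equivalent call would be `map2(replicate_files, cleaned_files, describe_pair)`: the two vectors go first, unnamed, and the function is passed by name rather than called. That difference in call shape may be why the `map2(m = ..., y = ..., merged(m, y))` attempts in the question did not behave as intended.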

###import .tsv of PCR results, formatting, natural sort. Export cleaned file as .csv.
###import .tsv of Replicates file, split $Samples column in two and re-combine in Replicates.
###Export modified Replicates tibble to new .csv file.

###environment setup; change folder accordingly. Install tidyverse if needed.

setwd("C:/Users/asmit/Desktop/pratice_files")
#install.packages("tidyverse")

library(tidyverse)

###import .tsv of PCR results, formatting, natural sort. Export cleaned file as .csv.
singlet_files <- list.files(path = ".", pattern = "[^replicates]\\.tsv")

tibble_singlet <- function(x) { ###function to create tibble from singlet files
  cleanup_tibble <- as_tibble(read_tsv(x, col_names = TRUE, skip = 1))
}

singlet_cleanup <- function(x) { ##function to clean singlet files
  new_file <- str_replace(x, "(.*).tsv", "\\1_cleaned.csv")
  tibble_singlet(x) %>%
    select("Pos", "Name", "Cp", "Concentration") %>%
    .[str_order(.$Pos, numeric = TRUE),] %>%
    write_csv(file = new_file)
}

lapply(singlet_files, singlet_cleanup) ##run (singlet_cleanup) on files in singlet_files
#> Rows: 96 Columns: 8
#> -- Column specification --------------------------------------------------------
#> Delimiter: "\t"
#> chr (3): Pos, Name, Status
#> dbl (4): Color, Cp, Concentration, Standard
#> lgl (1): Include
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 96 Columns: 8
#> -- Column specification --------------------------------------------------------
#> Delimiter: "\t"
#> chr (3): Pos, Name, Status
#> dbl (4): Color, Cp, Concentration, Standard
#> lgl (1): Include
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 96 Columns: 8
#> -- Column specification --------------------------------------------------------
#> Delimiter: "\t"
#> chr (3): Pos, Name, Status
#> dbl (4): Color, Cp, Concentration, Standard
#> lgl (1): Include
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> [[1]]
#> # A tibble: 96 x 4
#>    Pos   Name     Cp Concentration
#>    <chr> <chr> <dbl>         <dbl>
#>  1 A1    1E6    17.2    894000    
#>  2 A2    1E6    17.2    877000    
#>  3 A3    23     NA          NA    
#>  4 A4    23     NA          NA    
#>  5 A5    79     35.1         8.73 
#>  6 A6    79     36.2         4.26 
#>  7 A7    144    35.7         6.09 
#>  8 A8    144    36.7         3.19 
#>  9 A9    229    39.2         0.633
#> 10 A10   229    37.7         1.64 
#> # ... with 86 more rows
#> 
#> [[2]]
#> # A tibble: 96 x 4
#>    Pos   Name     Cp Concentration
#>    <chr> <chr> <dbl>         <dbl>
#>  1 A1    1E6    19.1     769000   
#>  2 A2    1E6    18.9     906000   
#>  3 A3    319    33.5        103   
#>  4 A4    319    33.8         86.3 
#>  5 A5    370    35.8         23.4 
#>  6 A6    370    40            1.79
#>  7 A7    415    35.6         27.2 
#>  8 A8    415    36.8         13   
#>  9 A9    486    34.5         55.3 
#> 10 A10   486    36.0         21.1 
#> # ... with 86 more rows
#> 
#> [[3]]
#> # A tibble: 96 x 4
#>    Pos   Name     Cp Concentration
#>    <chr> <chr> <dbl>         <dbl>
#>  1 A1    1E6    18.2     568000   
#>  2 A2    1E6    17.0    1210000   
#>  3 A3    23     35.7         12.3 
#>  4 A4    23     35.9         10.9 
#>  5 A5    67     35.6         13.3 
#>  6 A6    67     35.5         14.5 
#>  7 A7    129    38.3          2.6 
#>  8 A8    129    NA           NA   
#>  9 A9    172    NA           NA   
#> 10 A10   172    37.3          4.69
#> # ... with 86 more rows
###import .tsv of Replicates file, split $Samples column in two and re-combine in Replicates.
singlet_cleaned <- list.files(path = ".", pattern = "[_cleaned]\\.csv")
matching_pair_files <- list.files(path = ".", pattern = "[replicates]\\.tsv")

cleaned_tibble <- function(y) { ##function to read cleaned .csv files as tibble
  Pos_tibble <- as_tibble(read_csv(y, col_names = TRUE)) 
}

match <- function(m){ ##function to make tibble of replicate file
  match_tibble <- as_tibble(read_tsv(m, col_names = TRUE, skip = 1))
}

merged <- function(m,y){ ##function to merge match tibble with specific column of cleaned_tibble tibble
  organ <- regmatches(m, regexpr("(Liver|Lung|Kidney|Spleen)", m))
  output_file <- str_replace(m, "(.*)_replicates.tsv", "\\1_final.csv")
  match(m) %>%
    mutate("R1" = gsub(x = .$Samples, pattern = "^(.*),.*", replacement = "\\1")) %>%
    mutate("R2" = gsub(x = .$Samples, pattern = ".*,\\s(.*)", replacement = "\\1")) %>%
    pivot_longer(cols = c("R1", "R2"), names_to ="Well Pairs", values_to = "Wells") %>%
    select("MeanCp", "STD Cp", "Mean conc", "STD conc", "Wells") %>%
    relocate("Wells", 1) %>%
    right_join((cleaned_tibble(y)), by = c("Wells"="Pos")) %>%
    .[str_order(.$Wells, numeric = TRUE),] %>%
    select("Name", "MeanCp", "STD Cp", "Mean conc", "STD conc") %>%
    distinct(Name, .keep_all = TRUE) %>%
    add_column(Organ = organ) %>%
    write_csv(file = output_file) ###Export modified Replicates tibble to new .csv file.
}

mapply(merged, matching_pair_files, singlet_cleaned, SIMPLIFY = FALSE)
#> Rows: 47 Columns: 5
#> -- Column specification --------------------------------------------------------
#> Delimiter: "\t"
#> chr (1): Samples
#> dbl (4): MeanCp, STD Cp, Mean conc, STD conc
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 96 Columns: 4
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr (2): Pos, Name
#> dbl (2): Cp, Concentration
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 46 Columns: 5
#> -- Column specification --------------------------------------------------------
#> Delimiter: "\t"
#> chr (1): Samples
#> dbl (4): MeanCp, STD Cp, Mean conc, STD conc
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 48 Columns: 6
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr (2): Name, Organ
#> dbl (4): MeanCp, STD Cp, Mean conc, STD conc
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Error: Join columns must be present in data.
#> x Problem with `Pos`.

I don't know what the final error message at the bottom is about, though. My files all have the output I expect, so I'm not going to worry about it for the time being.
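One possible explanation, offered as a guess from the log rather than a confirmed diagnosis: the last file read before the error has 6 columns including `Name` and `Organ`, which looks like a `_final.csv` from an earlier run rather than a `_cleaned.csv`. The `pattern` argument of `list.files()` is a regular expression, and `[_cleaned]` is a character class, so it can pick up those files too. The file names below are hypothetical.

```r
# Hypothetical folder listing after the script has already run once.
files <- c("072621Liver1.tsv", "072621Liver1_replicates.tsv",
           "072621Liver1_cleaned.csv", "072621Liver1_final.csv")

# "[_cleaned]" is a character class: one character out of _, c, l, e, a, n, d.
# "_final.csv" has an "l" before ".csv", so it matches as well.
loose <- grepl("[_cleaned]\\.csv", files)
print(files[loose])

# Spelling the suffix out and anchoring it with "$" keeps only the cleaned files.
strict <- grepl("_cleaned\\.csv$", files)
print(files[strict])
```

If this is the cause, using `"_cleaned\\.csv$"` and `"_replicates\\.tsv$"` as the `list.files()` patterns should keep previously written `_final.csv` files out of `singlet_cleaned`, so the join would no longer be handed a file without a `Pos` column.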

Created on 2021-09-22 by the reprex package (v2.0.1)

allisonrs