My problem is actually related to bioinformatics and genetics but I see that it may be interesting for other programmers too.

As background, I have lists of mutations, one file per patient sample, which means I have about two hundred individual files. I want to combine these lists and then compare the mutations between different patient groups.

All input files are in the following list format:

#Variants in patient A:
Variant1 0.5
Variant2 0.7

#Variants in patient  B:
Variant2 0.3
Variant3 0.6

#Variants in patient  C:
Variant4 0.5

My problem is that not all files contain the same variants: a variant may be unique and appear in only one file. I would like to summarize these files and generate the following output file:

           Patient A     Patient B      Patient C
Variant1   0.5           <NA>           <NA>
Variant2   0.7           0.3            <NA>
Variant3   <NA>          0.6            <NA>
Variant4   <NA>          <NA>           0.5

What I am asking for is some tips on how to generate this kind of output file in R, which I am most familiar with. Any example scripts etc. would be highly appreciated!

Thank you for your help!

PereG
Jokhe
  • Useful: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example It is not clear what the inputs look like (what is inside the list? I see 2? 4? elements in list "patient A"). Please edit :) – PereG Jan 10 '17 at 13:16
  • Each list always contains two columns: variant name (e.g. Variant1) and allele frequency (e.g. 0.5). The length of the lists varies; some lists may contain ten variants while others are empty. – Jokhe Jan 10 '17 at 13:29
  • Can you check `str(patientA)` and `class(patientA)`, where `patientA` is the name of the object in R that contains the information of a patient? – PereG Jan 10 '17 at 13:38
  • @Jokhe If my answer solved your issue, could you please mark it as accepted? – Aurèle Mar 29 '17 at 11:23

2 Answers

2
library(purrr)
library(stringr)
library(tidyr)
# Stand-ins for the contents of three input files
file1 <- "#Variants in patient A:
Variant1 0.5
Variant2 0.7"
file2 <- "#Variants in patient  B:
Variant2 0.3
Variant3 0.6"
file3 <- "#Variants in patient  C:
Variant4 0.5"
files <- paste0("file", 1:3)
files %>% 
  map(~ {
    # Patient label taken from the header line, e.g. "patient A"
    patient <- str_extract(readLines(con = textConnection(get(.x)), n = 1L), pattern = "patient\\h+\\w+")
    # Variant rows; the header line is skipped
    data <- read.table(file = textConnection(get(.x)), skip = 1L, stringsAsFactors = FALSE, col.names = c("variant", "value"))
    cbind(data, patient)
  }) %>% 
  # Stack everything into one long data frame, then spread to one column per patient
  do.call(what = "rbind") %>% 
  spread(key = patient, value = value)
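
For files on disk rather than in-memory strings, the same idea can be sketched as below. This is only a sketch under assumptions not stated in the question: the directory `variant_dir`, the `.txt` extension and the helper name `read_variants` are hypothetical, and each file is assumed to start with a single `#Variants in patient X:` header line. The demo writes two small temporary files so that it can be run as is. Unlike the code above, it keeps only the part of the header after "patient" as the column label.

```r
library(tidyr)

# Hypothetical helper: read one file into long format (variant, value, patient).
read_variants <- function(path) {
  lines <- readLines(path)
  # Patient label from the header line, e.g. "#Variants in patient  B:" -> "B"
  patient <- sub("^#.*patient\\s+(.*):\\s*$", "\\1", lines[1])
  # Keep only non-blank lines after the header
  body <- lines[-1]
  body <- body[nzchar(trimws(body))]
  if (length(body) == 0) {
    # Empty variant list: return a zero-row data frame with the same columns
    return(data.frame(variant = character(0), value = numeric(0),
                      patient = character(0), stringsAsFactors = FALSE))
  }
  data <- read.table(text = body, stringsAsFactors = FALSE,
                     col.names = c("variant", "value"))
  data$patient <- patient
  data
}

# Demo input: two temporary files in the format shown in the question
variant_dir <- file.path(tempdir(), "variants_demo")
dir.create(variant_dir, showWarnings = FALSE)
writeLines(c("#Variants in patient A:", "Variant1 0.5", "Variant2 0.7"),
           file.path(variant_dir, "a.txt"))
writeLines(c("#Variants in patient  B:", "Variant2 0.3", "Variant3 0.6"),
           file.path(variant_dir, "b.txt"))

# Read every file, stack the results, then spread to one column per patient
paths <- list.files(variant_dir, pattern = "\\.txt$", full.names = TRUE)
long <- do.call(rbind, lapply(paths, read_variants))
wide <- spread(long, key = patient, value = value)
print(wide)
```

Variants missing from a patient's file end up as `NA` in that patient's column, which matches the layout requested in the question.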
Aurèle
  • Since you seem to favor `reshape2` (missed the tag at first) you can replace the last line with `reshape2::dcast(variant ~ patient)`. I personally prefer the more modern `tidyr` package. – Aurèle Jan 10 '17 at 14:03
-1

Assuming your example data was a portion of one file, read in the file with readLines, identify the id lines with grep and then loop over these.

text <- readLines("myfile.txt")

# Line numbers of the header lines ("#Variants in patient X:")
patients <- grep("#", text)

plyr::ldply(seq_along(patients), function(i) {
  start <- patients[i]
  # The block ends at the next header line, or just past the end of the file
  end <- c(patients[-1], length(text) + 1)[i]
  x <- read.table("myfile.txt", skip = start, nrows = end - start - 1, comment.char = "", blank.lines.skip = FALSE)
  names(x) <- c("variant", "value")
  # Patient label taken from the header line
  x$patient <- gsub("^.*patient\\s+(.*):$", "\\1", text[start])
  x
})

Filter out the rows with NA and then use tidyr::spread if you really want the data in the format you use above rather than a tidy format.
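
That last step might look like the following. It is a minimal sketch, assuming the result of the `ldply` call has been stored in a data frame; the name `long` and the example values are made up for illustration, including the row with missing entries that stands in for what a blank line might produce.

```r
library(tidyr)

# Illustrative stand-in for the stacked long-format result (hypothetical values)
long <- data.frame(
  variant = c("Variant1", "Variant2", NA, "Variant2", "Variant3"),
  value   = c(0.5, 0.7, NA, 0.3, 0.6),
  patient = c("A", "A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Drop rows without a value (the leftovers from blank lines)
long <- long[!is.na(long$value), ]

# One row per variant, one column per patient; absent combinations become NA
wide <- spread(long, key = patient, value = value)
print(wide)
```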

Richard Telford