Generating summarized tables from lists (R)

Question

My problem comes from bioinformatics and genetics, but I think it may interest other programmers too.

As background: I have lists of mutations, one file per patient sample, about two hundred files in total. I want to combine these lists and then compare the mutations between patient groups.

All input files are in the following list format:

#Variants in patient A:
Variant1 0.5
Variant2 0.7

#Variants in patient  B:
Variant2 0.3
Variant3 0.6

#Variants in patient  C:
Variant4 0.5

My problem is that the files do not all contain the same variants: a variant may be unique and appear in only one file. I would like to summarize these files and generate the following output file:

           Patient A     Patient B      Patient C
Variant1   0.5           <NA>           <NA>
Variant2   0.7           0.3            <NA>
Variant3   <NA>          0.6            <NA>
Variant4   <NA>          <NA>           0.5

I am looking for tips on how to generate this kind of output file in R, the language I am most familiar with. Any example scripts would be highly appreciated!

Thank you for your help!

This may be useful: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example It is not clear what the inputs look like (what is inside the list? I see 2, or 4, elements in the list "patient A"). Please edit :) – PereG, Jan 10 '17 at 13:16
Each list always contains two columns: variant name (e.g. Variant1) and allele frequency (e.g. 0.5). The length of the lists varies; some contain ten variants while others are empty. – Jokhe, Jan 10 '17 at 13:29
Can you check `str(patientA)` and `class(patientA)`, where `patientA` is the name of the R object that holds one patient's data? – PereG, Jan 10 '17 at 13:38
@Jokhe If my answer solved your issue, could you please mark it as accepted? – Aurèle, Mar 29 '17 at 11:23

Aurèle · Answer 1 · 2017-01-10T14:13:03.253

2

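library(purrr)
library(stringr)
library(tidyr)
# Stand-ins for the contents of three input files
file1 <- "#Variants in patient A:
Variant1 0.5
Variant2 0.7"
file2 <- "#Variants in patient  B:
Variant2 0.3
Variant3 0.6"
file3 <- "#Variants in patient  C:
Variant4 0.5"
files <- paste0("file", 1:3)
files %>% 
  map(~ {
    # The first line holds the patient label, e.g. "patient A"
    patient <- str_extract(readLines(con = textConnection(get(.x)), n = 1L), pattern = "patient\\h+\\w+")
    # The remaining lines form a two-column table: variant name and value
    data <- read.table(file = textConnection(get(.x)), skip = 1L, stringsAsFactors = FALSE, col.names = c("variant", "value"))
    cbind(data, patient)
  }) %>% 
  # Stack the per-patient tables into one long table
  do.call(what = "rbind") %>% 
  # One column per patient; variants missing for a patient become NA
  spread(key = patient, value = value)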

edited Jan 10 '17 at 14:13

answered Jan 10 '17 at 13:44

Aurèle


Since you seem to favor `reshape2` (missed the tag at first) you can replace the last line with `reshape2::dcast(variant ~ patient)`. I personally prefer the more modern `tidyr` package. – Aurèle Jan 10 '17 at 14:03
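For the real case of one file per patient on disk, the same read, stack, and reshape steps can be sketched as follows. This is only an illustration under assumptions: the temporary directory, the file names, and the helper `read_variants()` are hypothetical, and `tidyr::pivot_wider()` (the newer replacement for `spread()`) is swapped in for the last step.

```r
library(tidyr)

# Hypothetical demo input: three small files written to a temporary directory.
demo_dir <- file.path(tempdir(), "variants_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("#Variants in patient A:", "Variant1 0.5", "Variant2 0.7"),
           file.path(demo_dir, "a.txt"))
writeLines(c("#Variants in patient  B:", "Variant2 0.3", "Variant3 0.6"),
           file.path(demo_dir, "b.txt"))
writeLines(c("#Variants in patient  C:", "Variant4 0.5"),
           file.path(demo_dir, "c.txt"))

# Hypothetical helper: read one file into a long table (variant, value, patient).
read_variants <- function(path) {
  lines <- readLines(path)
  # The header looks like "#Variants in patient  B:"; keep only the label.
  patient <- sub("^#.*patient\\s+(.*):\\s*$", "\\1", lines[1])
  body <- lines[-1]
  body <- body[nzchar(trimws(body))]  # drop blank lines
  if (length(body) == 0L) {
    # Empty list: return a zero-row table so rbind() still works.
    return(data.frame(variant = character(0), value = numeric(0),
                      patient = character(0), stringsAsFactors = FALSE))
  }
  data <- read.table(text = body, col.names = c("variant", "value"),
                     stringsAsFactors = FALSE)
  data$patient <- patient
  data
}

paths <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
long <- do.call(rbind, lapply(paths, read_variants))

# One row per variant, one column per patient; missing combinations are NA.
wide <- pivot_wider(long, names_from = patient, values_from = value)
wide <- wide[order(wide$variant), ]
print(wide)
```

Note that in this sketch a patient whose list is empty contributes no rows and therefore gets no column in the wide table.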

score -1 · Answer 2 · answered Jan 10 '17 at 14:16

Assuming your example data was a portion of one file, read in the file with readLines, identify the id lines with grep and then loop over these.

text <- readLines("myfile.txt")

patients <- grep("#", text)

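plyr::ldply(seq_along(patients), function(i) {
  # Header line of this patient's block, and the header line of the next block
  start <- patients[i]
  end <- c(patients[-1], length(text) + 1)[i]
  # Read only the rows between the two headers; blank lines come through as rows with NA
  x <- read.table("myfile.txt", skip = start, nrows = end - start - 1, comment.char = "", blank.lines.skip = FALSE)
  names(x) <- c("variant", "value")
  # Take the patient label from the header line, e.g. "A"
  x$patient <- gsub("^.*patient\\s+(.*):$", "\\1", text[start])
  x
})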

Filter out the rows with NA and then use tidyr::spread if you really want the data in the format shown above rather than in a tidy format.
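The filter-and-spread step might look like the sketch below. The data frame `long` is a hypothetical stand-in for the result of the loop above; it assumes that each blank line in the file came through as a row with an empty variant name and an NA value.

```r
library(tidyr)

# Hypothetical long-format result, including rows produced by blank lines.
long <- data.frame(
  variant = c("Variant1", "Variant2", "", "Variant2", "Variant3", "", "Variant4"),
  value   = c(0.5, 0.7, NA, 0.3, 0.6, NA, 0.5),
  patient = c("A", "A", "A", "B", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# Drop the rows that carry no measurement.
long <- long[!is.na(long$value), ]

# One row per variant, one column per patient; missing combinations are NA.
wide <- spread(long, key = patient, value = value)
print(wide)
```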
