
I am using R and want to automate this process, because I will be applying it to a large number of files. All the files have the same format (.vcf), and I want to convert each one to a small data.frame. So far I have used:

library(dplyr)
library(tidyr)

df <- read.table("DNA_rep1.vcf", sep="\t")
df$chromosome <- sub(pattern = "chr",replacement = "", df[,1])
df <- df %>% separate(V8,c("8.1", "8.2","8.3"), extra='drop')
df <- df[,c(13,2,9,6)]
df <- cbind(rep(29, nrow(df)), df)

Here the 29 ideally would be an argument of a function so I can change the sample ID when applying the function.

colnames(df) <- c("sample", "chromosome", "start", "end", "segVal")

It would also be good to append each generated data.frame to the previously curated data, for example prev.df <- rbind(prev.df, df), and so on for all the files.
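One way to get both of these is sketched below in base R rather than tidyr: wrap the steps in a function that takes the sample ID as an argument and returns the data.frame, then rbind each result onto prev.df. The function name vcf_to_df and the temporary demo file with its contents are hypothetical. The column positions assume a single-sample VCF whose INFO field (V8) begins with END=, which is what the column names above suggest.

```r
# Hypothetical helper: same steps as above, but the sample ID is an argument
vcf_to_df <- function(path, sample_id) {
  # read.table skips the "#" header lines because comment.char defaults to "#"
  vcf <- read.table(path, sep = "\t", stringsAsFactors = FALSE)
  # Split INFO (V8) on non-alphanumeric characters and keep the second piece,
  # mirroring separate(V8, c("8.1", "8.2", "8.3"), extra = "drop")
  info <- strsplit(vcf$V8, "[^[:alnum:]]+")
  end <- vapply(info, function(p) p[2], character(1))
  data.frame(
    sample = sample_id,                  # recycled to one value per row
    chromosome = sub("chr", "", vcf$V1), # drop the "chr" prefix
    start = vcf$V2,
    end = as.numeric(end),               # converted to numeric here
    segVal = vcf$V6,
    stringsAsFactors = FALSE
  )
}

# Made-up two-record VCF, only for demonstration
demo <- tempfile(fileext = ".vcf")
writeLines(c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
  "chr1\t1000\t.\tN\t<DEL>\t1\t.\tEND=5000;SVTYPE=DEL\tGT\t0/1",
  "chr2\t2000\t.\tN\t<DUP>\t3\t.\tEND=9000;SVTYPE=DUP\tGT\t0/1"
), demo)

# Start from an empty data.frame and append one file at a time
prev.df <- data.frame()
prev.df <- rbind(prev.df, vcf_to_df(demo, sample_id = 29))
prev.df <- rbind(prev.df, vcf_to_df(demo, sample_id = 30))
```

Returning the data.frame instead of writing it to disk keeps the sample ID and the combining step under the caller's control.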

QUESTION SOLVED

library(dplyr)
library(tidyr)
library(stringr)

#I created a function first

vcf2cnv.df <- function(x){
  a <- read.table(x, sep="\t")
  #Only show chromosome number not "chr"
  a$chromosome <- sub(pattern = "chr", replacement = "", a[,1])
  #Extract the end position
  a <- a %>% separate(V8,c("8.1", "8.2","8.3"), extra='drop')
  #Keep the columns I need
  a <- a %>% select(c(13,2,9,6))
  #Extract the number from the file to create an ID
  y <- as.numeric((str_extract_all(x, pattern = "[0-9]", simplify = TRUE))) 
  y <- paste(y, collapse = "")
  y <- as.numeric(gsub('.{1}$', '', y))
  a <- cbind(rep(y, nrow(a)), a)
  #Set column names
  colnames(a) <- c("sample", "chromosome", "start", "end", "segVal")
  #Save file in working directory (row.names = FALSE avoids an extra index column when the file is read back)
  write.csv(a, file = paste0(y, "_DNA_CopyNumberVariants.csv"), row.names = FALSE)
}
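To make the ID step easier to follow, here is a small base-R sketch of the same logic: collect every digit in the file name, drop the last one (the replicate number), and convert to a number. The helper name extract_id and the file names are hypothetical, since the real naming scheme is not shown; names of the form 29_DNA_rep1.vcf are assumed. Unlike the function above, it calls basename() first, so digits in directory names are ignored.

```r
# Hypothetical helper reproducing the ID logic above with base R only
extract_id <- function(path) {
  # basename() drops directories, so digits in folder names are ignored
  digits <- gsub("[^0-9]", "", basename(path))
  # Remove the last digit (the replicate number in "rep1") and convert
  as.numeric(sub(".$", "", digits))
}

extract_id("./29_DNA_rep1.vcf")           # 29
extract_id("/data/run7/29_DNA_rep1.vcf")  # 29, the "7" in the folder is ignored
extract_id("DNA_rep1.vcf")                # NA, no ID digits before the replicate
```

Note that this only works when the replicate number is a single digit; a name ending in rep10 would leave a stray digit in the ID.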

##Now let's run this function on all the files and combine them.

#Set Working Directory
setwd("/my/working/directory")

# Apply the function to all the files
# (pattern is a regular expression, not a glob, so match the extension explicitly)
file_vcf <- list.files(pattern = "\\.vcf$", full.names = TRUE)
lapply(file_vcf, vcf2cnv.df)

#Bind all the results in a single data.frame
file_csv <- list.files(pattern = "\\.csv$", full.names = TRUE)

for (file in file_csv){
  
  if (!exists("Colon_cnv")){
    # if the merged dataset doesn't exist, create it
    Colon_cnv <- read.csv(file, header=TRUE)
  } else {
    # if the merged dataset does exist, append to it
    # (using else prevents the first file from being read and appended twice)
    temp_dataset <- read.csv(file, header=TRUE)
    Colon_cnv <- rbind(Colon_cnv, temp_dataset)
    rm(temp_dataset)
  }
}
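A more compact alternative to the loop, as a rough sketch: read every CSV into a list with lapply() and bind the list in one call with do.call(rbind, ...). The temporary directory and the two small CSV files below are made up so the example runs on its own; with the real data, csv_files would simply be file_csv.

```r
# Made-up per-sample CSVs in a temporary directory, only for demonstration
tmp_dir <- tempfile("cnv_")
dir.create(tmp_dir)
for (id in c(29, 30)) {
  piece <- data.frame(
    sample = id,
    chromosome = c("1", "2"),
    start = c(1000, 2000),
    end = c(5000, 9000),
    segVal = c(1, 3)
  )
  write.csv(piece,
            file.path(tmp_dir, paste0(id, "_DNA_CopyNumberVariants.csv")),
            row.names = FALSE)
}

# Read all CSVs into a list of data.frames, then bind them in a single step
csv_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(csv_files, read.csv))
```

This avoids the exists() check entirely and does not depend on whether a variable with the same name is already in the workspace.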
Pexav01
  • FYI, inline code uses a *single* backtick, such as `` `data.frame` ``; multiple (or long) lines of code should really be in a code block, denoted with code "fences": three backticks *on their own line*, as in `` ``` ``; the first may have a language-hint on it, as in `` ```r ``, but that is not required. (Inline code using `` ``` `` and no newline is mostly treated like a single backtick, with rare exceptions.) See the edit I suggested to see how this looks and flows. For this and more formatting hints, see https://stackoverflow.com/editing-help and https://meta.stackexchange.com/a/22189. – r2evans Nov 02 '21 at 20:44
  • You might want to have a look at my answer to [How do I make a list of data frames?](https://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames) It's a good starting place for this problem. – Gregor Thomas Nov 02 '21 at 20:46
  • `VCF` is a common format in bioinformatics; it is also complex enough that you should not try to parse it yourself. Consider using a package that implements VCF parsing. See this previous question: https://stackoverflow.com/questions/21598212/extract-sample-data-from-vcf-files – Colombo Nov 02 '21 at 21:10
