I have 2 data sets I am working with in R. In the first, I have multiple txt files with expression values for different genes. Each file has the same column and row names.

gene_ID   expression_value
gene_1    expression_value_1
...       ...

In the second, I have a master chart (csv file) which associates the name of each txt file with a patient ID.

name_txt_file    patient_ID
txt_file_1       patient_1

I am trying to create a master file with the gene expression values for all patients for each gene.

patient_ID      gene_1                 gene_2   ...
patient_1       expression_value_1     expression_value_2
patient_2       expression_value_x     expression_value_y

So far I have created an empty data frame with the correct column and row names, but I do not know how to associate the name of each txt file with a patient ID from the master chart (csv file) and fill in the expression values for this empty data frame. I am assuming some sort of for loop function could be used, but do not know how to write functions which will associate the data in a file with a patient ID based on the file's name. Any help would be greatly appreciated.

  • I'm confused that you say you have 2 data sets, but multiple txt files. Aren't those data sets? In the end, is there one row per patient, and as many columns as there were rows in that patient's corresponding text file? – Gregor Thomas Dec 04 '17 at 17:54
  • Yes, the multiple txt files are each a data set for each patient. In the end, there is one row per patient, the columns are each gene that was measured, and the values inputted are expression values. The txt files consist of the same columns of information such as gene ID and gene expression value. The problem I have is that each patient has its own txt file and all the gene IDs in these txt files are the same so I have to associate each txt file based off the file's name to a patient ID via a separate file (which has patient IDs and txt file names) before combining all the data together. – Austin McDermott Dec 04 '17 at 22:18

1 Answer

Make sure your .txt files are readable in R (I prefer csv).

Then I use code like this:

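# List the .txt files; `pattern` is a regular expression, not a glob
df.files <- data.frame( filename = list.files( path = "./data", pattern = "\\.txt$" ),
                        stringsAsFactors = FALSE )
df.files["filepath"] <- file.path( getwd(), "data", df.files$filename )

# Empty data frame that accumulates the rows of every file
df1 <- data.frame( gene_ID = character(0), 
                   expression_value = character(0) )

for ( f in df.files$filepath ) {
  # read.csv2 expects ";"-separated columns; adjust the reader to your files
  df.temp <- read.csv2(f)
  filename <- basename(f)
  # Drop the ".txt" extension so the name matches the lookup table below
  df.temp["filename"] <- strtrim( filename, nchar( filename ) - 4 )
  df1 <- rbind( df1, df.temp )
}

# Lookup table from file name to patient ID (typed by hand here;
# it could equally be read from the master csv)
df2 <- data.frame( filename = c( "text_1", "text_2" ), 
                   patient_ID = c( "patient_1", "patient_2" ), 
                   stringsAsFactors = FALSE )

library(tidyverse)
# Attach the patient ID, then spread to one column per gene
df.total <- df1 %>%
  left_join( df2, by = "filename" ) %>%
  spread( gene_ID, expression_value ) %>%
  select( -filename )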

This leads to a data frame with one row per patient and one column per gene.
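For illustration, here is a minimal base-R sketch of the same idea that reads the file-to-patient mapping from the master csv instead of typing it in. It rests on several assumptions that are not confirmed by the question: the txt files are tab-delimited with a header row, the master chart stores file names without the ".txt" extension, and the column names are `gene_ID`, `expression_value`, `name_txt_file` and `patient_ID`. The toy files and the `expr_demo` folder are made up so the example runs on its own.

```r
# Toy inputs in a temporary folder (hypothetical data, for demonstration only)
data_dir <- file.path(tempdir(), "expr_demo")
dir.create(data_dir, showWarnings = FALSE)

genes <- c("gene_1", "gene_2")
write.table(data.frame(gene_ID = genes, expression_value = c(1.5, 2.5)),
            file.path(data_dir, "txt_file_1.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(gene_ID = genes, expression_value = c(3.5, 4.5)),
            file.path(data_dir, "txt_file_2.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)

# Hypothetical master chart; column names follow the question
master_path <- file.path(data_dir, "master_chart.csv")
write.csv(data.frame(name_txt_file = c("txt_file_1", "txt_file_2"),
                     patient_ID    = c("patient_1", "patient_2")),
          master_path, row.names = FALSE)

master <- read.csv(master_path, stringsAsFactors = FALSE)

# Read every txt file and tag its rows with the file name (extension dropped)
files <- list.files(data_dir, pattern = "\\.txt$", full.names = TRUE)
long <- do.call(rbind, lapply(files, function(f) {
  d <- read.delim(f, stringsAsFactors = FALSE)  # assumes tab-delimited input
  d$name_txt_file <- tools::file_path_sans_ext(basename(f))
  d
}))

# Attach patient IDs, then pivot to one row per patient, one column per gene
long <- merge(long, master, by = "name_txt_file")
wide <- reshape(long[, c("patient_ID", "gene_ID", "expression_value")],
                idvar = "patient_ID", timevar = "gene_ID", direction = "wide")
names(wide) <- sub("^expression_value\\.", "", names(wide))
rownames(wide) <- NULL
```

Because the join is done on the file name, the order of the files on disk does not matter. If the master chart does include the ".txt" extension, skip the `file_path_sans_ext()` step and use `basename(f)` directly.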

Wimpel