How to combine files and match them with their identifier from a separate file?

Question

I have 500 txt files all under the same folder. Each text file represents a patient and has a list of genes (miRNA genes in this example) and their corresponding expression values. I am only interested in the reads_per_million_miRNA_mapped for each corresponding miRNA_ID. Below is an example of three:

File name: 0a4af8c8.mirnas.quantification.txt

  miRNA_ID         read_count   reads_per_million_miRNA_mapped  cross.mapped
1 hsa-let-7a-1     39039        5576.681                        N
2 hsa-let-7a-2     38985        5568.967                        Y
3 hsa-let-7a-3     38773        5538.684                        N

File name: 0a867fd6.mirnas.quantification.txt

  miRNA_ID         read_count   reads_per_million_miRNA_mapped  cross.mapped
1 hsa-let-7a-1     36634        11413.6842                      N
2 hsa-let-7a-2     36608        11405.5837                      N
3 hsa-let-7a-3     36006        11218.0246                      N

File name: 0ac65c4b.mirnas.quantification.txt

  miRNA_ID         read_count   reads_per_million_miRNA_mapped  cross.mapped
1 hsa-let-7a-1     68376        14254.3693                      N
2 hsa-let-7a-2     67965        14168.6880                      Y
3 hsa-let-7a-3     67881        14151.1765                      N

While each file has a unique name, the name does not tell me the patient's ID, and nothing inside the file directly tells me the patient's ID either. To determine the patient's ID, I use a separate master CSV file which includes a row for every patient ID and its corresponding txt file name. This CSV file has far too many columns for me to post an example row, so I have listed only the two columns of interest below.

file_name                            patient_id
0a4af8c8.mirnas.quantification.txt   TCGA-G9-6373-01A
0a867fd6.mirnas.quantification.txt   TCGA-XJ-A9DX-01A
0ac65c4b.mirnas.quantification.txt   TCGA-V1-A9OF-01A

My goal is to create a data frame of all the combined txt files, holding the gene expression data for all patients and all genes:

miRNA_ID       TCGA-G9-6373-01A   TCGA-XJ-A9DX-01A   TCGA-V1-A9OF-01A
hsa-let-7a-1   5576.681           11413.6842         14254.3693
hsa-let-7a-2   5568.967           11405.5837         14168.6880
hsa-let-7a-3   5538.684           11218.0246         14151.1765

I have figured out a way to do this by subsetting the file name and patient ID into a new data frame, then using a for loop to combine all the txt files while adding an extra column with the file name so that each row can be traced back to its file. I then use the left_join function from dplyr (part of the tidyverse) to combine the data frames. While this works, it is not resource efficient, as I am creating extra data frames and columns which I do not need. I was wondering if anyone knows of a better approach that can do the same thing in one go. For example, a which function within the for loop could rename the Expression_value column as the patient ID by associating the file going through the loop with the patient ID from the same row of the separate master CSV file. Thanks in advance.

Here is the link to the previous method I used.

How to create a data frame in R where I have to associate different txt files with a sample ID from a separate file?

I think you will get **much** better and faster answers if you make a copy/pasteable reproducible example. [See here for lots of tips for creating reproducible examples in R](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Providing an example will also allow answerers to benchmark and compare results. [As a baseline, you might want to link to your existing solution (I assume it is the one provided here)](https://stackoverflow.com/q/47639080/903061). — Gregor Thomas, Dec 12 '17 at 23:40
For example, I have an idea of how I would approach the problem. But creating sample data to test and debug the solution sounds boring, so I won't attempt to answer unless there's sample data I can copy/paste into my R session. — Gregor Thomas, Dec 12 '17 at 23:43
Thanks for your comments Gregor. I am a bench scientist only beginning to build my bioinformatics background. I have revised the question. Hopefully this is more sufficient. — Austin McDermott, Dec 13 '17 at 04:31

score 0 · Answer 1 · answered Dec 13 '17 at 00:15

Without your actual data it is very challenging to attempt to answer this, so hopefully this will be a useful design pattern. You will need two things:

1) An identifying pattern that you can construct based on the file name and merge with the master
2) All of the files in the working directory

Here is what I would recommend:

library(data.table)
library(magrittr)
library(stringr)

setwd("path/to/directory")

# Probably implement some kind of regex on the file name
# to extract the patient name
read_file <- function(file_name){
  fread(file_name) %>% 
    .[,patient_name := str_replace_all(file_name,"regex_string","")]
}

all_files <- list.files(pattern = "file_pattern")

master <- fread("path/to/master")

combined_files <- lapply(all_files, read_file) %>% 
  rbindlist %>% 
  merge(master, by = "patient_name")

Essentially this sets the working directory to where your files are, implements a parser which grabs the patient name to match to the master, applies that parser to all the files, combines them to a single data frame with the identifying observation, and then merges them with the master. Hopefully it helps!
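
If the master table already has a file_name column, as in the question, the regex step may not be needed: the file name itself can serve as the lookup key. Below is a minimal base R sketch of that lookup, not the data.table code above. The `master` data frame here is typed in by hand from the question's example rows, and `lookup_patient` is a hypothetical helper name; in practice `master` would be read from the real CSV.

```r
# Stand-in for the master CSV, using the three rows shown in the question.
# (Assumption: the real file would be loaded with read.csv() or fread().)
master <- data.frame(
  file_name  = c("0a4af8c8.mirnas.quantification.txt",
                 "0a867fd6.mirnas.quantification.txt",
                 "0ac65c4b.mirnas.quantification.txt"),
  patient_id = c("TCGA-G9-6373-01A", "TCGA-XJ-A9DX-01A", "TCGA-V1-A9OF-01A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map file paths to patient IDs via the master table.
lookup_patient <- function(paths, key) {
  # basename() drops any folder prefix so the name matches key$file_name
  idx <- match(basename(paths), key$file_name)
  if (anyNA(idx)) {
    stop("No patient_id for: ",
         paste(basename(paths)[is.na(idx)], collapse = ", "))
  }
  key$patient_id[idx]
}

# The order of the result follows the order of the input paths,
# not the row order of the master table.
ids <- lookup_patient(
  c("data/0ac65c4b.mirnas.quantification.txt",
    "data/0a4af8c8.mirnas.quantification.txt"),
  master
)
print(ids)
```

Stopping on an unmatched file is a design choice; returning NA and filtering later would also work, but a silent NA column name is easy to miss with 500 files.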

With this code, I get the error: Error in eval(lhs, parent.frame(), parent.frame()) : object 'cases' not found. I have updated the question to be more specific. Also, I am not trying to merge with the master. I am trying to match all the files to their patient name in the master, and then merge all the files together. — Austin McDermott, Dec 14 '17 at 01:05

score 0 · Answer 2 · answered Dec 16 '17 at 00:06

This should work. You'll need to customize the input_folder (or set your working directory there and delete the references to it in my code). I'm calling the data frame with the patient IDs and file names filekey.

library(data.table)

input_folder = "path/to/folder/"
cols_to_keep = c("miRNA_ID", "reads_per_million_miRNA_mapped")
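# Build each full path from the folder and the file names stored in filekey,
# reading only the two columns of interest
files = lapply(paste0(input_folder, filekey$file_name), fread, select = cols_to_keep)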

names(files) = filekey$patient_id
# idcol = TRUE adds an .id column holding each list element's name (the patient ID)
long = rbindlist(files, idcol = TRUE)
result = dcast(long, miRNA_ID ~ .id, value.var = "reads_per_million_miRNA_mapped")
result
#        miRNA_ID TCGA-G9-6373-01A TCGA-V1-A9OF-01A TCGA-XJ-A9DX-01A
# 1: hsa-let-7a-1         5576.681         14254.37         11413.68
# 2: hsa-let-7a-2         5568.967         14168.69         11405.58
# 3: hsa-let-7a-3         5538.684         14151.18         11218.02
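
For anyone wanting a copy/pasteable test case, here is a minimal sketch that writes the question's three sample files to a temporary folder and builds the same wide table. It uses base R only rather than data.table, so it is a different technique from the code above, and it assumes tab-separated files that all list the same miRNAs. The names `demo_dir` and `read_rpm` are made up for the example, and the cross.mapped column is left out of the demo files for brevity.

```r
# Sample values copied from the three example files in the question.
demo_dir <- file.path(tempdir(), "mirna_demo")
dir.create(demo_dir, showWarnings = FALSE)

filekey <- data.frame(
  file_name  = c("0a4af8c8.mirnas.quantification.txt",
                 "0a867fd6.mirnas.quantification.txt",
                 "0ac65c4b.mirnas.quantification.txt"),
  patient_id = c("TCGA-G9-6373-01A", "TCGA-XJ-A9DX-01A", "TCGA-V1-A9OF-01A"),
  stringsAsFactors = FALSE
)
mirna  <- c("hsa-let-7a-1", "hsa-let-7a-2", "hsa-let-7a-3")
counts <- list(c(39039, 38985, 38773),
               c(36634, 36608, 36006),
               c(68376, 67965, 67881))
rpm    <- list(c(5576.681, 5568.967, 5538.684),
               c(11413.6842, 11405.5837, 11218.0246),
               c(14254.3693, 14168.6880, 14151.1765))

# Write one tab-separated file per patient (assumed format).
for (i in seq_len(nrow(filekey))) {
  write.table(
    data.frame(miRNA_ID = mirna,
               read_count = counts[[i]],
               reads_per_million_miRNA_mapped = rpm[[i]]),
    file = file.path(demo_dir, filekey$file_name[i]),
    sep = "\t", quote = FALSE, row.names = FALSE
  )
}

# Hypothetical helper: read one file and return its RPM values
# as a vector named by miRNA_ID.
read_rpm <- function(f) {
  d <- read.delim(file.path(demo_dir, f), stringsAsFactors = FALSE)
  setNames(d$reads_per_million_miRNA_mapped, d$miRNA_ID)
}
per_patient <- lapply(filekey$file_name, read_rpm)

# Align every patient on the first file's miRNA order, then bind as columns.
ids <- names(per_patient[[1]])
mat <- sapply(per_patient, function(x) unname(x[ids]))
colnames(mat) <- filekey$patient_id
wide <- data.frame(miRNA_ID = ids, mat,
                   check.names = FALSE, stringsAsFactors = FALSE)
print(wide)
```

Here the columns come out in filekey order, whereas dcast sorts them alphabetically by patient ID. Indexing by name (`x[ids]`) means a miRNA missing from one file would show up as NA instead of shifting the rows.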
