I have 500 txt files all under the same folder. Each text file represents a patient and has a list of genes (miRNA genes in this example) and their corresponding expression values. I am only interested in the reads_per_million_miRNA_mapped for each corresponding miRNA_ID. Below is an example of three:
File name: 0a4af8c8.mirnas.quantification.txt
miRNA_ID read_count reads_per_million_miRNA_mapped cross.mapped
1 hsa-let-7a-1 39039 5576.681 N
2 hsa-let-7a-2 38985 5568.967 Y
3 hsa-let-7a-3 38773 5538.684 N
File name: 0a867fd6.mirnas.quantification.txt
miRNA_ID read_count reads_per_million_miRNA_mapped cross.mapped
1 hsa-let-7a-1 36634 11413.6842 N
2 hsa-let-7a-2 36608 11405.5837 N
3 hsa-let-7a-3 36006 11218.0246 N
File name: 0ac65c4b.mirnas.quantification.txt
miRNA_ID read_count reads_per_million_miRNA_mapped cross.mapped
1 hsa-let-7a-1 68376 14254.3693 N
2 hsa-let-7a-2 67965 14168.6880 Y
3 hsa-let-7a-3 67881 14151.1765 N
While each file has a unique name, the name does not tell me the patient's ID, and there is nothing in the file which directly tells me the patient's ID. To determine the patient's ID, I use a separate master CSV file which includes a row of all patient ID's and there corresponding file name for the txt files. This csv file has way to many columns for me to post an example row so I only have the two columns of interest listed below.
file_name patient_id
0a4af8c8.mirnas.quantification.txt TCGA-G9-6373-01A
0a867fd6.mirnas.quantification.txt TCGA-XJ-A9DX-01A
0ac65c4b.mirnas.quantification.txt TCGA-V1-A9OF-01A
My goal is to create a data frame of all combined txt files which has the gene expression data for all patients for all genes
miRNA_ID TCGA-G9-6373-01A TCGA-XJ-A9DX-01A TCGA-V1-A9OF-01A
hsa-let-7a-1 5576.681 11413.6842 14254.3693
hsa-let-7a-2 5568.967 11405.5837 14168.6880
hsa-let-7a-3 5538.684 11218.0246 14151.1765
I have figured out a way to do this by subsetting the file name and patient ID into a new data frame and then using a for loop to combine all the txt files and add on an additional column with the file name to get to each file. I then use the left_join function from the tidyverse package to combine the data frames. While this works, it is not resource efficient as I am creating extra data frames and columns which I do not need. I was wondering if anyone knows of a better approach which can do the same thing in one goal. For example by using a which function within the for loop that can be used to rename the Expression_value column as the patient ID by associating the file going through the loop with the patient ID from the same row in the separate master CSV file. Thanks in advance.
Here is the link to the previous method I used.