I am asking this question on behalf of the many biologists and bioinformatics researchers who find it difficult to construct a matrix from their gene expression data. I tried searching for answers and was surprised that few of them address this problem in particular. I have asked the same question in the past, but it was not executable. Here is the typical problem.
There are several files, each with one row per gene_id and columns holding a score and other meta information. A single file (e.g. sample1) typically has 200000 rows:
gene_id score metainfo1 metainfo2
gene1 20 constitutive donor
gene2 30 alternative acceptor
Ideally, for downstream analysis, biologists want to build a matrix: first collect all the gene_ids from all files and place them in column 1, then append the score from each file for each gene_id, adding a '0' where a score is not available. Each score column should be named after the file it came from. The metainfo columns are optional (sometimes they are required, sometimes not). Something like this:
gene_id score_sample1 score_sample2....score_samplen metainfo1 metainfo2
If anyone can contribute a step-by-step procedure in Python that can be applied dynamically, it will be of great help to biologists with limited programming knowledge.
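A minimal sketch of the procedure described above, using only the Python standard library, might look like the following. It assumes the files are whitespace-delimited with a header row, that no field contains spaces, and that each gene_id appears at most once per file. The function names (`read_scores`, `build_matrix`, `matrix_from_files`) are hypothetical and chosen only for illustration.

```python
import csv
from pathlib import Path


def read_scores(lines, id_col="gene_id", score_col="score"):
    """Return {gene_id: score} from whitespace-delimited lines with a header row."""
    rows = (line.split() for line in lines if line.strip())
    header = next(rows)
    id_idx, score_idx = header.index(id_col), header.index(score_col)
    return {row[id_idx]: row[score_idx] for row in rows}


def build_matrix(samples, missing="0"):
    """samples maps sample name -> {gene_id: score}; returns rows, header first."""
    names = list(samples)
    # Union of the gene_ids seen in any file, sorted for a stable row order.
    gene_ids = sorted(set().union(*(s.keys() for s in samples.values())))
    header = ["gene_id"] + [f"score_{name}" for name in names]
    body = [
        [gene] + [samples[name].get(gene, missing) for name in names]
        for gene in gene_ids
    ]
    return [header] + body


def matrix_from_files(paths, out_path):
    """Read every file in paths and write the combined matrix as a TSV file."""
    samples = {}
    for path in map(Path, paths):
        with path.open() as handle:
            # Assumption: the file name without its extension is the sample name.
            samples[path.stem] = read_scores(handle)
    with open(out_path, "w", newline="") as out:
        csv.writer(out, delimiter="\t").writerows(build_matrix(samples))
```

Calling `matrix_from_files(["sample1.txt", "sample2.txt"], "matrix.tsv")` (hypothetical file names) would write one row per gene_id and one score column per file. Scores are kept as strings, so nothing is rounded or reformatted along the way.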
In my case, each file has these columns:
unique_id col1 col2 col3 score col5 col6 col7 col8 col9 col10 col11 col12 col13 col14
I have 20 files with this data and need to make a matrix (the col columns are metainfo) with just:
unique_id(from all files) score col3 col4 col7 col9 col14
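For this case, where selected metainfo columns must be carried along, a sketch under the same assumptions (whitespace-delimited files with a header row, no spaces inside fields) could look like this. It further assumes that the metainfo for a given unique_id is the same in every file, so the values are taken from the first file that contains that ID. The function names are hypothetical, and the metainfo columns are passed in by header name, so each name must exist in the header (col4 does not appear in the header as posted).

```python
def read_table(lines, id_col="unique_id"):
    """Return {unique_id: {column_name: value}} from whitespace-delimited lines."""
    rows = (line.split() for line in lines if line.strip())
    header = next(rows)
    id_idx = header.index(id_col)
    return {row[id_idx]: dict(zip(header, row)) for row in rows}


def build_matrix_with_meta(samples, meta_cols, id_col="unique_id",
                           score_col="score", missing="0"):
    """samples maps file name -> table from read_table; returns rows, header first."""
    names = list(samples)
    ids = sorted(set().union(*(table.keys() for table in samples.values())))
    header = [id_col] + [f"{score_col}_{name}" for name in names] + list(meta_cols)
    matrix = [header]
    for uid in ids:
        # One score per file, or the placeholder when the ID is absent from that file.
        scores = [
            samples[name][uid][score_col] if uid in samples[name] else missing
            for name in names
        ]
        # Assumption: metainfo is identical across files, so use the first match.
        first = next(samples[name][uid] for name in names if uid in samples[name])
        matrix.append([uid] + scores + [first[col] for col in meta_cols])
    return matrix
```

Each table would be built with something like `read_table(open(path))` for every one of the 20 files, and the resulting rows can be written out with `csv.writer(out, delimiter="\t").writerows(matrix)`.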
Thanks.