-1

I am asking this question on behalf of the many biologists and bioinformatics researchers who find it difficult to construct a matrix from their gene expression data. I tried googling for answers and was surprised that few of them address this problem in particular. I have asked the same question in the past, but I could not get the answer to work on my data. Here is the typical problem:

There are several files, each with one row per gene_id and columns holding a score and other meta-information. For example, sample1 will typically have 200000 rows:

gene_id score metainfo1 metainfo2
gene1   20  constitutive donor
gene2   30  alternative  acceptor 

For downstream analysis, biologists usually want to build a matrix: first collect all the gene_ids from all files and place them in column 1, then append the score from each file for each gene_id, adding a '0' where the score is not available. The column name for each score should be the file name (the meta-information is optional, but it is sometimes required). Something like this:

gene_id score_sample1 score_sample2....score_samplen metainfo1 metainfo2

If anyone can contribute a step-by-step procedure using Python that can be applied dynamically, it will be of great help to biologists with limited programming knowledge.

unique_id col1 col2 col3 score col5 col6 col7 col8 col9 col10 col11 col12 col13 col14 

I have 20 files with the layout above (the col columns are meta-information) and need to make a matrix with just:

unique_id(from all files) score col3 col4 col7 col9 col14

Thanks.

bli
  • 2
    Welcome to stackoverflow! Please have a look at https://stackoverflow.com/help/how-to-ask Unfortunately I guess that your question is too broad to get an answer here. – Maximilian Peters Apr 18 '17 at 21:46
  • Hi @Maximilian thanks for your reply, I am not sure where else to go if this does not get answered here, please see this question, http://stackoverflow.com/questions/40690081/create-matrix-using-python I tried doing this on my data and it does not seem to work. Thank you – novicebioinforesearcher Apr 18 '17 at 21:58
  • could you please help me rephrase the question thanks – novicebioinforesearcher Apr 18 '17 at 23:19
  • 1
    Try biostars for this https://www.biostars.org/ – Chris_Rands Apr 19 '17 at 08:45
  • 1
    It's not entirely clear what you are trying to do. My best guess is you want to merge multiple files. Use Pandas. Google it. Read each file into a Pandas DataFrame. Join/merge the Dataframes on the common data element which should be gene id. – Steve Apr 19 '17 at 12:28
  • @Chris_Rands thanks for the tip will keep in mind next time, stack overflow has a bioinformatics tag, thats what made me post it here, not sure why was this given a negative vote. – novicebioinforesearcher Apr 19 '17 at 15:57
  • @novicebioinforesearcher I think negative votes may be due to your question not being very clear (I didn't downvote). I think you could improve your question by including a mini-example input (at least two files) and the full expected output corresponding to this example input, not just things that look like "abstract" header lines. – bli Apr 21 '17 at 13:35
  • @bli will make a note of it next time I post – novicebioinforesearcher Apr 21 '17 at 14:19
  • @novicebioinforesearcher You can still edit the current question to make it clearer. Based on my attempt at answering (initial example files and final table), does it seem that I understood what you want to achieve? – bli Apr 21 '17 at 15:05

1 Answer

2

Suppose we have these two files:

$ cat sample1.txt 
gene_id score   metainfo1   metainfo2
gene1   20  constitutive    donor
gene2   30  alternative acceptor
$ cat sample2.txt 
gene_id score   metainfo1   metainfo2
gene1   20  constitutive    donor
gene3   30  alternative acceptor

You can read the data using pandas dataframes.

import pandas as pd
sample1 = pd.read_table("sample1.txt", index_col=0)["score"]
sample2 = pd.read_table("sample2.txt", index_col=0)["score"]

Merge it "horizontally" (axis=1) and change missing values to 0:

concatenated = pd.concat([sample1, sample2], axis=1).fillna(0)

Set new column names:

concatenated.columns = ["score_sample1", "score_sample2"]

Now we can extract the meta-information (all lines, last two columns):

meta1 = pd.read_table("sample1.txt", index_col=0).iloc[:,-2:]
meta2 = pd.read_table("sample2.txt", index_col=0).iloc[:,-2:]

Merge it "vertically" (default "axis" parameter is 0):

meta = pd.concat([meta1, meta2])

Remove duplicate lines (https://stackoverflow.com/a/34297689/1878788)

meta = meta[~meta.index.duplicated(keep="first")]
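To see what this filter does in isolation, here is a minimal sketch on a toy table (the three rows are made up for illustration and mimic gene1 showing up in two stacked files). `Index.duplicated(keep="first")` marks every repeat of a label after its first occurrence, so negating the mask keeps one row per gene_id:

```python
import pandas as pd

# Toy metadata in which gene1 appears twice, as it would after
# stacking the metadata of two files (illustrative values only).
meta = pd.DataFrame(
    {
        "metainfo1": ["constitutive", "alternative", "constitutive"],
        "metainfo2": ["donor", "acceptor", "donor"],
    },
    index=pd.Index(["gene1", "gene2", "gene1"], name="gene_id"),
)

# True for every index label that has already been seen earlier.
is_repeat = meta.index.duplicated(keep="first")

# Keep only the first row for each gene_id.
deduplicated = meta[~is_repeat]
```

Note that this assumes the meta-information for a given gene_id is the same in every file; if it differs, only the first version seen is kept.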

Concatenate it "horizontally" to the scores:

concatenated = pd.concat([concatenated, meta], axis=1)

And we obtain this:

         score_sample1  score_sample2     metainfo1 metainfo2
gene_id                                                      
gene1             20.0           20.0  constitutive     donor
gene2             30.0            0.0   alternative  acceptor
gene3              0.0           30.0   alternative  acceptor
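The scores come out as floats (20.0) because the missing entries are NaN before `fillna(0)`, and NaN forces a float column. If integer scores are preferred, a sketch along these lines should work. It assumes every score is a whole number, uses in-memory strings as stand-ins for the two files, and names the columns through the `keys` argument of `pd.concat` rather than assigning `columns` afterwards (`pd.read_csv` with `sep="\t"` is used here in place of `pd.read_table`):

```python
import io

import pandas as pd

# In-memory stand-ins for sample1.txt and sample2.txt (tab-separated).
files = {
    "sample1": "gene_id\tscore\tmetainfo1\tmetainfo2\n"
               "gene1\t20\tconstitutive\tdonor\n"
               "gene2\t30\talternative\tacceptor\n",
    "sample2": "gene_id\tscore\tmetainfo1\tmetainfo2\n"
               "gene1\t20\tconstitutive\tdonor\n"
               "gene3\t30\talternative\tacceptor\n",
}
tables = [pd.read_csv(io.StringIO(text), sep="\t", index_col=0)
          for text in files.values()]
names = ["score_" + name for name in files]

# One score column per sample, aligned on gene_id; "keys" names the columns.
scores = pd.concat([t["score"] for t in tables], axis=1, keys=names)

# Missing scores become 0, then the float columns are cast back to int.
scores = scores.fillna(0).astype(int).sort_index()

# Stack the metadata, keep one row per gene, and attach it to the scores.
meta = pd.concat([t[["metainfo1", "metainfo2"]] for t in tables])
meta = meta[~meta.index.duplicated(keep="first")]
matrix = scores.join(meta)
```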

Addendum (24/08/2017): With more files

Suppose you actually have 20 sample*.txt files.

You can probably generalize the above method by generating lists of DataFrames as follows:

import pandas as pd
filenames = ["sample%d.txt" % n for n in range(1, 21)]
samples = [pd.read_table(filename, index_col=0)["score"] for filename in filenames]
concatenated = pd.concat(samples, axis=1).fillna(0)
concatenated.columns = ["score_sample%d" % n for n in range(1, 21)]
metas = [pd.read_table(filename, index_col=0).iloc[:,-2:] for filename in filenames]
meta = pd.concat(metas)
meta = meta[~meta.index.duplicated(keep="first")]
concatenated = pd.concat([concatenated, meta], axis=1)
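
If the number of files is not known in advance, or the meta-information columns have to be picked by name (as in the second layout in the question), the same steps can be wrapped in a function. This is only a sketch: `build_matrix` and its parameters `id_col`, `score_col` and `meta_cols` are hypothetical names, and the demo writes two small files to a temporary folder so that the block runs on its own.

```python
import glob
import os
import tempfile

import pandas as pd


def build_matrix(paths, id_col="gene_id", score_col="score", meta_cols=()):
    """Combine per-sample tables into one table of scores plus metadata.

    Hypothetical helper: the function and parameter names are illustrative.
    """
    score_columns = []
    meta_tables = []
    for path in paths:
        table = pd.read_csv(path, sep="\t", index_col=id_col)
        # Name the score column after the file, e.g. "score_sample1".
        sample = os.path.splitext(os.path.basename(path))[0]
        score_columns.append(table[score_col].rename("score_" + sample))
        meta_tables.append(table[list(meta_cols)])
    # Align on the id column; ids missing from a sample get a score of 0.
    scores = pd.concat(score_columns, axis=1).fillna(0)
    # Stack the metadata and keep the first row seen for each id.
    meta = pd.concat(meta_tables)
    meta = meta[~meta.index.duplicated(keep="first")]
    return scores.join(meta)


# Demo on two small files written to a temporary folder.
header = "gene_id\tscore\tmetainfo1\tmetainfo2\n"
contents = {
    "sample1.txt": header + "gene1\t20\tconstitutive\tdonor\n"
                            "gene2\t30\talternative\tacceptor\n",
    "sample2.txt": header + "gene1\t20\tconstitutive\tdonor\n"
                            "gene3\t30\talternative\tacceptor\n",
}
with tempfile.TemporaryDirectory() as folder:
    for name, text in contents.items():
        with open(os.path.join(folder, name), "w") as handle:
            handle.write(text)
    # glob finds every matching file, so the count need not be hard-coded.
    paths = sorted(glob.glob(os.path.join(folder, "sample*.txt")))
    matrix = build_matrix(paths, meta_cols=["metainfo1", "metainfo2"])
```

For the second layout in the question, the call would look something like `build_matrix(paths, id_col="unique_id", meta_cols=["col3", "col7", "col9", "col14"])`, assuming those are the real column names in the files. Note that `sorted` orders file names as text, so sample10.txt comes before sample2.txt.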
bli