Creating a matrix using Python for biologists

Question

I am asking this question on behalf of the many biologists and bioinformatics researchers who find it difficult to construct a matrix from their gene expression data. I tried searching for answers and was surprised that few of them address this problem in particular. I have asked the same question in the past, but the result was not executable. Here is the typical problem.

There are several files, each with one row per gene_id and columns holding a score and other meta-information. For example, sample1 will typically have 200000 rows:

gene_id score metainfo1 metainfo2
gene1   20  constitutive donor
gene2   30  alternative  acceptor

For downstream analysis, biologists typically want to build a matrix: first collect all the gene_ids from all files and place them in column 1, then append the score from each file for each gene_id, adding a '0' where a score is not available. The score column for each file should be named after the file (the meta-information is optional, but is sometimes required). Something like this:

gene_id score_sample1 score_sample2....score_samplen metainfo1 metainfo2

If anyone can contribute a step-by-step procedure using Python that can be applied dynamically, it will be of great help to biologists with limited programming knowledge.

unique_id col1 col2 col3 score col5 col6 col7 col8 col9 col10 col11 col12 col13 col14

I have 20 files with this layout (the "col" columns are meta-information) and need to make a matrix with just:

unique_id(from all files) score col3 col4 col7 col9 col14

Thanks.

Welcome to stackoverflow! Please have a look at https://stackoverflow.com/help/how-to-ask Unfortunately I guess that your question is too broad to get an answer here. — Maximilian Peters, Apr 18 '17 at 21:46
Hi @Maximilian thanks for your reply, I am not sure where else to go if this does not get answered here, please see this question, http://stackoverflow.com/questions/40690081/create-matrix-using-python I tried doing this on my data and it does not seem to work. Thank you — novicebioinforesearcher, Apr 18 '17 at 21:58
It's not entirely clear what you are trying to do. My best guess is you want to merge multiple files. Use Pandas. Google it. Read each file into a Pandas DataFrame. Join/merge the Dataframes on the common data element which should be gene id. — Steve, Apr 19 '17 at 12:28
@Chris_Rands thanks for the tip, I will keep it in mind next time. Stack Overflow has a bioinformatics tag, which is what made me post here; I am not sure why this was given a negative vote. — novicebioinforesearcher, Apr 19 '17 at 15:57
@novicebioinforesearcher I think negative votes may be due to your question not being very clear (I didn't downvote). I think you could improve your question by including a mini-example input (at least two files) and the full expected output corresponding to this example input, not just things that look like "abstract" header lines. — bli, Apr 21 '17 at 13:35
@novicebioinforesearcher You can still edit the current question to make it clearer. Based on my attempt at answering (initial example files and final table), does it seem that I understood what you want to achieve? — bli, Apr 21 '17 at 15:05

bli · Accepted Answer · 2017-08-24T08:51:59.493

Suppose we have these two files:

$ cat sample1.txt 
gene_id score   metainfo1   metainfo2
gene1   20  constitutive    donor
gene2   30  alternative acceptor
$ cat sample2.txt 
gene_id score   metainfo1   metainfo2
gene1   20  constitutive    donor
gene3   30  alternative acceptor

You can read the data with pandas, keeping only the "score" column of each file (a Series indexed by gene_id):

import pandas as pd
sample1 = pd.read_table("sample1.txt", index_col=0)["score"]
sample2 = pd.read_table("sample2.txt", index_col=0)["score"]

Merge them "horizontally" (axis=1) and change missing values to 0:

concatenated = pd.concat([sample1, sample2], axis=1).fillna(0)

Set new column names:

concatenated.columns = ["score_sample1", "score_sample2"]

Now we can extract the meta-information (all lines, last two columns):

meta1 = pd.read_table("sample1.txt", index_col=0).iloc[:,-2:]
meta2 = pd.read_table("sample2.txt", index_col=0).iloc[:,-2:]

Merge them "vertically" (default "axis" parameter is 0):

meta = pd.concat([meta1, meta2])

Remove duplicate lines (https://stackoverflow.com/a/34297689/1878788)

meta = meta[~meta.index.duplicated(keep="first")]

Concatenate it "horizontally" to the scores:

concatenated = pd.concat([concatenated, meta], axis=1)

And we obtain this:

         score_sample1  score_sample2     metainfo1 metainfo2
gene_id                                                      
gene1             20.0           20.0  constitutive     donor
gene2             30.0            0.0   alternative  acceptor
gene3              0.0           30.0   alternative  acceptor
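The scores are displayed as floats (20.0 rather than 20) because the missing entries are NaN until `fillna(0)` replaces them, and NaN forces a floating-point column. If whole-number scores are wanted, a cast can be added after filling. A minimal sketch, using two small in-memory Series as stand-ins for the score columns read from the files (the variable `scores` is only illustrative):

```python
import pandas as pd

# Stand-ins for the "score" columns of sample1.txt and sample2.txt
sample1 = pd.Series([20, 30], index=["gene1", "gene2"], name="score")
sample2 = pd.Series([20, 30], index=["gene1", "gene3"], name="score")

# Aligning on gene_id introduces NaN where a gene is absent from a file,
# so both columns become float even after fillna(0)
scores = pd.concat([sample1, sample2], axis=1).fillna(0)
scores.columns = ["score_sample1", "score_sample2"]

# Cast back to integers once no missing values remain
scores = scores.astype(int)
print(scores)
```

This only makes sense when every score really is a whole number; with fractional scores the cast would truncate them.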

Addendum (24/08/2017): With more files

Suppose you actually have 20 sample*.txt files.

You can probably generalize the above method by generating lists of DataFrames as follows:

import pandas as pd
# File names sample1.txt ... sample20.txt
filenames = ["sample%d.txt" % n for n in range(1, 21)]
# One "score" Series per file, indexed by gene_id
samples = [pd.read_table(filename, index_col=0)["score"] for filename in filenames]
# Align the scores on gene_id; genes missing from a file get 0
concatenated = pd.concat(samples, axis=1).fillna(0)
concatenated.columns = ["score_sample%d" % n for n in range(1, 21)]
# Meta-information (last two columns) from every file, one row per gene_id
metas = [pd.read_table(filename, index_col=0).iloc[:,-2:] for filename in filenames]
meta = pd.concat(metas)
meta = meta[~meta.index.duplicated(keep="first")]
concatenated = pd.concat([concatenated, meta], axis=1)
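The same steps could also be wrapped in a function, so that the file list is not hard-coded and each score column is named after its file, as the question asks. This is a minimal sketch, not a definitive implementation: the name `build_score_matrix` and its parameters are hypothetical (not part of pandas), the files are assumed to be tab-separated with gene_id in the first column, and keeping only the first row for a repeated gene_id is an assumption that may not suit every data set.

```python
from pathlib import Path

import pandas as pd


def build_score_matrix(paths, score_col="score",
                       meta_cols=("metainfo1", "metainfo2")):
    """Combine per-sample tables into one gene_id x sample score matrix.

    Hypothetical helper: `paths` is an iterable of tab-separated files whose
    first column is the gene_id.
    """
    names, scores, metas = [], [], []
    for path in paths:
        table = pd.read_csv(path, sep="\t", index_col=0)
        # Assumption: keep the first row when a gene_id is repeated in a file
        table = table[~table.index.duplicated(keep="first")]
        # Name the score column after the file, e.g. "score_sample1"
        names.append("score_" + Path(path).stem)
        scores.append(table[score_col])
        metas.append(table[list(meta_cols)])

    # Align scores on gene_id; genes absent from a file get 0
    matrix = pd.concat(scores, axis=1, keys=names).fillna(0)

    # Stack the meta-information and keep one row per gene_id
    meta = pd.concat(metas)
    meta = meta[~meta.index.duplicated(keep="first")]

    return pd.concat([matrix, meta], axis=1)
```

It might be called as `build_score_matrix(sorted(glob.glob("sample*.txt")))` after `import glob`. Note that plain sorting is lexicographic, so sample10.txt would come before sample2.txt. For the layout with col1 to col14, `meta_cols` could be set to the column names to keep.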

`ValueError: cannot reindex from a duplicate axis` is the error that I get — novicebioinforesearcher, Apr 19 '17 at 16:42
@novicebioinforesearcher For me the exact code I posted works with the data I showed. I'm using python3, but I think this shouldn't make a difference. At what step does the problem occur? — bli, Apr 20 '17 at 07:59
I have edited the question, as I am not sure how to add it to a comment — novicebioinforesearcher, Apr 20 '17 at 14:47
`pd.concat` takes a list of `DataFrame`s, so it can be `pd.concat([sample1, sample2, sample3, ...], axis=1)` and `pd.concat([meta1, meta2, meta3, ...])`. — bli, Aug 24 '17 at 08:35
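One plausible cause of the `ValueError` reported in the comments is a gene_id that appears more than once within a single file, since the index then contains repeated labels. The thread does not confirm this, so treat it as an assumption to check. A small sketch of how that could be checked, where `duplicated_ids` is a hypothetical helper and the toy table stands in for one of the real files:

```python
import pandas as pd


def duplicated_ids(table):
    """Return the sorted index labels that occur more than once in `table`."""
    index = table.index
    return sorted(index[index.duplicated(keep="first")].unique())


# Toy table in which gene1 is listed twice
toy = pd.DataFrame(
    {"score": [20, 30, 25]},
    index=pd.Index(["gene1", "gene2", "gene1"], name="gene_id"),
)

print(toy.index.is_unique)   # False
print(duplicated_ids(toy))   # ['gene1']
```

Running this on each file read with `index_col=0` shows whether repeated ids are present, and which ones, before deciding how to collapse them.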
