Automated combinatorial DataFrame generation in Python/pandas

Question

I'm quite new to pandas and python, and I'm coming from a background in biochemistry and drug discovery. One frequent task that I'd like to automate is the conversion of a list of combination of drug treatments and proteins to a format that contains all such combinations.

For instance, if I have a DataFrame containing a given set of combinations: https://github.com/colinhiggins/dillydally/blob/master/input.csv, I'd like to turn it into https://github.com/colinhiggins/dillydally/blob/master/output.csv such that each protein (1, 2, and 3) are copied n times to an output DataFrame where the number of rows, n, is the number of drugs and drug concentrations plus one for a no-drug row of each protein.

Ideally, the degree of combination would be dictated by some other table that indicates relationships, for example if proteins 1 and 2 are to be treated with drugs 1, 2, and 3 but that protein 2 isn't treated with any drugs.

I'm thinking some kind of nested for loop is going to be required, but I can't wrap my head around just quite how to start it.

What will you use this for? Are you ultimately going to use this to construct a design matrix? — Phillip Cloud, Sep 08 '13 at 08:24
I'm not familiar with the concept of a design matrix. My plan is to create a script to take input conditions that will then assign a space-efficient physical layout of all needed samples in a 96 or 384 well plate (12x8 or 24x16 wells, respectively). Then do the experiment and append the DataFrame with a new column of results for each row. That will be used to output the data in concentration-response curves for each protein*drug pair... all in an automated fashion. — Colin Higgins, Sep 08 '13 at 17:47

score 3 · Answer 1 · answered Feb 02 '18 at 02:14

Consider the following solution

from itertools import product
import pandas
protein = ['protein1' , 'protein2' , 'protein3' ]
drug = ['drug1' , 'drug2', 'drug3']
drug_concentration = [100,30,10]

df = pandas.DataFrame.from_records( list( i for i in product(protein, drug, drug_concentration ) ) , columns=['protein' , 'drug' , 'drug_concentration'] )

>>> df
     protein   drug  drug_concentration
0   protein1  drug1                 100
1   protein1  drug1                  30
2   protein1  drug1                  10
3   protein1  drug2                 100
4   protein1  drug2                  30
5   protein1  drug2                  10
6   protein1  drug3                 100
7   protein1  drug3                  30
8   protein1  drug3                  10
9   protein2  drug1                 100
10  protein2  drug1                  30
11  protein2  drug1                  10
12  protein2  drug2                 100
13  protein2  drug2                  30
14  protein2  drug2                  10
15  protein2  drug3                 100
16  protein2  drug3                  30
17  protein2  drug3                  10
18  protein3  drug1                 100
19  protein3  drug1                  30
20  protein3  drug1                  10
21  protein3  drug2                 100
22  protein3  drug2                  30
23  protein3  drug2                  10
24  protein3  drug3                 100
25  protein3  drug3                  30
26  protein3  drug3                  10

This is basically a cartesian product you're after, which is the functionality of the product function in the itertools module. I'm admitedly confused why you want the empty rows that just list out the proteins with nan's in the other columns. Not sure if that was intentional or accidental. If the datatypes were uniform and numeric this is similar functionality to what's known as a meshgrid.

score 0 · Answer 2 · edited May 23 '17 at 10:31

I've worked through part of this with the help of add one row in a pandas.DataFrame using the method recommended by ShikharDua of creating a list of dicts, each dict corresponding to a row in the eventual DataFrame.

The code is:

data = pandas.read_csv('input.csv')
dict1 = {"protein":"","drug":"","drug_concentration":""} #should be able to get this automatically using the dataframe columns, I think
rows_list = []
for unique_protein in data.protein.unique():
    dict1  = {"protein":unique_protein,"drug":"","drug_concentration":""}
    rows_list.append(dict1)
    for unique_drug in data.drug.unique():
        for unique_drug_conc  in data.drug_concentration.unique():
            dict1  = {"protein":unique_protein,"drug":unique_drug,"drug_concentration":unique_drug_conc}
            rows_list.append(dict1)
df = pandas.DataFrame(rows_list)
df

It isn't as flexible as I was hoping, since the extra row from protein with no drugs is hard-coded into the nested for loops, but at least its a start. I guess I can add some if statements within each for loop.

score 0 · Answer 3 · answered Sep 09 '13 at 00:35

I've improved upon the earlier version

enclosed it in a function
added a check for proteins that won't be treated with drugs from another input CSV file that contains the same proteins in column A and either true or false in column B labeled "treat with drugs"
Skips null values. I noticed that my example input.csv had equal length columns, and the function started going a little nuts with NaN rows if they had unequal lengths.
Initial dictionary keys are set from the columns from the initial input CSV instead of hard-coding them.

I tested this with some real data (hence the change from input.csv to realinput.csv), and it works quite nicely.

Code for a fully functional python file follows:

import pandas
import os
os.chdir("path_to_directory_containing_realinput_and_boolean_file")
realinput = pandas.read_csv('realinput.csv')
rows_list = []
dict1 = dict.fromkeys(realinput.columns,"")
prot_drug_bool = pandas.read_csv('protein_drug_bool.csv')
prot_drug_bool.index = prot_drug_bool.protein
prot_drug_bool = prot_drug_bool.drop("protein",axis=1)

def null_check(value):
    return pandas.isnull(value)

def combinator(input_table):
    for unique_protein in input_table.protein.unique():
        dict1 = dict.fromkeys(realinput.columns,"")
        dict1['protein']=unique_protein
        rows_list.append(dict1)
        if prot_drug_bool.ix[unique_protein]:
            for unique_drug in input_table.drug.unique():
                if not null_check(unique_drug):
                    for unique_drug_conc in input_table.drug_concentration.unique():
                        if not null_check(unique_drug_conc):
                            dict1 = dict.fromkeys(realinput.columns,"")
                            dict1['protein']=unique_protein
                            dict1['drug']=unique_drug                        
                            dict1['drug_concentration']=unique_drug_conc
                            rows_list.append(dict1)
    df = pandas.DataFrame(rows_list)
    return df
df2 = combinator(realinput)
df2.to_csv('realoutput.csv')

I'd still like to make it more versatile by getting away from hard-coding any dictionary keys and letting the user-defined input.csv column headers dictate the output. Also, I'd like to move away from the defined three-column setup to handle any number of columns.

Automated combinatorial DataFrame generation in Python/pandas

3 Answers3