
Note: I have simplified this code and some background information quite a bit in order to pinpoint the exact issue. If you want me to explain additional aspects of this code, please ask in the comments.

I am trying to write code that aggregates two or more dictionaries and ultimately produces a .csv file with the following basic layout:

                 TF-A       TF-A   TF-B       TF-B
ids   gene name  score sum  hits   score sum  hits  ...  gene description
id1   gene A     53.85      14     37.65      7     ...  stuff
id2   gene B     97.55      11     63.94      8     ...  stuff
id3   gene C     88.67      9      79.43      12    ...  stuff
id4   gene D     69.35      12     13.03      13    ...  stuff
...   ......     ......     ...    ......     ...   ...  ......
idx   gene Z     49.32      8      84.03      10    ...  stuff

The dictionaries have gene names as keys, and each value is an array of scores (a score is the calculated probability of a transcription factor, a.k.a. TF, binding to the gene at a certain position). Each TF has one dict, whose keys are only the genes for which that TF has at least one score. After the dictionaries are loaded, set intersection is used to generate the list of genes that all of the given transcription factors have in common, and a for loop then places those genes in the "gene name" column of the dataframe. Thanks to classes I built earlier (not shown), I can easily retrieve the common name and description of each gene and place the description in the "gene description" column using df.set_value.
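A minimal sketch of the set-intersection step described above. The dict names, gene ids, and scores are made-up placeholders, not the real data:

```python
# Hypothetical per-TF dictionaries: gene id -> list of binding scores.
tf_a_scores = {"id1": [10.5, 43.35], "id2": [97.55], "id3": [88.67]}
tf_b_scores = {"id1": [37.65], "id3": [40.0, 39.43], "id4": [13.03]}

tf_dicts = [tf_a_scores, tf_b_scores]

# Intersect the key sets so that only genes scored by every TF remain.
common_genes = sorted(set.intersection(*(set(d) for d in tf_dicts)))
print(common_genes)
```

Because the intersection is taken over a list of dicts, the same line works for any number of TFs.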

for id in common_names:
    gene_name = idconverter.getgene(id).gene_name
    gene_description = idconverter.getgene(id).description
    # DataFrame.set_value was removed in pandas 1.0; df.loc[row, column] does the same job
    df.loc[id, 'gene name'] = gene_name
    df.loc[id, 'gene description'] = gene_description

However, the number of columns in the dataframe depends on the number of transcription factors the user wishes to analyze. A user who enters two TFs adds four columns to the dataframe: two columns for TF-A (sum of scores and number of scores, or 'hits') and two columns for TF-B. Three TFs yield six columns, four TFs yield eight columns, and so on. The ID, gene name, and description columns are constant.

So, before I build the dataframe, I make a list that expands for every TF given.

ColumnList = []
for tf_id in tf_id_list:    
    ColumnList+=['{} Total Sum of Hits'.format(tf_id), '{} Number of Hits'.format(tf_id)]

With this list, I concatenate it with the other column names, and then instantiate my dataframe.

from pandas import DataFrame
df = DataFrame(columns=['ids', 'gene name'] + ColumnList + ['gene description'])

As shown above, I can easily set the name and description in the correct cell of the dataframe. I can also easily calculate, from the original dicts, the number of hits and the total sum of scores for each gene and each TF, but I have no idea how to place these values in the corresponding cells. Because the number of columns depends on the TFs entered, I do not know how to write code that accommodates a variable number of columns, or how to select a column based on the TF it belongs to. Can anybody recommend a suitable code structure and/or methods?

I have done some research, and I did see a method whereby one can add in a certain piece of information into a cell based on what kind of information is present in another column (but in the same row):

Modifying a subset of rows in a pandas dataframe

df.ix[df.A==0, 'B'] = np.nan

If you read the code in the link above, this line puts a NaN into the 'B' column wherever a zero is present in the 'A' column. I thought I could use this approach, but I need to add the number of hits and the total sum of scores according to whether they relate to the first TF, the second TF, the third TF, and so on. Would one write:

for id in common_genes:
    for tf_id in tf_id_list:
        df.ix[df.'{} Number of Hits'.format(tf_id)== tf_id]= number_hits
        df.ix[df.'{} Total Sum of Scores'.format(tf_id)== tf_id]= sum_scores    

I don't believe that's correct, since my IDE reports a syntax error. I should also note that I have simplified the above code a little: the variables "number_hits" and "sum_scores" are actually derived from a dictionary that has gene names as keys and, as values, a list holding the number of hits, the score sum, and the name of the TF they belong to.
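The snippet above fails to parse because attribute access (`df.name`) cannot take a computed string; a column whose name is built at runtime has to be reached through bracket or `.loc` indexing instead. A minimal sketch, assuming a recent pandas (where `.ix` has been removed and `.loc` takes its place) and using made-up TF names, gene ids, and numbers:

```python
import numpy as np
import pandas as pd

tf_id_list = ["TF-A", "TF-B"]  # hypothetical TF identifiers
columns = []
for tf_id in tf_id_list:
    columns += ["{} Total Sum of Scores".format(tf_id),
                "{} Number of Hits".format(tf_id)]

# Float frame pre-filled with NaN; the two row labels are placeholder gene ids.
df = pd.DataFrame(index=["id1", "id2"], columns=columns, dtype=float)

# Build the column names at runtime, then assign by (row label, column label).
sum_col = "{} Total Sum of Scores".format("TF-A")
hits_col = "{} Number of Hits".format("TF-A")
df.loc["id1", sum_col] = 53.85
df.loc["id1", hits_col] = 14
df.loc["id2", sum_col] = 0.0
df.loc["id2", hits_col] = 0

# The boolean-mask pattern from the linked answer, written with .loc:
# blank out the score sum wherever the hit count is zero.
df.loc[df[hits_col] == 0, sum_col] = np.nan
print(df)
```

The mask form selects rows by a condition on their values, while the label form selects a single known row, which is closer to what filling one gene at a time requires.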

Bob McBobson

1 Answer


In the end, I decided to make a dict of dicts, since I realized that this data structure was the best fit for what I needed. The information from the original dicts is stored inside one outer dict (the "total_dict"), and an entry is accessed only if its gene is present in the list of common_genes (which was derived from the set of genes that all the transcription factors have in common). A for loop goes through the TFs in total_dict and checks whether the gene id is present in that TF's dict; if it is, the information in the value (i.e. sum of scores and number of hits) is added to the correct row (based on the gene name, or id) and column (based on the TF at hand).

for id in common_names:
    gene_name = idconverter.getgene(id).gene_name
    gene_description = idconverter.getgene(id).description
    # df.loc replaces df.set_value, which was removed in pandas 1.0
    df.loc[id, 'gene name'] = gene_name
    df.loc[id, 'gene description'] = gene_description
    for tf_id in total_dict:
        if id in total_dict[tf_id]:
            df.loc[id, '{} Score Sum'.format(tf_id)] = total_dict[tf_id][id][0]       # sum of scores
            df.loc[id, '{} Number of Hits'.format(tf_id)] = total_dict[tf_id][id][1]  # number of hits
print(df.head())
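A self-contained sketch of the same dict-of-dicts idea, for anyone who wants to run it without the idconverter class. Note that it swaps in a different technique from the loop above: each row is collected as a plain dict first and the frame is built in one step, rather than being filled cell by cell. The TF names, gene ids, and numbers are illustrative placeholders, and the (score sum, number of hits) layout of the values is an assumption that mirrors the `[0]` and `[1]` indexing above:

```python
import pandas as pd

# Hypothetical dict of dicts: TF id -> {gene id -> (score sum, number of hits)}.
total_dict = {
    "TF-A": {"id1": (53.85, 14), "id2": (97.55, 11), "id3": (88.67, 9)},
    "TF-B": {"id1": (37.65, 7), "id2": (63.94, 8), "id4": (13.03, 13)},
}

# Genes that every TF has an entry for.
common_genes = sorted(set.intersection(*(set(g) for g in total_dict.values())))

# One row per gene; the column names are generated from whichever TFs are present.
rows = {}
for gene_id in common_genes:
    row = {}
    for tf_id, genes in total_dict.items():
        score_sum, n_hits = genes[gene_id]
        row["{} Score Sum".format(tf_id)] = score_sum
        row["{} Number of Hits".format(tf_id)] = n_hits
    rows[gene_id] = row

# Outer keys become the row index, inner keys become the columns.
df = pd.DataFrame.from_dict(rows, orient="index")
print(df.head())
```

Adding a third TF to total_dict adds two more columns without any other change, which is the variable-width behaviour the question asks for.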

Long story short, if you are dealing with many different kinds of data, make sure the data structures you choose can actually be manipulated to build the results you want to generate.

Bob McBobson