Note: I have simplified this code and some background information quite a bit in order to pinpoint the exact issue. If you want me to explain additional aspects of this code, please ask in the comments.
I am currently trying to write code that aggregates two or more dictionaries and ultimately produces a .csv file with the following basic layout:
                TF-A       TF-A  TF-B       TF-B
ids  gene name  score sum  hits  score sum  hits  ...  gene description
id1  gene A     53.85      14    37.65      7     ...  stuff
id2  gene B     97.55      11    63.94      8     ...  stuff
id3  gene C     88.67      9     79.43      12    ...  stuff
id4  gene D     69.35      12    13.03      13    ...  stuff
...  ......     ......     ...   ...        ...   ...  ..........
idx  gene Z     49.32      8     84.03      10    ...  stuff
The dictionaries use gene names as keys, and each value is an array of scores (each score is the calculated probability of a transcription factor, a.k.a. TF, binding to a gene at a certain position). Each TF has one dict, whose keys are only the genes for which that TF has at least one score. After the dictionaries are opened, set intersection is used to generate a list of genes that all given transcription factors have in common, and a for loop then organizes these into the "gene name" column of the dataframe. Using class structures I built earlier (not shown), I can easily retrieve the common name and description of each gene and place them in the "gene name" and "gene description" columns using df.set_value.
for id in common_names:
    gene_name = idconverter.getgene(id).gene_name
    gene_description = idconverter.getgene(id).description
    # Write both values into the row labelled by this gene id.
    df.set_value(id, 'gene name', gene_name)
    df.set_value(id, 'gene description', gene_description)
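The set-intersection step described above is not shown in my simplified code, but it could look roughly like this. This is a minimal sketch under assumptions: `tf_dicts` and every id and score in it are hypothetical stand-ins for the real dictionaries, which map gene ids to arrays of scores.

```python
# Hypothetical stand-in data: one dict per TF, mapping gene id -> list of scores.
tf_dicts = {
    'TF-A': {'id1': [0.9, 0.4], 'id2': [0.7], 'id3': [0.2, 0.1, 0.5]},
    'TF-B': {'id1': [0.3], 'id3': [0.8, 0.6], 'id4': [0.5]},
}

# Intersect the key sets of all dicts so that only genes scored by every TF remain.
common_names = sorted(set.intersection(*(set(d) for d in tf_dicts.values())))
print(common_names)  # ['id1', 'id3']
```

Sorting is only there to make the row order reproducible; the intersection itself is unordered.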
However, the number of columns in the dataframe depends on the number of transcription factors the user wishes to analyze. A user putting in two TFs will add four columns to the dataframe: two columns for TF-A (sum of scores and number of scores, or 'hits'), and two columns for TF-B. Inputting three TFs will yield six columns, inputting four TFs will yield eight columns, and so on. The ID, gene name, and description columns are constant.
So, before I build the dataframe, I make a list that expands for every TF given.
ColumnList = []
for tf_id in tf_id_list:
    # Two columns per TF: the summed scores and the number of hits.
    ColumnList += ['{} Total Sum of Scores'.format(tf_id), '{} Number of Hits'.format(tf_id)]
I then concatenate this list with the other column names and instantiate my dataframe.
df= DataFrame(columns= ['ids','gene name']+ColumnList+['gene description'])
As shown above, I can easily set the names and description in the correct cell of the dataframe. I can also easily calculate the number of hits and total sum of scores for each gene and each TF from the original dicts, but I have NO idea how to place this information in the corresponding cells. Because the number of columns depends on the inputted TFs, I do not know what kind of code I should write to accommodate this variable number of columns, or how to select a column based on the TF it belongs to. Can anybody recommend the proper code setup and/or methods?
I have done some research, and I did see a method whereby one can add in a certain piece of information into a cell based on what kind of information is present in another column (but in the same row):
Modifying a subset of rows in a pandas dataframe
df.ix[df.A==0, 'B'] = np.nan
If you read the code in the link provided above, this piece of code puts a NaN into the 'B' column wherever a zero is present in the 'A' column. I thought I could use the same approach, but I need to add the number of hits and total sum of scores depending on whether they relate to the first TF, second TF, third TF, and so on. Would one write:
for id in common_names:
    for tf_id in tf_id_list:
        df.ix[df.'{} Number of Hits'.format(tf_id)== tf_id]= number_hits
        df.ix[df.'{} Total Sum of Scores'.format(tf_id)== tf_id]= sum_scores
I don't believe that's correct, since my IDE reports a syntax error. I should also note that I have simplified the above code a little bit: the variables "number_hits" and "sum_scores" are actually derived from a dictionary that contains gene names as keys, and a list of hits, score sum, and corresponding TF name as values.
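To make the target concrete, here is a sketch of the kind of result I am after, showing one possible way to address a cell when the column name is built from the TF id. Column labels are ordinary strings, so the same `format` call used to build the column list can be passed as the column label to `df.loc`. Everything here is an assumption for illustration: the `results` dictionary, its (gene id, TF id) key layout, and all values are hypothetical; the gene name and description columns are left out for brevity; and `.loc` is used instead of `.ix` and `set_value`, which recent pandas versions no longer provide.

```python
import pandas as pd

tf_id_list = ['TF-A', 'TF-B']
common_names = ['id1', 'id3']

# Hypothetical layout: (gene id, TF id) -> (sum of scores, number of hits).
results = {
    ('id1', 'TF-A'): (1.3, 2), ('id1', 'TF-B'): (0.3, 1),
    ('id3', 'TF-A'): (0.8, 3), ('id3', 'TF-B'): (1.4, 2),
}

# Build two column labels per TF, exactly as strings.
column_list = []
for tf_id in tf_id_list:
    column_list += ['{} Total Sum of Scores'.format(tf_id),
                    '{} Number of Hits'.format(tf_id)]

# Rows are labelled by gene id, so a (row label, column label) pair picks one cell.
df = pd.DataFrame(index=common_names, columns=column_list)

for gene_id in common_names:
    for tf_id in tf_id_list:
        sum_scores, number_hits = results[(gene_id, tf_id)]
        df.loc[gene_id, '{} Total Sum of Scores'.format(tf_id)] = sum_scores
        df.loc[gene_id, '{} Number of Hits'.format(tf_id)] = number_hits

print(df)
```

The loop works for any number of TFs because the column label is rebuilt from `tf_id` on every pass rather than being hard-coded.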