Comparing key from first dictionary to values from second dictionary

Question

Please I need some help again.

I have a large database file (let's call it db.csv) containing a lot of information.

Simplified database file to illustrate:

I ran usearch61 -cluster_fast on my gene sequences in order to cluster them.
I obtained a file named 'clusters.uc'. I opened it as a CSV and then wrote code to build a dictionary (let's say dict_1) with cluster numbers as keys and gene_ids (VFG...) as values.
Here is an example of the result, which I then stored in a file: dict_1

 0 ['VFG003386', 'VFG034084', 'VFG003381']  
 1 ['VFG000838', 'VFG000630', 'VFG035932', 'VFG000636']  
 2 ['VFG018349', 'VFG018485', 'VFG043567']  
 ...  
 14471 ['VFG015743', 'VFG002143']

So far so good. Then, using db.csv, I made another dictionary (dict_2) where gene_ids (VFG...) are keys and VF_Accessions (IA..., CVF... or VF...) are values. Illustration: dict_2

 VFG044259 IA027
 VFG044258 IA027
 VFG011941 CVF397
 VFG012016 CVF399
 ...

What I want in the end is to have, for each VF_Accession, the numbers of its cluster groups. Illustration:

IA027 [0,5,6,8]
CVF399 [15, 1025, 1562, 1712]
...

Since I'm still a beginner in coding, I guess I need to write code that compares the values from dict_1 (VFG...) to the keys from dict_2 (VFG...). If they match, it should put the VF_Accession as a key with all of its cluster numbers as values. Since VF_Accessions are keys, they can't have duplicates, so I need a dictionary of lists. I think I can do that because I did it for dict_1. My problem is that I can't figure out a way to compare the values from dict_1 to the keys from dict_2 and assign each VF_Accession its cluster numbers. Please help me.

I don't know much about bio - can the same gene_id (VFG) show up in multiple clusters? — Shane Spoor, Jul 19 '17 at 09:35
Yes, some of them do, unfortunately. Maybe I could have something like IA027 [0|12, 5, 6, 8] or IA027 [0(12), 5, 6, 8] — rookie max, Jul 19 '17 at 09:36

BioGeek · Accepted Answer · 2017-07-19T14:13:54.130

First, let's give your dictionaries better names than dict_1, dict_2, ... That makes them easier to work with and makes it easier to remember what they contain.

You first created a dictionary that has cluster numbers as keys and gene_ids (VFG...) as values:

cluster_nr_to_gene_ids = {0: ['VFG003386', 'VFG034084', 'VFG003381', 'VFG044259'],
                          1: ['VFG000838', 'VFG000630', 'VFG035932', 'VFG000636'],
                          2: ['VFG018349', 'VFG018485', 'VFG043567', 'VFG012016'],
                          5: ['VFG011941'],
                          7949: ['VFG003386'],                              
                          14471: ['VFG015743', 'VFG002143', 'VFG012016']}

And you also have another dictionary where gene_ids are keys and VF_Accessions (IA... or CVF.. or VF...) are values:

gene_id_to_vf_accession = {'VFG044259': 'IA027',
                           'VFG044258': 'IA027',
                           'VFG011941': 'CVF397',
                           'VFG012016': 'CVF399',
                           'VFG000676': 'VF0142',
                           'VFG002231': 'VF0369',
                           'VFG003386': 'CVF051'}

And we want to create a dictionary where each VF_Accession key has as its value the numbers of its cluster groups: vf_accession_to_cluster_groups.

We also note that a VF Accession can belong to multiple gene IDs (for example, the VF Accession IA027 has both the VFG044259 and the VFG044258 gene IDs).

So we use defaultdict to make a dictionary with VF Accession as key and a list of gene IDs as value:

from collections import defaultdict
vf_accession_to_gene_ids = defaultdict(list)
for gene_id, vf_accession in gene_id_to_vf_accession.items():
    vf_accession_to_gene_ids[vf_accession].append(gene_id)

For the sample data I posted above, vf_accession_to_gene_ids now looks like:

defaultdict(<class 'list'>, {'VF0142': ['VFG000676'], 
                             'CVF051': ['VFG003386'], 
                             'IA027':  ['VFG044258', 'VFG044259'],
                             'CVF399': ['VFG012016'], 
                             'CVF397': ['VFG011941'], 
                             'VF0369': ['VFG002231']})
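
If defaultdict is new to you, it may help to see roughly the same grouping written with a plain dictionary. This is only a sketch on a shortened copy of the sample data: setdefault() creates the empty list the first time a VF Accession is seen, which is the step defaultdict(list) takes care of automatically.

```python
# Shortened copy of the sample mapping above (gene ID -> VF Accession).
gene_id_to_vf_accession = {'VFG044259': 'IA027',
                           'VFG044258': 'IA027',
                           'VFG011941': 'CVF397'}

# Plain-dict version of the grouping: setdefault() returns the existing list
# for this VF Accession, or stores and returns a new empty list if there is none.
vf_accession_to_gene_ids = {}
for gene_id, vf_accession in gene_id_to_vf_accession.items():
    vf_accession_to_gene_ids.setdefault(vf_accession, []).append(gene_id)
```

Unlike a defaultdict, looking up an accession that is not in this plain dictionary raises a KeyError instead of silently creating an empty list.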

Now we can loop over each VF Accession and look up its list of gene IDs. Then, for each gene ID, we loop over every cluster and see if the gene ID is present there:

vf_accession_to_cluster_groups = {}
for vf_accession in vf_accession_to_gene_ids:
    gene_ids = vf_accession_to_gene_ids[vf_accession]
    cluster_group = []
    for gene_id in gene_ids:
        for cluster_nr in cluster_nr_to_gene_ids:
            if gene_id in cluster_nr_to_gene_ids[cluster_nr]:
                cluster_group.append(cluster_nr)
    vf_accession_to_cluster_groups[vf_accession] = cluster_group

The end result for the above sample data now is:

{'VF0142': [], 
 'CVF051': [0, 7949], 
 'IA027':  [0], 
 'CVF399': [2, 14471], 
 'CVF397': [5], 
 'VF0369': []}
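
The nested loops above scan every cluster once per gene ID, which may get slow with thousands of clusters. A possible variation, sketched here on a subset of the sample data, inverts the cluster dictionary once and then does direct lookups. The names gene_id_to_cluster_nrs and clusters_per_accession are made up for this sketch, and using sets is my own assumption: it keeps a cluster number from being listed twice when two gene IDs of the same VF Accession sit in the same cluster.

```python
from collections import defaultdict

# Subset of the sample data used above.
cluster_nr_to_gene_ids = {0: ['VFG003386', 'VFG034084', 'VFG003381', 'VFG044259'],
                          2: ['VFG018349', 'VFG018485', 'VFG043567', 'VFG012016'],
                          5: ['VFG011941'],
                          7949: ['VFG003386'],
                          14471: ['VFG015743', 'VFG002143', 'VFG012016']}
gene_id_to_vf_accession = {'VFG044259': 'IA027',
                           'VFG044258': 'IA027',
                           'VFG011941': 'CVF397',
                           'VFG012016': 'CVF399',
                           'VFG000676': 'VF0142',
                           'VFG003386': 'CVF051'}

# Step 1: invert the cluster dictionary once (gene ID -> set of cluster numbers).
gene_id_to_cluster_nrs = defaultdict(set)
for cluster_nr, gene_ids in cluster_nr_to_gene_ids.items():
    for gene_id in gene_ids:
        gene_id_to_cluster_nrs[gene_id].add(cluster_nr)

# Step 2: collect the clusters of every gene ID under its VF Accession.
clusters_per_accession = defaultdict(set)
for gene_id, vf_accession in gene_id_to_vf_accession.items():
    # .get() avoids adding empty entries for gene IDs that are in no cluster
    clusters_per_accession[vf_accession] |= gene_id_to_cluster_nrs.get(gene_id, set())

# Convert to sorted lists so the result has the same shape as the one above.
vf_accession_to_cluster_groups = {vf_accession: sorted(cluster_nrs)
                                  for vf_accession, cluster_nrs in clusters_per_accession.items()}
```

For this subset the result matches the output shown above, except that VF0369 is missing because its gene ID was left out of the shortened input.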

I'm truly, truly thankful for your help, but a few problems remain, if you can please help me more: in cluster_nr_to_gene_ids the same gene_id can have several cluster numbers. Illustration: 0 ['VFG003386'] 7949 ['VFG003386'], so the vf_accession should contain both of those cluster groups, CVF051 [0, 7949], but it's only giving me one: CVF051 [0] — rookie max, Jul 19 '17 at 13:07
@rookiemax, my code works when a gene ID is in multiple clusters; see my sample data, which I have updated with the example you provided. So either you're doing something wrong, or you need to provide me with a more complete data set to see where things go wrong. — BioGeek, Jul 19 '17 at 14:10
You are right, I was doing something wrong, my bad. It worked perfectly after I removed a line of my code :D I am really grateful, thank you a bunch for your help :D really, thank you :D — rookie max, Jul 19 '17 at 14:33

Shane Spoor · Answer 2 · 2017-07-20T01:39:10.410

Caveat: I don't do much Python development, so there's likely a better way to do this. You can first map your VFG... gene_ids to their cluster numbers, and then use that to process the second dictionary:

from collections import defaultdict
import sys
import ast

# see https://stackoverflow.com/questions/960733/python-creating-a-dictionary-of-lists
vfg_cluster_map = defaultdict(list)

# map all of the vfg... keys to their cluster numbers first
with open(sys.argv[1], 'r') as dict_1:
    for line in dict_1:
        # split the line at the first space to separate the cluster number and gene ID list
        # e.g. after splitting the line "0 ['VFG003386', 'VFG034084', 'VFG003381']",
        # cluster_group_num holds "0", and vfg_list holds "['VFG003386', 'VFG034084', 'VFG003381']"
        cluster_group_num, vfg_list = line.strip().split(' ', 1)
        cluster_group_num = int(cluster_group_num)

        # convert "['VFG...', 'VFG...']" from a string to an actual list
        vfg_list = ast.literal_eval(vfg_list)
        for vfg in vfg_list:
            vfg_cluster_map[vfg].append(cluster_group_num)

# you now have a dictionary mapping gene IDs to the clusters they
# appear in, e.g 
# {'VFG003386': [0],
#  'VFG034084': [0],
#  ...}
# you can look in that dictionary to find the cluster numbers corresponding
# to your vfg... keys in dict_2 and add them to the list for that vf_accession
vf_accession_cluster_map = defaultdict(list)
with open(sys.argv[2], 'r') as dict_2:
    for line in dict_2:
        vfg, vf_accession = line.strip().split(' ')

        # add the list of cluster numbers corresponding to this vfg... to
        # the list of cluster numbers corresponding to this vf_accession 
        vf_accession_cluster_map[vf_accession].extend(vfg_cluster_map[vfg])

for vf_accession, cluster_list in vf_accession_cluster_map.items():
    # print() with parentheses runs under both Python 2 and Python 3
    print(vf_accession + ' ' + str(cluster_list))

Then save the above script and invoke it like python <script name> dict1_file dict2_file > output (or you could write the strings to a file instead of printing them and redirecting).
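
If you would rather write the file from the script itself, a minimal sketch might look like the following. The function name write_cluster_map and the output file name are made up for illustration, and sorting the accessions is just my choice to get a stable line order.

```python
def write_cluster_map(vf_accession_cluster_map, output_path):
    """Write one 'VF_Accession [cluster numbers]' line per accession."""
    with open(output_path, 'w') as output_file:
        # sorted() gives a stable, alphabetical order of accessions
        for vf_accession in sorted(vf_accession_cluster_map):
            cluster_list = vf_accession_cluster_map[vf_accession]
            output_file.write(vf_accession + ' ' + str(cluster_list) + '\n')


# Example call with a small hand-made mapping and a hypothetical file name.
write_cluster_map({'IA027': [0], 'CVF399': [2, 14471]}, 'vf_accession_clusters.txt')
```

In the script above you would call it with vf_accession_cluster_map in place of the final print loop.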

EDIT: After looking at @BioGeek's answer, I should note that it would make more sense to process this all in one shot than to create dict_1 and dict_2 files, read them in, parse the lines back into numbers and lists, etc. If you don't need to write the dictionaries to a file first, then you can just add your other code to the script and use the dictionaries directly.

I actually used some of your code today to resolve a problem of mine. Plus I learned new things for python coding so thanks again :D — rookie max, Jul 20 '17 at 08:00
