I am trying to compare certain values from two different sources (hence the two dictionaries) to find out which entries actually belong together. To illustrate, here is a shortened version of both dictionaries with dummy data (line breaks added for clarity):

dict_1 = {'ins1': {'Start': 100, 'End': 110, 'Size': 10}, 
'ins2': {'Start': 150, 'End': 250, 'Size': 100}, 
'del1': {'Start': 210, 'End': 220, 'Size': 10}, 
'del2': {'Start': 260, 'End': 360, 'Size': 100}, 
'dup1': {'Start': 340, 'End': 350, 'Size': 10, 'Duplications': 3}, 
'dup2': {'Start': 370, 'End': 470, 'Size': 100, 'Duplications': 3}}

dict_2 = {'0': {'Start': 100, 'Read': 28, 'Prec': 'PRECISE', 'Size': 10, 'End': 110}, 
'1': {'Start': 500, 'Read': 38, 'Prec': 'PRECISE', 'Size': 100, 'End': 600}, 
'2': {'Start': 210, 'Read': 27, 'Prec': 'PRECISE', 'Size': 10, 'End': 220}, 
'3': {'Start': 650, 'Read': 31, 'Prec': 'IMPRECISE', 'Size': 100, 'End': 750}, 
'4': {'Start': 370, 'Read': 31, 'Prec': 'PRECISE', 'Size': 100, 'End': 470}, 
'5': {'Start': 340, 'Read': 31, 'Prec': 'PRECISE', 'Size': 10, 'End': 350}, 
'6': {'Start': 810, 'Read': 36, 'Prec': 'PRECISE', 'Size': 10, 'End': 820}}

What I want to compare are the "Start" and "End" values (and others, not shown here). If they match, I want to build a new dict (dict_3) that looks something like this:

dict_3 = {'ins1': {'Start_d1': 100, 'Start_d2': 100, 'dict_2_ID': '0', etc},
'del1': {'Start_d1': 210, 'Start_d2': 210, 'dict_2_ID': '2', etc}}

P.S. I need both Start_d1 and Start_d2, because the two values can differ slightly (±5).

I have already tried several approaches from Stack Overflow, such as "Concatenating dictionaries with different keys into Pandas dataframe" (which could work, I think, but I had a lot of trouble with the dataframe format) and "Comparing two dictionaries in Python" (which only works if the dictionary has no top-level key such as ins1, ins2, etc.).

Could someone give me a starting point to build on? I have tried many things already, and the nested dictionary causes trouble with every solution I could find.

Jean-François Fabre
Fini
  • You'll have to transform your dicts so that they are keyed by (start, end) instead; then lookup will be a breeze. – Jean-François Fabre Oct 03 '18 at 09:23
  • But then I lose information: 'ins1', 'ins2', etc. are unique values that I explicitly need to link the results back to the original data (same for the dict_2_ID). – Fini Oct 03 '18 at 09:26

2 Answers

1

You can do something like this perhaps:

dict_1 = {'ins1': {'Start': 100, 'End': 110, 'Size': 10},
'ins2': {'Start': 150, 'End': 250, 'Size': 100}, 
'del1': {'Start': 210, 'End': 220, 'Size': 10}, 
'del2': {'Start': 260, 'End': 360, 'Size': 100}, 
'dup1': {'Start': 340, 'End': 350, 'Size': 10, 'Duplications': 3}, 
'dup2': {'Start': 370, 'End': 470, 'Size': 100, 'Duplications': 3}}

dict_2 = {'0': {'Start': 100, 'Read': 28, 'Prec': 'PRECISE', 'Size': 10, 'End': 110},
'1': {'Start': 500, 'Read': 38, 'Prec': 'PRECISE', 'Size': 100, 'End': 600}, 
'2': {'Start': 210, 'Read': 27, 'Prec': 'PRECISE', 'Size': 10, 'End': 220}, 
'3': {'Start': 650, 'Read': 31, 'Prec': 'IMPRECISE', 'Size': 100, 'End': 750}, 
'4': {'Start': 370, 'Read': 31, 'Prec': 'PRECISE', 'Size': 100, 'End': 470}, 
'5': {'Start': 340, 'Read': 31, 'Prec': 'PRECISE', 'Size': 10, 'End': 350}, 
'6': {'Start': 810, 'Read': 36, 'Prec': 'PRECISE', 'Size': 10, 'End': 820}}

dict_3 = {}
for d1 in dict_1:
    for d2 in dict_2:
        if dict_1[d1]["Start"] == dict_2[d2]["Start"] and dict_1[d1]["End"] == dict_2[d2]["End"]:
            dict_3[d1] = {"Start_d1": dict_1[d1]["Start"], "Start_d2": dict_2[d2]["Start"], "dict_2_ID": d2}

print(dict_3)                        

The solution above is O(n^2), which is not very efficient.

However, to make it more efficient (O(n)), you'll need to transform dict_2 so that its keys are built from the "Start" and "End" values (e.g. 'S100E110'); each lookup is then a constant-time dictionary lookup. Then you'll be able to do something like:

key = "S" + str(dict_1[d1]["Start"]) + "E" + str(dict_1[d1]["End"])  # convert the ints before concatenating
if key in dict_2:
    pass  # add to dict_3
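
One way to get the constant-time lookup without losing the 'ins1' names or the dict_2 IDs is to leave both dictionaries untouched and build a separate index. The following is only a sketch under a few assumptions, not the code from this answer: it keys the index on a (Start, End) tuple instead of an 'S100E110' string, and it stores a list of IDs per key because Start/End pairs need not be unique. The names `index`, `d1_id` and `d2_id` are made up for this illustration.

```python
from collections import defaultdict

dict_1 = {'ins1': {'Start': 100, 'End': 110, 'Size': 10},
          'ins2': {'Start': 150, 'End': 250, 'Size': 100},
          'del1': {'Start': 210, 'End': 220, 'Size': 10},
          'del2': {'Start': 260, 'End': 360, 'Size': 100},
          'dup1': {'Start': 340, 'End': 350, 'Size': 10, 'Duplications': 3},
          'dup2': {'Start': 370, 'End': 470, 'Size': 100, 'Duplications': 3}}

dict_2 = {'0': {'Start': 100, 'Read': 28, 'Prec': 'PRECISE', 'Size': 10, 'End': 110},
          '1': {'Start': 500, 'Read': 38, 'Prec': 'PRECISE', 'Size': 100, 'End': 600},
          '2': {'Start': 210, 'Read': 27, 'Prec': 'PRECISE', 'Size': 10, 'End': 220},
          '3': {'Start': 650, 'Read': 31, 'Prec': 'IMPRECISE', 'Size': 100, 'End': 750},
          '4': {'Start': 370, 'Read': 31, 'Prec': 'PRECISE', 'Size': 100, 'End': 470},
          '5': {'Start': 340, 'Read': 31, 'Prec': 'PRECISE', 'Size': 10, 'End': 350},
          '6': {'Start': 810, 'Read': 36, 'Prec': 'PRECISE', 'Size': 10, 'End': 820}}

# Map each (Start, End) pair to every dict_2 ID that has it.
# dict_2 itself is not modified, so no information is lost.
index = defaultdict(list)
for d2_id, rec in dict_2.items():
    index[(rec['Start'], rec['End'])].append(d2_id)

dict_3 = {}
for d1_id, rec in dict_1.items():
    # One dictionary lookup per dict_1 entry instead of a full inner loop.
    for d2_id in index.get((rec['Start'], rec['End']), []):
        # Like the nested-loop version, a later match overwrites an earlier one.
        dict_3[d1_id] = {'Start_d1': rec['Start'],
                         'Start_d2': dict_2[d2_id]['Start'],
                         'dict_2_ID': d2_id}

print(dict_3)
```

This only handles exact matches; it does not cover values that differ by a few units.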
BlackBeard
  • A simple answer indeed (I always overcomplicate things), but wouldn't this cause trouble because dictionaries have no particular order? So would it compare only the first entry with the first entry? – Fini Oct 03 '18 at 09:35
  • Actually it works :), the two loops of course compare everything with everything. Now I just need to find a way to compare similar numbers, but I think I can work that out, thanks :) – Fini Oct 03 '18 at 09:39
  • One question though: is there a faster way? I think that with 10000 entries in each dictionary this could become really slow. Or will that only happen much later (at millions of entries)? – Fini Oct 03 '18 at 09:42
  • Thanks for the quick reply. However, as I mentioned in my reply to @Jean-François Fabre's comment, I would lose a lot of information with this. Additionally, "Start" and "End" are not unique values, so even more information would be lost (keys must be unique, otherwise entries get overwritten). I will just have to accept the n^2 then. However, I am now struggling with the similar-numbers part, because there is no such thing as ± in Python. Using ufloat gives a weird output that is no longer comparable. Do you have a simple answer for that, instead of hard-coding every possibility (+1, +2, +3, etc.)? – Fini Oct 03 '18 at 10:11
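
The ± comparison asked about in the last comment can probably be written with `abs()` rather than by enumerating offsets: two numbers are within 5 of each other when the absolute value of their difference is at most 5. Below is a minimal sketch of the nested loop with that check. The helper `within` and the constant `TOLERANCE` are hypothetical names, and the slightly shifted dummy values are invented for the example; they are not from the question.

```python
TOLERANCE = 5  # assumed from the "+-5" mentioned in the question

def within(a, b, tol=TOLERANCE):
    """Return True if a and b differ by at most tol."""
    return abs(a - b) <= tol

# Invented dummy data with small offsets between the two sources.
dict_1 = {'ins1': {'Start': 100, 'End': 110},
          'del1': {'Start': 210, 'End': 220}}
dict_2 = {'0': {'Start': 103, 'End': 108},   # within 5 of ins1 on both values
          '1': {'Start': 500, 'End': 600},
          '2': {'Start': 216, 'End': 220}}   # Start is 6 away from del1, so no match

dict_3 = {}
for d1_id, r1 in dict_1.items():
    for d2_id, r2 in dict_2.items():
        if within(r1['Start'], r2['Start']) and within(r1['End'], r2['End']):
            dict_3[d1_id] = {'Start_d1': r1['Start'],
                             'Start_d2': r2['Start'],
                             'dict_2_ID': d2_id}

print(dict_3)
```

This is still O(n^2); it only replaces the `==` test, so several dict_2 entries may fall within the tolerance of the same dict_1 entry, and the last one found wins.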
1

You can use Pandas; here's a demo:

import pandas as pd

df1 = pd.DataFrame.from_dict(dict_1, orient='index')
df2 = pd.DataFrame.from_dict(dict_2, orient='index')

res = pd.merge(df1, df2, on=['Start', 'End', 'Size'])

print(res)

   Start  End  Size  Duplications  Read     Prec
0    210  220    10           NaN    27  PRECISE
1    340  350    10           3.0    31  PRECISE
2    370  470   100           3.0    31  PRECISE
3    100  110    10           NaN    28  PRECISE
jpp
  • Thank you! This is indeed a bit more elegant than double looping, and I already thought pandas could help. Is there a way to change the 0,1,2,3 to 'ins1', 'ins2', etc. (see also my comment to @BlackBeard)? I still have trouble with the second part of my question (the ± in Python). Do you have a solution for that? I am currently looking into ufloat, but that has a weird output. I could hard-code every possibility, but I would rather not. – Fini Oct 03 '18 at 10:23
  • `Is there a way to change the 0,1,2,3 to 'ins1', 'ins2' ect.?`: That would be a [new question](https://stackoverflow.com/questions/ask) (although likely already answered elsewhere). I don't understand the second part of your question. – jpp Oct 03 '18 at 10:25
  • `p.s I need both Start_d1 and Start_d2, because they can differ slightly in number (+-5).`, so like `dict_1["Start"] == (dict_2["Start"] ± 5)` – Fini Oct 03 '18 at 10:54
  • @Fini, I see, that's non-trivial; you might want to look up `merge_asof`. – jpp Oct 03 '18 at 10:55
  • Thanks, exactly what I needed but didn't know existed (merge_asof). I used it with `direction='nearest', tolerance=100` to find the nearest "Start" that is no more than 100 away. – Fini Oct 03 '18 at 13:45
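
For reference, the `merge_asof` approach mentioned in the comments might look roughly like the sketch below. It is a reconstruction under assumptions, not the code Fini actually ran: it uses `tolerance=5` (from the ±5 in the question) rather than 100, turns the dictionary keys into `dict_1_ID` / `dict_2_ID` columns so the 'ins1' style names survive, and copies dict_2's "Start" into a `Start_d2` column so that both start values appear in the result. Those column names are choices made for this example.

```python
import pandas as pd

dict_1 = {'ins1': {'Start': 100, 'End': 110, 'Size': 10},
          'ins2': {'Start': 150, 'End': 250, 'Size': 100},
          'del1': {'Start': 210, 'End': 220, 'Size': 10},
          'del2': {'Start': 260, 'End': 360, 'Size': 100},
          'dup1': {'Start': 340, 'End': 350, 'Size': 10, 'Duplications': 3},
          'dup2': {'Start': 370, 'End': 470, 'Size': 100, 'Duplications': 3}}

dict_2 = {'0': {'Start': 100, 'Read': 28, 'Prec': 'PRECISE', 'Size': 10, 'End': 110},
          '1': {'Start': 500, 'Read': 38, 'Prec': 'PRECISE', 'Size': 100, 'End': 600},
          '2': {'Start': 210, 'Read': 27, 'Prec': 'PRECISE', 'Size': 10, 'End': 220},
          '3': {'Start': 650, 'Read': 31, 'Prec': 'IMPRECISE', 'Size': 100, 'End': 750},
          '4': {'Start': 370, 'Read': 31, 'Prec': 'PRECISE', 'Size': 100, 'End': 470},
          '5': {'Start': 340, 'Read': 31, 'Prec': 'PRECISE', 'Size': 10, 'End': 350},
          '6': {'Start': 810, 'Read': 36, 'Prec': 'PRECISE', 'Size': 10, 'End': 820}}

# Turn the outer keys into ordinary columns so they are kept after merging.
# merge_asof requires both frames to be sorted by the merge key.
left = (pd.DataFrame.from_dict(dict_1, orient='index')
        .rename_axis('dict_1_ID').reset_index()
        .sort_values('Start'))
right = (pd.DataFrame.from_dict(dict_2, orient='index')
         .rename_axis('dict_2_ID').reset_index()
         .sort_values('Start'))
right['Start_d2'] = right['Start']  # copy, so dict_2's own Start is visible in the result

# For each dict_1 row, take the dict_2 row with the nearest Start, at most 5 away.
res = pd.merge_asof(left, right, on='Start', direction='nearest',
                    tolerance=5, suffixes=('_d1', '_d2'))
res = res.rename(columns={'Start': 'Start_d1'})

# Rows without a partner have NaN in the dict_2 columns; drop them,
# then also require the End values to agree within the same tolerance.
res = res.dropna(subset=['dict_2_ID'])
res = res[(res['End_d1'] - res['End_d2']).abs() <= 5]

print(res[['dict_1_ID', 'dict_2_ID', 'Start_d1', 'Start_d2', 'End_d1', 'End_d2']])
```

Note that `merge_asof` matches on a single key, which is why "End" is checked in a separate filtering step, and that several dict_1 rows can end up paired with the same dict_2 row.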