intersection of two lists of dictionaries by key/value pair

Question

I have two lists of dictionaries in the format:

systolic_sex = [
        {'attribute': u'bp', 'value_d': 133.0, 'value_s': u'133', 'sid': 6}, 
        {'attribute': u'bp', 'value_d': 127.0, 'value_s': u'127', 'sid': 17}, 
        {'attribute': u'bp', 'value_d': 121.0, 'value_s': u'121', 'sid': 18}, 
        {'attribute': u'bp', 'value_d': 127.0, 'value_s': u'127', 'sid': 27}, 
        {'attribute': u'bp', 'value_d': 120.0, 'value_s': u'120', 'sid': 42},
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 6},      
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 17},   
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 18},
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 27},   
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 42}
    ]



sex = [
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 6},      
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 17},   
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 42}
    ]

I want to match these lists by the value of the key 'sid,' so that if the same value of 'sid' is in both, I have a match, otherwise, I do not. If I have a match, I then append the matching dictionaries by 'sid' from both sets to a new list accordingly like so

new_set = [
        {'attribute': u'bp', 'value_d': 133.0, 'value_s': u'133', 'sid': 6}, 
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 6},
        {'attribute': u'bp', 'value_d': 127.0, 'value_s': u'127', 'sid': 17}, 
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 17},
        {'attribute': u'bp', 'value_d': 120.0, 'value_s': u'120', 'sid': 42},
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 42}
    ]

I've tried various methods of intersecting these, including modifying answers from Match set of dictionaries, but I am looking to create a new list of dictionaries that have the matching sids, not replacing values between the two lists.

Can you show what the desired output should look like in this case? Sounds like you effectively want to join the two "tables" on the `sid` "column"? — Amit Kumar Gupta, Oct 25 '15 at 03:32
LOL the output was nothing like I imagined it for some reason. Deleted my answer. — woot, Oct 25 '15 at 03:46
It was a nice remapping of the data! Useful for future efforts. — horcle_buzz, Oct 25 '15 at 04:05
NB: The set sex could be of the form systolic_sex, that is, having more than one set of different values for the 'attribute' key across a single value of the 'sid' key. For example, I could have two sets, systolic_sex and diastolic_sex, and then do the match of all keys by 'sid'. Sorry if this was not clear in the original post. In any case, @dawg's solution worked for this, even though I did not make this requirement clear. — horcle_buzz, Oct 25 '15 at 16:12

chthonicdaemon · Accepted Answer · 2015-10-25T15:39:43.990

3

You may be interested in using pandas if you're dealing with data like this a lot. Your dictionaries are already in the form pandas likes, so you can do this:

import pandas

systolic_sex = pandas.DataFrame(systolic_sex)
sex = pandas.DataFrame(sex)

matches = systolic_sex[systolic_sex.sid.isin(sex.sid)]

If you want the data back in the same format as you supplied them, you can to

output = matches.to_dict(orient='records')

edited Oct 25 '15 at 15:39

answered Oct 25 '15 at 04:15

chthonicdaemon

19,180
2
52
66

I've actually used Pandas for manipulating data and would have had I not been able to use a python only approach. I actually like this solution much better than the one I accepted as an answer, since it is more parsimonious, minus the extra import. – horcle_buzz Oct 25 '15 at 15:07
1

Yeah, I find that Pandas is becoming quite standard wherever I go, so I don't think it's much overhead for the vastly better handling of data like these. – chthonicdaemon Oct 25 '15 at 15:12
For what I had to do with it, it go a bit complicated. I will give it another chance. – horcle_buzz Oct 25 '15 at 15:17
Last step to convert DataFrame back to list of dictionaries is `matches.T.to_dict().values()` ... nice (you may have convinced me, especially since I have an upcoming project with a LOT of similar hoops to jump through)! – horcle_buzz Oct 25 '15 at 15:28
1

Oh, and I must comment that you are the rare breed of coder/data person I've seen actually use the correct subject-verb agreement for data! (**data are/datum is**) – horcle_buzz Oct 25 '15 at 15:33
Thanks. I'm a lecturer and pedantry goes with the territory. Note that if your data started life in json files (which it appears like) you can read directly into a dataframe with `pandas.read_json`, and the final step to convert is nicer as `matches.to_dict(orient='records')` – chthonicdaemon Oct 25 '15 at 15:38
Pedantry is great! Probably confounds many people, especially if you correct them, like I always do! (^_^) Thanks for the tip. I will give it a try. – horcle_buzz Oct 25 '15 at 15:45

inspectorG4dget · Answer 2 · 2015-10-25T03:52:00.273

1

Going off the answer in the post you linked:

systolic_sex = dict((e['sid'], e) for e in systolic_sex)
sex = set(e['sid'] for e in sex)

matches = []
for sid,v in systolic_sex.items():
    if sid not in sex: continue
    matches.append(v)

edited Oct 25 '15 at 03:52

answered Oct 25 '15 at 03:33

inspectorG4dget

110,290
27
149
241

I tried this based previous to edits, and it worked for the stated problem. However, I realized my first set should have been combined as systolic_sex. I have edited the original post accordingly. – horcle_buzz Oct 25 '15 at 03:47
Got it, partially, but I see where to go with it. Thanks for getting me back on track, and for teaching some new python tricks! (^_^) – horcle_buzz Oct 25 '15 at 04:04

score 1 · Answer 3 · answered Oct 25 '15 at 04:04

1

>>> uniq=set(e['sid'] for e in sex) 
>>> filter(lambda d: d['sid'] in uniq, systolic_sex)
[{'attribute': u'bp', 'sid': 6L, 'value_s': u'133', 'value_d': 133.0},        
 {'attribute': u'bp', 'sid': 17L, 'value_s': u'127', 'value_d': 127.0},  
 {'attribute': u'bp', 'sid': 42L, 'value_s': u'120', 'value_d': 120.0}, 
 {'attribute': u'SEX', 'sid': 6L, 'value_s': u'M', 'value_d': 0.0}, 
 {'attribute': u'SEX', 'sid': 17L, 'value_s': u'M', 'value_d': 0.0}, 
 {'attribute': u'SEX', 'sid': 42L, 'value_s': u'M', 'value_d': 0.0}]

answered Oct 25 '15 at 04:04

dawg

98,345
23
131
206

1

This is one of those cases where a list comprehension is both shorter and prettier: `[d for d in systolic_sex if d['sid'] in uniq]` – chthonicdaemon Oct 25 '15 at 15:14
I just tested this on two more generalized data sets than I gave in my example (with arbitrary duplicates on the value of the key 'sid,' but with more than two non-duplicate values on the key 'attribute', and it worked as anticipated). The pandas solution would more than likely need a bit of tweaking, since it gave the same results @inspectorG4dget's solution gave, which required some tweaking for the more general case. – horcle_buzz Oct 25 '15 at 15:42
To get to the general case, pandas was he easiest of the suggestions to use. – horcle_buzz Oct 26 '15 at 14:41

horcle_buzz · Answer 4 · 2015-10-27T16:05:33.363

1

I ended up using the following (as per @chtohnicdaemon):

import pandas
#-----> code snipped here
#----->
# iterate over record sets returned by SQLAlchemy to populate list
    for result in query_right:
        data = {'sid': result.patient_sid,
                'value_s': result.string_value,
                'value_d': result.double_value,
                'attribute': result.attribute_value}

                result_right.append(data)

    for result in left_child:
        data = {'sid': result.patient_sid,
                'value_s': result.string_value,
                'value_d': result.double_value,
                'attribute': result.attribute_value}

                result_left.append(data)

# convert list of dictionaries to data frames
right = pandas.DataFrame(right_result)
left = pandas.DataFrame(left_result)

# get matches
matches_right  = right[right.sid.isin(left.sid)]
matches_left  = left[left.sid.isin(right.sid)]

# combine matched sets into single set
frames = [matches_right,matches_left]

# concatenate data, drop duplicates and convert back to a list of dictionaries
result = pd.concat(frames).drop_duplicates().to_dict(orient='records')

Worked like a charm!

edited Oct 27 '15 at 16:05

answered Oct 26 '15 at 14:40

horcle_buzz

2,101
3
30
59

1

`drop_duplicates` doesn't work in-place, so that line isn't actually doing anything here. You also don't need the extra call to `pandas.Dataframe` in the "get matches" area. You can just do `pandas.concat(frames).drop_duplicates().to_dict(orient='records')` for all the parts from "concatenate sets" on down. – chthonicdaemon Oct 26 '15 at 17:39
drop_duplicates worked from the python interpreter (I have yet to test it in my code). Other than that, I guess the extra data frames are indeed redundant, otherwise I like the single line. Mahalo! – horcle_buzz Oct 26 '15 at 18:27
1

You would have seen the result, but it is not stored. Few of the pandas functions work in place. – chthonicdaemon Oct 26 '15 at 18:36
1

I see you're reading from a db. You can do it directly with [pandas.read_sql](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html) – chthonicdaemon Oct 28 '15 at 14:44
Nice! Looks like I can do what I need to along the lines of reading `pandas.read_sql(query_right.statement, query_right.session.bind)` into a data frame... this will shave off a few clock ticks from processing time, especially when the returned record set is large, since I won't have to iterate through it to build the list. – horcle_buzz Oct 29 '15 at 02:18

intersection of two lists of dictionaries by key/value pair

4 Answers4