I need to do joins and anti-joins on the attribute 'sid' on large lists of dictionaries of the form:
systolic_sex = [
{'attribute': u'bp', 'value_d': 133.0, 'value_s': u'133', 'sid': 6},
{'attribute': u'bp', 'value_d': 127.0, 'value_s': u'127', 'sid': 17},
{'attribute': u'bp', 'value_d': 121.0, 'value_s': u'121', 'sid': 18},
{'attribute': u'bp', 'value_d': 127.0, 'value_s': u'127', 'sid': 27},
{'attribute': u'bp', 'value_d': 120.0, 'value_s': u'120', 'sid': 42},
{'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 6},
{'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 17},
{'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 18},
{'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 27},
{'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 42}
]
sex = [
{'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 6},
{'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 17},
{'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 42}
]
where the join would produce matches by 'sid' between the two sets of the form
join = [
{'attribute': u'bp', 'value_d': 133.0, 'value_s': u'133', 'sid': 6},
{'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 6},
{'attribute': u'bp', 'value_d': 127.0, 'value_s': u'127', 'sid': 17},
{'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 17},
{'attribute': u'bp', 'value_d': 120.0, 'value_s': u'120', 'sid': 42},
{'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 42}
]
and the anti-join would give everything in the set systolic_sex NOT in sex (as matched on the attribute 'sid'):
anti_join = [
{'attribute': u'bp', 'value_d': 121.0, 'value_s': u'121', 'sid': 18},
{'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 18},
{'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 27}
]
To do this, I used the isin
function of Pandas as follows for the join:
right = pd.DataFrame(sex)
left = pd.DataFrame(systolic_sex)
# get matches
matches_right = right[right.sid.isin(left.sid)]
matches_left = left[left.sid.isin(right.sid)]
# combine matched sets into single set
frames = [matches_right, matches_left]
# convert back to list of dictionaries
test = pd.concat(frames).drop_duplicates().to_dict(orient='records')
and for the anti-join:
right = pd.DataFrame(sex)
left = pd.DataFrame(systolic_sex)
# find complement where items in left not in right
frames = left[~left.sid.isin(right.sid)]
# convert back to list of dictionaries
test = frames.to_dict(orient='records')
This work fine and dandy, but my record sets are huge, and this takes a while to chug through.
I was digging around for ways of optimizing this, and tried using cython and numba, but, when the execution times using numba, cython, and the code without either were roughly the same, I did some digging and found the isnum
uses cython anyway, so that was a dead end.
I played around with merge
and join
from the Pandas library, but the output is not in a format I wanted in terms of column names and thus working with this would be way too futzy.
My next attempt would be to try Numpy indexing, along the lines of this thread: efficiently-select-rows-that-match-one-of-several-values-in-pandas-dataframe, but I am not exactly sure how to implement my particular use case (any pointers would be most welcome).