1

I need to do joins and anti-joins on the attribute 'sid' on large lists of dictionaries of the form:

systolic_sex = [
        {'attribute': u'bp', 'value_d': 133.0, 'value_s': u'133', 'sid': 6}, 
        {'attribute': u'bp', 'value_d': 127.0, 'value_s': u'127', 'sid': 17}, 
        {'attribute': u'bp', 'value_d': 121.0, 'value_s': u'121', 'sid': 18}, 
        {'attribute': u'bp', 'value_d': 127.0, 'value_s': u'127', 'sid': 27}, 
        {'attribute': u'bp', 'value_d': 120.0, 'value_s': u'120', 'sid': 42},
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 6},      
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 17},   
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 18},
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 27},   
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 42}
    ]



sex = [
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 6},      
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 17},   
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 42}
    ]

where the join would produce matches by 'sid' between the two sets of the form

join = [
        {'attribute': u'bp', 'value_d': 133.0, 'value_s': u'133', 'sid': 6}, 
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 6},
        {'attribute': u'bp', 'value_d': 127.0, 'value_s': u'127', 'sid': 17}, 
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 17},
        {'attribute': u'bp', 'value_d': 120.0, 'value_s': u'120', 'sid': 42},
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 42}
    ] 

and the anti-join would give everything in the set systolic_sex NOT in sex (as matched on the attribute 'sid'):

anti_join = [
        {'attribute': u'bp', 'value_d': 121.0, 'value_s': u'121', 'sid': 18}, 
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 18},
        {'attribute': u'SEX', 'value_d': 0.0, 'value_s': u'M', 'sid': 27}
    ]

To do this, I used the isin function of Pandas as follows for the join:

right = pd.DataFrame(sex)
left = pd.DataFrame(systolic_sex)

# get matches
matches_right = right[right.sid.isin(left.sid)]
matches_left = left[left.sid.isin(right.sid)]

# combine matched sets into single set
frames = [matches_right, matches_left]

# convert back to list of dictionaries
test = pd.concat(frames).drop_duplicates().to_dict(orient='records') 

and for the anti-join:

right = pd.DataFrame(sex)
left = pd.DataFrame(systolic_sex)

# find complement where items in left not in right
frames = left[~left.sid.isin(right.sid)]

# convert back to list of dictionaries
test = frames.to_dict(orient='records')

This work fine and dandy, but my record sets are huge, and this takes a while to chug through.

I was digging around for ways of optimizing this, and tried using cython and numba, but, when the execution times using numba, cython, and the code without either were roughly the same, I did some digging and found the isnum uses cython anyway, so that was a dead end.

I played around with merge and join from the Pandas library, but the output is not in a format I wanted in terms of column names and thus working with this would be way too futzy.

My next attempt would be to try Numpy indexing, along the lines of this thread: efficiently-select-rows-that-match-one-of-several-values-in-pandas-dataframe, but I am not exactly sure how to implement my particular use case (any pointers would be most welcome).

Community
  • 1
  • 1
horcle_buzz
  • 2,101
  • 3
  • 30
  • 59
  • If you want you can turn the pandas dataframe into a set. I'm not sure how that would affect speed. a = set(df.itertuples()); b = set(df.itertuples()) then do a b-a and a-b. – Back2Basics Jan 31 '16 at 06:45
  • Thanks, but that's not quite it. As a simple example, for `a = [{'sid':1,'x':1},{'sid':1,'x':2},{'sid':2,'x':2}]` and `b = [{'sid':1,'x':3},{'sid':1,'x':1},{'sid':2,'x':4},{'sid':3,'x':1}]` I get the entire sets when I do the set difference you provided, since ALL attributes do not match. I actually need, `set(a) - set(b)` to give me only records where the attribute `sid` in set(b) is NOT in set(a), or in this example, `[{'sid':3,'x':1}]`, since `'sid':3` not found in `set(a)`. I don't care about the value of `x` when doing the match, I ONLY care about `sid`. – horcle_buzz Feb 01 '16 at 02:15
  • @horcle_buzz according to your previous comments and the suggestion from @Back2Basics, you just have to make a set on the `sid` value. `r_sid = set(right.sid); l_sid = set(left.sid)` then use it like you do : `left_match = left[left.sid.isin(r_sid)]` ? I'm not sure if there will be a speed improvement.. ? In the same way that what you are doing you can also use `sid` as the index `left.set_index('sid', inplace=True)` an do something like `left.ix[left.index.isin(r_sid)]` (there could be some speed-up using the `ix()` method of the DataFrame). – mgc Feb 01 '16 at 21:36
  • Thanks for the response. I did figure out how to use Numpy for pre-masking as per the link above. It really made no difference in processing times. I will post my benchmark test data later after I solve my more immediate issue. In a nutshell, both methods did joins between data frames with up to 6000 rows in a mere few milliseconds. I'll try using the `ix()` method later too and let you know the results along with the benchmark results, but I think I am pretty happy with how I was doing this in the first place (now that I know why my code is running so slow, that is). – horcle_buzz Feb 01 '16 at 23:44

0 Answers0