I am trying to identify whether string values held in two `OrderedDict` objects of NumPy arrays are the same. Both dictionaries were created with turbodbc's `fetchallnumpy()` method and each consists of two keys: the first key is an ID field, and the second holds string values of variable length. I want to see whether the string values in the first dictionary are present in the second. It's probably worth noting that both dictionary objects hold approximately 60 million values under each key.
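For context, here is a minimal sketch of the assumed shape of the data; the key names `"id"` and `"value"` are placeholders for illustration, and the toy arrays stand in for the ~60 million real entries:

```python
from collections import OrderedDict
import numpy as np

# Toy stand-ins for the turbodbc fetchallnumpy() results.
# Key names are hypothetical; strings come back as dtype=object.
dict1 = OrderedDict([
    ("id", np.array([1, 2, 3])),
    ("value", np.array(["alpha", "beta", "gamma"], dtype=object)),
])
dict2 = OrderedDict([
    ("id", np.array([2, 3, 4])),
    ("value", np.array(["beta", "gamma", "delta"], dtype=object)),
])
```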
I've tried several things so far:

1. `np.isin(dict1[str_col], dict2[str_col])` as a function, but this was extremely slow, presumably because the string values are stored as dtype `object`.
2. Converting both dictionary values to NumPy arrays with an explicit string type, `np.asarray(dict1[str_col], dtype='S500')`, and then using the `isin` and `in1d` functions. At that point the system runs out of RAM. I have swapped out `'S500'` for `dtype=np.string_` but still get a `MemoryError` (at `ar = np.concatenate((ar1, ar2))`) while `isin` is running.
3. A for loop: `[r in dict2[str_col] for r in dict1[str_col]]`. Again, this was extremely slow.
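On a small toy example (array contents invented for illustration), the attempts above look like this; at 60 million rows the first and third are slow, and the fixed-width cast exhausts RAM because every element is padded to the full width:

```python
import numpy as np

str_col = "value"
dict1 = {"value": np.array(["alpha", "beta", "gamma"], dtype=object)}
dict2 = {"value": np.array(["beta", "gamma", "delta"], dtype=object)}

# Attempt 1: isin directly on the object-dtype arrays (slow at scale)
mask = np.isin(dict1[str_col], dict2[str_col])

# Attempt 2: cast to fixed-width bytes first, then isin
# (MemoryError at scale, since each element is padded to 500 bytes)
a1 = np.asarray(dict1[str_col], dtype="S500")
a2 = np.asarray(dict2[str_col], dtype="S500")
mask2 = np.isin(a1, a2)

# Attempt 3: plain Python membership test per element (slow at scale)
mask3 = [r in dict2[str_col] for r in dict1[str_col]]
```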
My aim is to have a relatively quick way of testing the two string columns without running out of memory.
**Additional bits:** In the long run I'll be running more than one check, as I'm trying to identify new values and values that have changed.

Dictionary A = Current Data `['ID': [int, int, int]]`
Dictionary B = Historic Data `['record': [str, str, str]]`
So the bits I'm interested in are:
- A != B (current record is different to historic record)
- A not present in B (New record added to the database)
- B not present in A (Records need to be redacted)
For the last two checks, the quickest way I've found so far has been to pass the ID columns to a function that contains `np.isin(arr1, arr2)`. It takes on average 15 seconds to compare the data.
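A sketch of those two "presence" checks via the ID columns, as described above (variable names are placeholders and the toy arrays are invented for illustration):

```python
import numpy as np

current_ids = np.array([1, 2, 3, 5])   # Dictionary A ("current")
historic_ids = np.array([2, 3, 4])     # Dictionary B ("historic")

# A not present in B: new records added to the database
new_ids = current_ids[~np.isin(current_ids, historic_ids)]

# B not present in A: records that need to be redacted
redact_ids = historic_ids[~np.isin(historic_ids, current_ids)]
```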