
I am trying to identify whether values held in two different numpy OrderedDict objects are the same.

Both dictionaries were created using the fetchallnumpy() option in turbodbc and consist of two keys. The first key is an ID field and the second key holds string values of variable length. I want to see whether each string value in the first set of dictionary items is present in the second set of dictionary items.

It's probably worth noting that both dictionary objects are holding approximately 60 million values under each key.

I've tried several things so far:-

  1. np.isin(dict1[str_col], dict2[str_col])

    This was extremely slow, presumably because the string values are stored as dtype object.

  2. I've tried converting both dictionary objects to numpy arrays with an explicit string type, as in np.asarray(dict1[str_col], dtype='S500'), and then using the isin and in1d functions, at which point the system runs out of RAM. I've swapped 'S500' for dtype=np.string_ but still get a MemoryError at ar = np.concatenate((ar1, ar2)) while performing the isin function (see the sketch after this list).

  3. I also tried a list comprehension:

    [r in dict2[str_col] for r in dict1[str_col]]

    Again, this was extremely slow.
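For reference, a minimal sketch of the second attempt (assuming dict1 and dict2 are the fetchallnumpy() results and str_col is the key of the string column):

import numpy as np

# Convert the object-dtype string columns to fixed-width byte strings.
ar1 = np.asarray(dict1[str_col], dtype='S500')
ar2 = np.asarray(dict2[str_col], dtype='S500')

# At ~60 million rows of 500 bytes each, every array is around 30 GB;
# np.isin concatenates both arrays internally, hence the MemoryError.
mask = np.isin(ar1, ar2)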

My aim is to have a relatively quick way of testing the two string columns without running out of memory.

Additional bits: In the long run I'll be running more than one check, as I'm trying to identify new values and values that have changed.

Dictionary A = Current Data: ['ID': [int, int, int]]
Dictionary B = Historic Data: ['record': [str, str, str]]

So the bits I'm interested in are:-

  • A != B (current record is different to historic record)
  • A not present in B (New record added to the database)
  • B not present in A (Records need to be redacted)

For the last two elements, the quickest way I've found so far has been to pass the ID columns to a function that contains np.isin(arr1, arr2). It takes on average 15 seconds to compare the data.
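A minimal sketch of that ID check (the function name and the dictA/dictB variables are illustrative, not my exact code):

import numpy as np

def compare_ids(current_ids, historic_ids):
    # A not present in B: new records added to the database.
    new_mask = ~np.isin(current_ids, historic_ids)
    # B not present in A: records that need to be redacted.
    redacted_mask = ~np.isin(historic_ids, current_ids)
    return new_mask, redacted_mask

new_mask, redacted_mask = compare_ids(dictA['ID'], dictB['ID'])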

  • Please provide a [mcve]. What, **exactly**, is the structure of the dict in question? – juanpa.arrivillaga Apr 24 '19 at 15:24
  • I'm also a bit confused about your question. You write that the "second key is a string of variable length", so if `dict1[str_col]` and `dict2[str_col]` both were strings, you would simply test for string equality: `str1 == str2`. However, your approaches make sense only if you're dealing with nested objects (that is, if `dict1[str_col]` is another dictionary or a list, or some sequence other than a string). In that case, you may want to have a look at this [post](https://stackoverflow.com/questions/7828867). – normanius Apr 24 '19 at 15:37
  • `numpy OrderedDict` doesn't mean much to me. 1) looks just like you are comparing the values from two dictionaries, where the values are some sort of numpy array. But if so, what shape and dtype? – hpaulj Apr 24 '19 at 16:16
  • So the fetchallnumpy() function in turbodbc creates a dictionary object from the query sent to the database. – James Lockyer Apr 24 '19 at 17:02
  • The structure is as follows:- [Key]: [list of values]. The number of values is related to the number of rows in the database, and the key is essentially the column heading from the database/query. In my query the ID column is dtype int64 and the text column is returned with a dtype of object. The np.isin() works very well on integer data but is cumbersome when the dtype is object. – James Lockyer Apr 24 '19 at 17:14
  • The information you provided is not consistent yet. First, I think you should reformulate the question to "what is the fastest way to compare two string lists". In that case you still need to specify more details: Does the order play a role? Are the lists sorted, or do the entries appear in a fixed order? The organization of the data has a big impact on the best choice of algorithm. – normanius Apr 24 '19 at 19:20
  • Just don't use a `numpy.ndarray` to check these sorts of things. Use a `set`, if you are working with `str` objects. – juanpa.arrivillaga Apr 24 '19 at 19:22
  • Another comment: In your sample code, you check only if the entries of `dict1[str_col]` appear in `dict2[str_col]`, but you don't test the converse. In other words, the question does not match what you actually tried. With your two approaches (`np.isin`, list comprehension), you test only if one list is a subset of the other, but you don't check for equality. You may want to fix this inconsistency to get a useful answer. – normanius Apr 24 '19 at 19:28
  • Firstly thanks for the help, and secondly apologies for my lack of coherence and general newbie-ness! I'll try and add some more detail to the original post. – James Lockyer Apr 24 '19 at 20:13

2 Answers


You can use np.searchsorted for faster searches:

import numpy as np

def all_in(ar1, ar2):
    # True if every element of ar1 is present in ar2.
    sorter = np.argsort(ar2)
    idx = np.searchsorted(ar2, ar1, sorter=sorter)
    # An insertion point past the end means a value larger than anything in ar2.
    if idx.max() >= len(ar2):
        return False
    # A value is present only if it equals the element at its insertion point.
    return np.all(ar1 == ar2[sorter[idx]])
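Note that this tests only one direction (whether every value of ar1 appears in ar2); as pointed out in the comments, you would call it both ways round to compare the two columns as sets:

a_subset_b = all_in(dict1[str_col], dict2[str_col])
b_subset_a = all_in(dict2[str_col], dict1[str_col])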
jdehesa

Still not entirely clear what you're trying to achieve (see my comments). But here's my shot.

Pandas may offer a more efficient way to compare string lists. I haven't tested this myself for large chunks of data.

Try the following:

import pandas as pd
s1 = pd.Series(dict1[str_col])
s2 = pd.Series(dict2[str_col])
print(s1.isin(s2).all())

Or, if you need to iterate over all columns anyway, you could convert the complete dictionaries into DataFrames:

df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
for col in df1:
    print(df1[col].isin(df2[col]).all())

If you want to test for equality of the complete DataFrames, you could use pandas' assert_frame_equal. For example:

pd.util.testing.assert_frame_equal(df1, df2)
# ...or if the ordering is not the same:
pd.util.testing.assert_frame_equal(df1, df2, check_like=True)

Apparently, it is also possible to dump turbodbc results into a pandas object directly (to_pandas()). See the turbodbc documentation under advanced usage.
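A rough sketch of that route (the DSN, table, and column names below are placeholders, and fetchallarrow() requires turbodbc to be built with Apache Arrow support):

import turbodbc

connection = turbodbc.connect(dsn='my_dsn')  # placeholder connection
cursor = connection.cursor()
cursor.execute('SELECT id, record FROM my_table')  # placeholder query

# fetchallarrow() returns a pyarrow.Table, which converts to a DataFrame.
df = cursor.fetchallarrow().to_pandas()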

normanius
  • I shifted to "pure" numpy as I was unable to hold the dataframes in RAM. Will try the series option as I imagine that has a smaller memory footprint. The post you linked to earlier was great and has highlighted some other options. – James Lockyer Apr 24 '19 at 20:42
  • Went with dataframes, then used a left merge and filtered the results down. It seems to take 3 minutes to run on my system. – James Lockyer Apr 30 '19 at 12:49
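For later readers, a sketch of what that left-merge approach might look like (the column names, suffixes, and indicator logic are assumptions, not the poster's exact code):

import pandas as pd

df1 = pd.DataFrame(dict1)  # current data
df2 = pd.DataFrame(dict2)  # historic data

# indicator=True adds a '_merge' column marking where each row was found.
merged = df1.merge(df2, on='ID', how='left', indicator=True,
                   suffixes=('_cur', '_hist'))

new_records = merged[merged['_merge'] == 'left_only']  # A not present in B
changed = merged[(merged['_merge'] == 'both') &
                 (merged['record_cur'] != merged['record_hist'])]  # A != B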