
Hi, I am working with Python 3 and I've been facing this issue for a while now; I can't seem to figure it out.

I have two NumPy arrays containing strings:

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])

If you notice, array_one actually contains the 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams of the sentence alice in a wonder land.

I have purposely split wonderland into two words, wonder and land.

Now I have another numpy array that contains some locations and names.

array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

What I want to do is get all the elements of array_one that exist in array_two.

If I take an intersection of the two arrays using np.intersect1d, I don't get any matches, since wonderland is two separate words in array_one while in array_two it's a single word.
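For reference, a minimal reproduction of the empty intersection (a trimmed-down version of the arrays above):

```python
import numpy as np

array_one = np.array(['wonder land', 'alice'])
array_two = np.array(['wonderland', 'florida'])

# exact string matching finds nothing: 'wonder land' != 'wonderland'
print(np.intersect1d(array_one, array_two))  # -> []
```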

Is there any way to do this? I've tried solutions from Stack Overflow (this) but they don't seem to work with Python 3.

array_one would have at most 60-100 items, while array_two would have at most roughly 1 million items, with an average of 250,000-500,000 items.


Edit

I've used a very naive approach, since I wasn't able to find a solution until now: I removed the whitespace from the strings in both arrays and then used the resulting boolean array (e.g. [True, False, True]) to filter the original array. Below is the code:

import numpy.core.defchararray as np_f
import numpy as np


# strip whitespace so 'wonder land' and 'wonderland' compare equal
array_two_wr = np_f.replace(array_two, ' ', '')
array_one_wr = np_f.replace(array_one, ' ', '')
# keep the array_two entries whose stripped form appears in array_one
intersections = array_two[np.in1d(array_two_wr, array_one_wr)]

But I am not sure this is the way to go, considering the number of elements in array_two.
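For comparison, the same whitespace-stripped lookup can be done with a plain Python set, which gives average O(1) membership tests per element (a sketch using shortened versions of the arrays above):

```python
import numpy as np

array_one = np.array(['alice', 'wonder land', 'a wonder land'])
array_two = np.array(['new york', 'wonderland', 'florida'])

# build the lookup set once from the smaller array
targets = {s.replace(' ', '') for s in array_one}
# keep the array_two entries whose stripped form matches
intersections = [s for s in array_two if s.replace(' ', '') in targets]
print(intersections)  # -> ['wonderland']
```

Since array_one is tiny (60-100 items), the set is cheap to build, and the scan over array_two is a single linear pass.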

iam.Carrot
  • Can you try using the Levenshtein distance? https://en.wikipedia.org/wiki/Levenshtein_distance – Espoir Murhabazi Feb 21 '18 at 18:49
  • @EspoirMurhabazi I thought of `Levenshtein distance` and `cosine string matching`, but the first issue is how to implement them without two `for` loops. Second, I need something that handles whitespace, since a Levenshtein distance of 1 would match `block A` and `block B` as the same, while `cosine` would match them at `0.90`. – iam.Carrot Feb 21 '18 at 18:53
  • Maybe you can use locality-sensitive hashing as discussed in [this SO question](https://stackoverflow.com/questions/25114338/approximate-string-matching-using-lsh) – ktdrv Feb 21 '18 at 19:33

2 Answers


Minhashing could definitely be used here. Here's the very general idea behind minhashing: for each object in a list, hash the object multiple times, and update an object that keeps track of the hashes computed for each list member. Then examine the set of resulting hashes, and for each, find all objects for which that hash was computed (we just stored this data). The objects for which the same hash was computed will be very similar if the hashing function is chosen carefully.

For a more detailed explanation of minhashing, see Chapter 3 of Mining Massive Datasets.

Here's a sample Python 3 implementation using your data and datasketch (pip install datasketch), which computes the hashes:

import numpy as np
from datasketch import MinHash, MinHashLSH
from nltk import ngrams

def build_minhash(s):
  '''Given a string `s` build and return a minhash for that string'''
  new_minhash = MinHash(num_perm=256)
  # hash each 3-character gram in `s`
  for chargram in ngrams(s, 3):
    new_minhash.update(''.join(chargram).encode('utf8'))
  return new_minhash

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# create a structure that lets us query for similar minhashes
lsh = MinHashLSH(threshold=0.3, num_perm=256)

# loop over the index and value of each member in array two
for idx, i in enumerate(array_two):
  # add the minhash to the lsh index
  lsh.insert(idx, build_minhash(i))

# find the items in array_one with 1+ matches in array_two
for i in array_one:
  result = lsh.query(build_minhash(i))
  if result:
    matches = ', '.join([array_two[j] for j in result])
    print(' *', i, '--', matches)

Results (array_one member on left, array_two match on right):

 * wonder -- wonderland
 * a wonder -- wonderland
 * wonder land -- wonderland
 * a wonder land -- wonderland
 * in a wonder land -- wonderland
 * alice in a wonder land -- wonderland

The easiest way to tune for precision/recall here is to change the threshold argument to MinHashLSH. You can also try modifying the hashing technique itself. Here I built each minhash from the string's 3-character grams, a technique Yale's Digital Humanities Lab found tremendously powerful at capturing text similarity: https://github.com/YaleDHLab/intertext
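To see why threshold=0.3 catches the wonderland matches, you can compute the exact Jaccard similarity that MinHash approximates (a stdlib-only sketch over character 3-grams):

```python
def char_ngrams(s, n=3):
  '''Return the set of character n-grams of `s`'''
  return {s[i:i+n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
  '''Exact Jaccard similarity of the two n-gram sets'''
  A, B = char_ngrams(a), char_ngrams(b)
  return len(A & B) / len(A | B)

# 6 shared trigrams out of 11 total
print(round(jaccard('wonder land', 'wonderland'), 3))  # -> 0.545
```

Pairs whose true Jaccard similarity falls near or below the threshold become increasingly likely to be missed by the LSH index, so lowering threshold trades precision for recall.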

duhaime
  • I am having some trouble with this code (yes, given the problem statement, this seems the more viable solution). The issue is, array_one can change based on the input, but array_two remains constant for all cases. I am having trouble creating that combined `numpy.array`, since I was thinking I'd process array_two into an LSH object and store it as a pickle – iam.Carrot Dec 19 '18 at 17:33
  • @iam.Carrot I just updated the above so `array_two` is static and `array_one` can be dynamic. When `array_one` updates, just build the minhash for the new element(s), then query the LSH index, which now only contains `array_two` elements. Does that make sense? – duhaime Dec 19 '18 at 17:48
  • ahh! I almost had the same code (tiny bit of difference). I'll quickly test it out. – iam.Carrot Dec 19 '18 at 17:53
  • it works just fine. Thank you for that snippet; I had put `array_one` instead of `array_two` by mistake in `join([array_two[j] for j in result])`. I have two final questions: 1. how much can the MinHash hold? Can it hold 1 million or even 10 million, say? And 2. if I have to apply cosine, Levenshtein, or even Jaro-Winkler, does LSH have a provision for them, or do I post-process? – iam.Carrot Dec 19 '18 at 18:19
  • @iam.Carrot the max size of the LSH index is really a function of your server's RAM. You can also build an LSH index on disk (though you'll need to learn how the LSH index works -- it's pretty straightforward -- and implement it on your own), which can scale to the size of your disk. As for distance measures, the minhash can't give you those; you'll need to run those measures on the results returned by the LSH index. This is exactly what I've done in LSH-related work (I use difflib.SequenceMatcher and a disk-based LSH). Good luck! – duhaime Dec 19 '18 at 18:29
  • perfect. Thank you so much for taking the time to help me out. I really appreciate it. – iam.Carrot Dec 19 '18 at 18:31
  • This is a great answer! Trying to figure out if there's a way to grab the *index* that the value was found at, in addition to the found value? Any thoughts? – brandonscript Sep 16 '22 at 14:30

Sorry to post two answers, but after adding the locality-sensitive-hashing technique above, I realized you could exploit the class separation in your data (query vectors and potential matching vectors) by using a bloom filter.

A bloom filter is a beautiful object that lets you pass in some objects, then query to see whether a given object has been added to the bloom filter. Here's an awesome visual demo of a bloom filter.
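To make the idea concrete, here's a toy stdlib-only bloom filter (an illustration of the concept, not how the bloom_filter package is actually implemented): adding an item sets a few bit positions derived from hashes of the item, and a membership check tests whether all of those bits are set.

```python
import hashlib

class ToyBloom:
  '''A minimal bloom filter: k hash functions over a fixed-size bit array'''
  def __init__(self, size=1024, hashes=3):
    self.size, self.hashes = size, hashes
    self.bits = bytearray(size)

  def _positions(self, item):
    # derive `hashes` bit positions by salting a sha256 digest
    for i in range(self.hashes):
      h = hashlib.sha256(f'{i}:{item}'.encode()).digest()
      yield int.from_bytes(h[:8], 'big') % self.size

  def add(self, item):
    for p in self._positions(item):
      self.bits[p] = 1

  def __contains__(self, item):
    return all(self.bits[p] for p in self._positions(item))

bloom = ToyBloom()
bloom.add('wonderland')
print('wonderland' in bloom)  # -> True (added items are always found)
print('florida' in bloom)     # almost certainly False; false positives are possible, false negatives are not
```

The error_rate parameter of the real library controls the same trade-off: more bits and more hash functions lower the false-positive probability.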

In your case we can add each member of array_two to the bloom filter, then query each member of array_one to see whether it's in the bloom filter. Using pip install bloom-filter:

from bloom_filter import BloomFilter # pip install bloom-filter
import numpy as np
import re

def clean(s):
  '''Clean a string'''
  return re.sub(r'\s+', '', s)

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# initialize bloom filter with particular size
bloom = BloomFilter(max_elements=10000, error_rate=0.1)
# add each member of array_two to bloom filter
for i in array_two:
  bloom.add(clean(i))
# find the members in array_one in array_two
matches = [i for i in array_one if clean(i) in bloom]
print(matches)

Result: ['wonder land']

Depending on your requirements, this could be a very efficient (and highly-scalable) solution.

duhaime
  • you've set a `max_elements=10000` is there a significance of it? Can I set it to `1 million`? – iam.Carrot Feb 25 '18 at 15:34
  • Yes, you should set the `max_elements` param to the number of elements you intend to pass in (you may need decent hardware to do it all in main memory). The bloom filter is simpler but much less intelligent than LSH -- looking at your data will help you decide what's best... – duhaime Feb 25 '18 at 16:16
  • Great. I have one last query: how does the `BloomFilter` handle duplicates? I am sorry if my comments are a little naïve, since I just got introduced to these methods. Also, why doesn't it match anything when the strings are shorter? For example, it doesn't match `a c c road` to `acc road`, and when I put in `acc road` it still doesn't match; but when I run it a few times, it starts to match all the mentioned cases. Is there something I am missing? – iam.Carrot Feb 25 '18 at 16:55
  • @iam.Carrot check out that visual demo I linked above. Your `acc road` example works as expected on my machine (ie match is found). Duplicate entries should have no effect on a bloom filter. However, a bloom filter won't handle substring matching -- that's what more abstract methods like LSH are for. Bloom filters are fast if you only need to find raw set intersections, but if you need something fuzzier LSH techniques are more flexible. I hope that helps! – duhaime Feb 25 '18 at 17:30
  • no, yeah, it does match, but I had to run the code a few times for it to match. I thought I was missing something, maybe something trivial. Nevertheless, you've been of great help. Thanks very much. – iam.Carrot Feb 25 '18 at 17:43