Hi, I am working with Python 3, and I've been facing this issue for a while now; I can't seem to figure it out.
I have two numpy arrays containing strings:
array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])
If you notice, array_one is actually an array containing the 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams of the sentence alice in a wonder land. I have purposefully taken wonderland as two words, wonder and land.
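For reference, array_one is just every contiguous 1- to 5-word n-gram of that sentence; a minimal sketch that would build the same list (not necessarily how I actually produced it):

import numpy as np

# build every contiguous n-gram (n = 1..5) of the sentence
sentence = 'alice in a wonder land'
words = sentence.split()
ngrams = [' '.join(words[i:i + n])
          for n in range(1, len(words) + 1)
          for i in range(len(words) - n + 1)]
array_one = np.array(ngrams)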
Now I have another numpy array that contains some locations and names:
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])
Now what I want to do is get all the elements in array_one that also exist in array_two. If I take the intersection of the two arrays using np.intersect1d, I don't get any matches, since wonderland is two separate words in array_one while in array_two it's a single word.
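For example, with the two arrays above, the direct intersection comes back empty:

# no exact string matches: 'wonder land' != 'wonderland'
print(np.intersect1d(array_one, array_two))   # -> [] (empty array)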
Is there any way to do this? I've tried solutions from Stack Overflow (this) but they don't seem to work with Python 3.
array_one would at most have 60-100 items, while array_two would at most have roughly 1 million items, with an average of 250,000 - 500,000 items.
Edit
I've used a very naive approach, since I wasn't able to find a solution until now: I removed the whitespace from both arrays, and then used the resulting boolean array (e.g. [True, False, True]) to filter the original array. Below is the code:
import numpy.core.defchararray as np_f
import numpy as np

# strip the whitespace from every element of both arrays
array_two_wr = np_f.replace(array_two, ' ', '')
array_one_wr = np_f.replace(array_one, ' ', '')

# keep the elements of array_two whose whitespace-free form
# appears in the whitespace-free version of array_one
intersections = array_two[np.in1d(array_two_wr, array_one_wr)]
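With the example arrays above this gives the match I was hoping for:

print(intersections)   # -> ['wonderland']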
But I am not sure this is the way to go, considering the number of elements in array_two.