
I have a very large data set collected from Twitter. I am trying to figure out how to do the equivalent of the Python filtering below in NumPy. The environment is the Python interpreter:

>>> tweets = [['buhari si good'], ['atiku is great'], ['buhari nfd sdfa atiku'],
...           ['is nice man that buhari']]
>>> filter(lambda x: 'buhari' in x[0].lower(), tweets)
[['buhari si good'], ['buhari nfd sdfa atiku'], ['is nice man that buhari']]
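
Note: the list output above is what Python 2's filter returns; on Python 3, filter is lazy, so you would materialize it with list():

>>> list(filter(lambda x: 'buhari' in x[0].lower(), tweets))
[['buhari si good'], ['buhari nfd sdfa atiku'], ['is nice man that buhari']]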

I tried boolean indexing as below, but the array came back empty:

>>> tweet_arr = np.array([['buhari si good'], ['atiku is great'], ['buhari nfd sdfa atiku'], ['is nice man that buhari']])
>>> flat_tweets = tweet_arr[:, 0]
>>> flat_tweets
array(['buhari si good', 'atiku is great', 'buhari nfd sdfa atiku',
       'is nice man that buhari'], dtype='|S23')
>>> flat_tweets['buhari' in flat_tweets]
array([], shape=(0, 4), dtype='|S23')

I would like to know how to filter strings in a NumPy array, the way I was easily able to filter even numbers here:

>>> arr = np.arange(15).reshape((15, 1))
>>> arr
array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11],
       [12],
       [13],
       [14]])
>>> arr[:][arr % 2 == 0]
array([ 0,  2,  4,  6,  8, 10, 12, 14])

Thanks

  • The exact same approach as in your first solution for lists works for NumPy arrays; is that not enough? – fuglede Jul 31 '18 at 17:42
  • @fuglede I am looking for the fastest approach in numpy, since I am working with a large database of tweets that are being streamed/updated daily. Would you say my first solution is the fastest approach? – Adeyinka Adegbenro Jul 31 '18 at 17:46
  • I wouldn't be surprised if it were actually faster than what I suggest below at least, but you can time it on your data set to see what works (using e.g. [IPython's %timeit magic](https://stackoverflow.com/questions/29280470/what-is-timeit-in-python#29280612)). – fuglede Jul 31 '18 at 17:52
  • `np.char` has functions that apply string methods to elements of arrays, but they aren't much faster than list comprehensions. For the most part string dtype arrays don't provide faster processing than lists. – hpaulj Jul 31 '18 at 18:02
  • @Adeyinka if you are working with strings, it is probably faster to just use vanilla Python, so the list-comprehension approach would be ideal. `numpy` does not optimize for strings, it targets fast **num**erical operations – juanpa.arrivillaga Jul 31 '18 at 18:03

1 Answer


If you want to stick to a solution based entirely on NumPy, you could do

from numpy.core.defchararray import find, lower

# find() returns -1 wherever the substring is absent, so this keeps matching rows
tweet_arr[find(lower(tweet_arr), 'buhari') != -1]
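
These are the same functions exposed under the `np.char` namespace, which is the more usual way to reach them; a minimal sketch, assuming a flat 1-D array of strings:

import numpy as np

flat_tweets = np.array(['buhari si good', 'atiku is great',
                        'buhari nfd sdfa atiku', 'is nice man that buhari'])

# np.char.lower / np.char.find apply str.lower / str.find element-wise
mask = np.char.find(np.char.lower(flat_tweets), 'buhari') != -1
flat_tweets[mask]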

You mention in a comment that what you're looking for here is performance, so it should be noted that this appears to be a good deal slower than the solution you came up with yourself:

In [33]: large_arr = np.repeat(tweet_arr, 10000)

In [36]: %timeit large_arr[find(lower(large_arr), 'buhari') != -1]
54.6 ms ± 765 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [43]: %timeit list(filter(lambda x: 'buhari' in x.lower(), large_arr))
21.2 ms ± 219 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In fact, an ordinary list comprehension beats both approaches:

In [44]: %timeit [x for x in large_arr if 'buhari' in x.lower()]
18.5 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
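
If you want to keep the result as a NumPy array rather than a list, a reasonable middle ground (just a sketch, not timed here) is to build the boolean mask with the comprehension and use it for indexing:

# Python-level string test, NumPy-level indexing
mask = np.array(['buhari' in x.lower() for x in large_arr])
filtered = large_arr[mask]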
fuglede