
I want to count the number of two-character vowel permutations contained within a list of 5-letter words. Vowel permutations are pairs like 'aa', 'ae', 'ai', ..., 'ui', 'uo', 'uu'.

I successfully got this done using apply(), but it is slow. I want to see if there is a fast, vectorized way to do it. I can't think of one.

Here's what I did:

import pandas as pd
import itertools

vowels = list('aeiou')
# all 25 two-character vowel pairs: 'aa', 'ae', ..., 'uu'
vowel_perm = [x[0] + x[1] for x in itertools.product(vowels, vowels)]

def wide_contains(x):
    # one boolean per vowel pair: does the word contain it?
    return pd.Series(data=[c in x for c in vowel_perm], index=vowel_perm)

dfwd['word'].apply(wide_contains).sum()

aa     1
ae     2
ai    12
ao     2
au     8
ea    15
ee    15
ei     1
eo     5
eu     2
ia     7
ie    10
ii     0
io     3
iu     0
oa     2
oe     2
oi     3
oo    11
ou     7
ua     2
ue     9
ui     2
uo     0
uu     0

The above is the expected output for the following data:

word_lst = ['gaize', 'musie', 'dauts', 'orgue', 'tough', 'medio', 'roars', 'leath', 'quire', 'kaons', 'iatry', 'tuath', 'tarea', 'hairs', 'sloid', 
'beode', 'fours', 'belie', 'qaids', 'cobia', 'cokie', 'wreat', 'spoom', 'soaps', 'usque', 'frees', 'rials', 'youve', 'dreed', 'feute', 
'saugh', 'esque', 'revue', 'noels', 'seism', 'sneer', 'geode', 'vicua', 'maids', 'fiord', 'bread', 'squet', 'goers', 'sneap', 'teuch', 
'arcae', 'roosa', 'spues', 'could', 'tweeg', 'coiny', 'cread', 'airns', 'gauds', 'aview', 'mudee', 'vario', 'spaid', 'pooka', 'bauge', 
'beano', 'snies', 'boose', 'holia', 'doums', 'goopy', 'feaze', 'kneel', 'gains', 'acoin', 'crood', 'juise', 'gluey', 'zowie', 'biali', 
'leads', 'twaes', 'fogie', 'wreak', 'keech', 'bairn', 'spies', 'ghoom', 'foody', 'jails', 'waird', 'iambs', 'woold', 'belue', 'bisie', 
'hauls', 'deans', 'eaten', 'aurar', 'anour', 'utees', 'sayee', 'droob', 'gagee', 'roleo', 'burao', 'tains', 'daubs', 'geeky', 'civie', 
'scoop', 'sidia', 'tuque', 'fairy', 'taata', 'eater', 'beele', 'obeah', 'feeds', 'feods', 'absee', 'meous', 'cream', 'beefy', 'nauch']

dfwd = pd.DataFrame(word_lst, columns=['word'])
– jch

5 Answers


Well, if skipping Pandas for this computation altogether is alright, it looks like a plain old Counter() is about 220x faster on my machine for the given data.

from collections import Counter
import timeit


def timetest(func, name=None):
    name = name or getattr(func, "__name__", None)
    iters, time = timeit.Timer(func).autorange()
    iters_per_sec = iters / time
    print(f"{name=} {iters=} {time=:.3f} {iters_per_sec=:.2f}")


def counter():
    ctr = Counter()
    for word in dfwd['word']:
        for perm in vowel_perm:
            if perm in word:
                ctr[perm] += 1
    return ctr


# the OP's apply-based approach, wrapped as a function for timing
def original():
    return dfwd['word'].apply(wide_contains).sum()


timetest(original)
timetest(counter)
print(counter())

outputs

name='original' iters=10 time=0.229 iters_per_sec=43.59
name='counter' iters=2000 time=0.212 iters_per_sec=9434.29
Counter({'ea': 15, 'ee': 15, 'ai': 12, 'oo': 11, 'ie': 10, 'ue': 9, 'au': 8, 'ou': 7, 'ia': 7, 'eo': 5, 'io': 3, 'oi': 3, 'oa': 2, 'ui': 2, 'ao': 2, 'ua': 2, 'eu': 2, 'oe': 2, 'ae': 2, 'ei': 1, 'aa': 1})
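If you want the result back in the same shape as the question's Series, zero-count pairs included, the Counter converts cleanly. A minimal sketch, assuming `pd` and `vowel_perm` from the question are in scope:

# Series indexed by all 25 pairs; pairs the Counter never saw become 0
pd.Series(counter()).reindex(vowel_perm, fill_value=0)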
– AKX
  • Thanks. Great answer. Couldn't find `from itertools import Counter` but `from collections import Counter` did work. – jch Apr 12 '22 at 23:15
  • Oop, yeah, typo :) – AKX Apr 13 '22 at 05:05
  • It is kind of refreshing to see a good-old nested loop come in as the fastest solution (by my testing on my machine). – jch Apr 13 '22 at 20:22

How about a dictionary comprehension? It should be faster than using apply:

{v: dfwd['word'].str.count(v).sum() for v in vowel_perm}
# 6.9 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

{'aa': 1,
 'ae': 2,
 'ai': 12,
 'ao': 2,
 'au': 8,
 'ea': 15,
 'ee': 15,
 'ei': 1,
 'eo': 5,
 'eu': 2,
 'ia': 7,
 'ie': 10,
 'ii': 0,
 'io': 3,
 'iu': 0,
 'oa': 2,
 'oe': 2,
 'oi': 3,
 'oo': 11,
 'ou': 7,
 'ua': 2,
 'ue': 9,
 'ui': 2,
 'uo': 0,
 'uu': 0}
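One caveat: `str.count` counts occurrences rather than testing membership, so a word containing the same pair twice would contribute 2 instead of 1. The results happen to coincide for this dataset; to mirror the question's `c in x` semantics exactly, `str.contains` can be swapped in:

# boolean membership test per word, summed over the column
{v: dfwd['word'].str.contains(v).sum() for v in vowel_perm}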
– It_is_Chris

Another option is to simply iterate over the vowel pairs and count the number of occurrences of each pair in word_lst. Note that for the current task you don't need to create the explicit vowel_perm list either; simply iterate over the map object:

out = pd.Series({pair: sum(True for w in word_lst if pair in w) 
                 for pair in map(''.join, itertools.product(vowels,vowels))})

On my machine, a benchmark says:

>>> %timeit out = pd.Series({pair: sum(True for w in word_lst if pair in w) for pair in map(''.join, itertools.product(vowels,vowels))})
492 µs ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit vowel_perm = [x[0]+x[1] for x in itertools.product(vowels,vowels)]; out = dfwd['word'].apply(wide_contains).sum()
40.6 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
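Since bool is a subclass of int in Python, the membership tests can also be summed directly, dropping the `True for ... if ...` construction; an equivalent, slightly more idiomatic spelling:

out = pd.Series({pair: sum(pair in w for w in word_lst)
                 for pair in map(''.join, itertools.product(vowels, vowels))})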
  • This seems to be about half as fast as my Counter-based answer for the given dataset (on my machine). – AKX Apr 13 '22 at 05:05

You can use numpy.char.find to search an array of strings for a substring:

from itertools import product

import numpy as np
import pandas as pd

word_lst = np.array([
    'gaize', 'musie', 'dauts', 'orgue', 'tough', 'medio', 'roars', 'leath', 'quire', 'kaons', 'iatry', 'tuath', 'tarea', 'hairs', 'sloid', 
    'beode', 'fours', 'belie', 'qaids', 'cobia', 'cokie', 'wreat', 'spoom', 'soaps', 'usque', 'frees', 'rials', 'youve', 'dreed', 'feute', 
    'saugh', 'esque', 'revue', 'noels', 'seism', 'sneer', 'geode', 'vicua', 'maids', 'fiord', 'bread', 'squet', 'goers', 'sneap', 'teuch', 
    'arcae', 'roosa', 'spues', 'could', 'tweeg', 'coiny', 'cread', 'airns', 'gauds', 'aview', 'mudee', 'vario', 'spaid', 'pooka', 'bauge', 
    'beano', 'snies', 'boose', 'holia', 'doums', 'goopy', 'feaze', 'kneel', 'gains', 'acoin', 'crood', 'juise', 'gluey', 'zowie', 'biali', 
    'leads', 'twaes', 'fogie', 'wreak', 'keech', 'bairn', 'spies', 'ghoom', 'foody', 'jails', 'waird', 'iambs', 'woold', 'belue', 'bisie', 
    'hauls', 'deans', 'eaten', 'aurar', 'anour', 'utees', 'sayee', 'droob', 'gagee', 'roleo', 'burao', 'tains', 'daubs', 'geeky', 'civie', 
    'scoop', 'sidia', 'tuque', 'fairy', 'taata', 'eater', 'beele', 'obeah', 'feeds', 'feods', 'absee', 'meous', 'cream', 'beefy', 'nauch'
], dtype="U")

# named `out` rather than `dfwd`, so the question's DataFrame isn't clobbered
out = pd.Series({
    perm: (np.char.find(word_lst, perm) != -1).sum()
    for perm in ["".join(p) for p in product("aeiou", repeat=2)]
})
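np.char.find returns the lowest index of the substring within each element, or -1 where it is absent, hence the `!= -1` membership test. A quick illustration:

>>> np.char.find(np.array(['gaize', 'musie']), 'ai')
array([ 1, -1])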
– Code Different

Here is another way:

dfwd['word'].str.extractall('([aeiou]{2})').groupby([0]).size()

Output (note that regex matches do not overlap: the vowel run 'eou' in 'meous' yields only 'eo', so 'ou' comes out as 6 here rather than 7, and pairs that never occur are absent entirely):

0
aa     1
ae     2
ai    12
ao     2
au     8
ea    15
ee    15
ei     1
eo     5
eu     2
ia     7
ie    10
io     3
oa     2
oe     2
oi     3
oo    11
ou     6
ua     2
ue     9
ui     2
– rhug123
  • Wow. Great idea - and fast. Another variation on that: `dfwd['word'].str.extractall('([aeiou]{2})').value_counts()` – jch Apr 13 '22 at 03:13
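For completeness, a sketch of that value_counts variation, extended to restore the zero-count pairs (assuming `vowel_perm` from the question is in scope; the non-overlapping-match caveat above still applies):

(dfwd['word'].str.extractall('([aeiou]{2})')[0]
     .value_counts()
     .reindex(vowel_perm, fill_value=0))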