
I want to count the number of two-character vowel permutations contained within a list of 5-letter words. Vowel permutations are pairs like 'aa', 'ae', 'ai', ..., 'ui', 'uo', 'uu'.

I successfully got this done using apply(), but it is slow. I want to see if there is a fast, vectorized way to do it. I can't think of one.

Here's what I did:

import pandas as pd
import itertools

vowels = list('aeiou')
# all 25 two-character vowel pairs: 'aa', 'ae', ..., 'uu'
vowel_perm = [x[0] + x[1] for x in itertools.product(vowels, vowels)]

def wide_contains(x):
    # one boolean per vowel pair: does the word contain it?
    return pd.Series(data=[c in x for c in vowel_perm], index=vowel_perm)

dfwd['word'].apply(wide_contains).sum()

aa     1
ae     2
ai    12
ao     2
au     8
ea    15
ee    15
ei     1
eo     5
eu     2
ia     7
ie    10
ii     0
io     3
iu     0
oa     2
oe     2
oi     3
oo    11
ou     7
ua     2
ue     9
ui     2
uo     0
uu     0

The above is the expected output for the following data:

word_lst = ['gaize', 'musie', 'dauts', 'orgue', 'tough', 'medio', 'roars', 'leath', 'quire', 'kaons', 'iatry', 'tuath', 'tarea', 'hairs', 'sloid', 
'beode', 'fours', 'belie', 'qaids', 'cobia', 'cokie', 'wreat', 'spoom', 'soaps', 'usque', 'frees', 'rials', 'youve', 'dreed', 'feute', 
'saugh', 'esque', 'revue', 'noels', 'seism', 'sneer', 'geode', 'vicua', 'maids', 'fiord', 'bread', 'squet', 'goers', 'sneap', 'teuch', 
'arcae', 'roosa', 'spues', 'could', 'tweeg', 'coiny', 'cread', 'airns', 'gauds', 'aview', 'mudee', 'vario', 'spaid', 'pooka', 'bauge', 
'beano', 'snies', 'boose', 'holia', 'doums', 'goopy', 'feaze', 'kneel', 'gains', 'acoin', 'crood', 'juise', 'gluey', 'zowie', 'biali', 
'leads', 'twaes', 'fogie', 'wreak', 'keech', 'bairn', 'spies', 'ghoom', 'foody', 'jails', 'waird', 'iambs', 'woold', 'belue', 'bisie', 
'hauls', 'deans', 'eaten', 'aurar', 'anour', 'utees', 'sayee', 'droob', 'gagee', 'roleo', 'burao', 'tains', 'daubs', 'geeky', 'civie', 
'scoop', 'sidia', 'tuque', 'fairy', 'taata', 'eater', 'beele', 'obeah', 'feeds', 'feods', 'absee', 'meous', 'cream', 'beefy', 'nauch']

dfwd = pd.DataFrame(word_lst, columns=['word'])
– jch

5 Answers


Well, if skipping Pandas for this computation altogether is alright, it looks like a plain old Counter() is about 220x faster on my machine for the given data.

from collections import Counter
import timeit


def timetest(func, name=None):
    name = name or getattr(func, "__name__", None)
    iters, time = timeit.Timer(func).autorange()
    iters_per_sec = iters / time
    print(f"{name=} {iters=} {time=:.3f} {iters_per_sec=:.2f}")


def counter():
    ctr = Counter()
    for word in dfwd['word']:
        for perm in vowel_perm:
            if perm in word:
                ctr[perm] += 1
    return ctr


# the OP's apply-based approach, wrapped as a function for timing
def original():
    return dfwd['word'].apply(wide_contains).sum()


timetest(original)
timetest(counter)
print(counter())

outputs

name='original' iters=10 time=0.229 iters_per_sec=43.59
name='counter' iters=2000 time=0.212 iters_per_sec=9434.29
Counter({'ea': 15, 'ee': 15, 'ai': 12, 'oo': 11, 'ie': 10, 'ue': 9, 'au': 8, 'ou': 7, 'ia': 7, 'eo': 5, 'io': 3, 'oi': 3, 'oa': 2, 'ui': 2, 'ao': 2, 'ua': 2, 'eu': 2, 'oe': 2, 'ae': 2, 'ei': 1, 'aa': 1})
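If you want the result back in the same shape as the question's Series, zero-count pairs included, the Counter converts cleanly. A minimal sketch, assuming `pd` and `vowel_perm` from the question are in scope:

# Series indexed by all 25 pairs; pairs the Counter never saw become 0
pd.Series(counter()).reindex(vowel_perm, fill_value=0)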
– AKX
  • Thanks. Great answer. Couldn't find `from itertools import Counter` but `from collections import Counter` did work. – jch Apr 12 '22 at 23:15
  • Oop, yeah, typo :) – AKX Apr 13 '22 at 05:05
  • It is kind of refreshing to see a good-old nested loop come in as the fastest solution (by my testing on my machine). – jch Apr 13 '22 at 20:22

How about a dictionary comprehension? It should be faster than using apply:

{v: dfwd['word'].str.count(v).sum() for v in vowel_perm}
# 6.9 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

{'aa': 1,
 'ae': 2,
 'ai': 12,
 'ao': 2,
 'au': 8,
 'ea': 15,
 'ee': 15,
 'ei': 1,
 'eo': 5,
 'eu': 2,
 'ia': 7,
 'ie': 10,
 'ii': 0,
 'io': 3,
 'iu': 0,
 'oa': 2,
 'oe': 2,
 'oi': 3,
 'oo': 11,
 'ou': 7,
 'ua': 2,
 'ue': 9,
 'ui': 2,
 'uo': 0,
 'uu': 0}
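One caveat: `str.count` counts occurrences rather than testing membership, so a word containing the same pair twice would contribute 2 instead of 1. The results happen to coincide for this dataset; to mirror the question's `c in x` semantics exactly, `str.contains` can be swapped in:

# boolean membership test per word, summed over the column
{v: dfwd['word'].str.contains(v).sum() for v in vowel_perm}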
– It_is_Chris

Another option is to simply iterate over the vowel pairs and count the number of occurrences of each pair in word_lst. Note that for the current task you don't need to create the explicit vowel_perm list either; simply iterate over the map object:

out = pd.Series({pair: sum(True for w in word_lst if pair in w) 
                 for pair in map(''.join, itertools.product(vowels,vowels))})

On my machine, a benchmark says:

>>> %timeit out = pd.Series({pair: sum(True for w in word_lst if pair in w) for pair in map(''.join, itertools.product(vowels,vowels))})
492 µs ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit vowel_perm = [x[0]+x[1] for x in itertools.product(vowels,vowels)]; out = dfwd['word'].apply(wide_contains).sum()
40.6 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
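Since bool is a subclass of int in Python, the membership tests can also be summed directly, dropping the `True for ... if ...` construction; an equivalent, slightly more idiomatic spelling:

out = pd.Series({pair: sum(pair in w for w in word_lst)
                 for pair in map(''.join, itertools.product(vowels, vowels))})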
  • This seems to be about half as fast as my Counter-based answer for the given dataset (on my machine). – AKX Apr 13 '22 at 05:05

You can use numpy.char.find to search an array of strings for a substring:

from itertools import product

import numpy as np
import pandas as pd

word_lst = np.array([
    'gaize', 'musie', 'dauts', 'orgue', 'tough', 'medio', 'roars', 'leath', 'quire', 'kaons', 'iatry', 'tuath', 'tarea', 'hairs', 'sloid', 
    'beode', 'fours', 'belie', 'qaids', 'cobia', 'cokie', 'wreat', 'spoom', 'soaps', 'usque', 'frees', 'rials', 'youve', 'dreed', 'feute', 
    'saugh', 'esque', 'revue', 'noels', 'seism', 'sneer', 'geode', 'vicua', 'maids', 'fiord', 'bread', 'squet', 'goers', 'sneap', 'teuch', 
    'arcae', 'roosa', 'spues', 'could', 'tweeg', 'coiny', 'cread', 'airns', 'gauds', 'aview', 'mudee', 'vario', 'spaid', 'pooka', 'bauge', 
    'beano', 'snies', 'boose', 'holia', 'doums', 'goopy', 'feaze', 'kneel', 'gains', 'acoin', 'crood', 'juise', 'gluey', 'zowie', 'biali', 
    'leads', 'twaes', 'fogie', 'wreak', 'keech', 'bairn', 'spies', 'ghoom', 'foody', 'jails', 'waird', 'iambs', 'woold', 'belue', 'bisie', 
    'hauls', 'deans', 'eaten', 'aurar', 'anour', 'utees', 'sayee', 'droob', 'gagee', 'roleo', 'burao', 'tains', 'daubs', 'geeky', 'civie', 
    'scoop', 'sidia', 'tuque', 'fairy', 'taata', 'eater', 'beele', 'obeah', 'feeds', 'feods', 'absee', 'meous', 'cream', 'beefy', 'nauch'
], dtype="U")

# named `out` rather than `dfwd`, so the question's DataFrame isn't clobbered
out = pd.Series({
    perm: (np.char.find(word_lst, perm) != -1).sum()
    for perm in ["".join(p) for p in product("aeiou", repeat=2)]
})
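np.char.find returns the lowest index of the substring within each element, or -1 where it is absent, hence the `!= -1` membership test. A quick illustration:

>>> np.char.find(np.array(['gaize', 'musie']), 'ai')
array([ 1, -1])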
– Code Different

Here is another way:

dfwd['word'].str.extractall('([aeiou]{2})').groupby([0]).size()

Output (note that regex matches do not overlap: the vowel run 'eou' in 'meous' yields only 'eo', so 'ou' comes out as 6 here rather than 7, and pairs that never occur are absent entirely):

0
aa     1
ae     2
ai    12
ao     2
au     8
ea    15
ee    15
ei     1
eo     5
eu     2
ia     7
ie    10
io     3
oa     2
oe     2
oi     3
oo    11
ou     6
ua     2
ue     9
ui     2
– rhug123
  • Wow. Great idea - and fast. Another variation on that: `dfwd['word'].str.extractall('([aeiou]{2})').value_counts()` – jch Apr 13 '22 at 03:13
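For completeness, a sketch of that value_counts variation, extended to restore the zero-count pairs (assuming `vowel_perm` from the question is in scope; the non-overlapping-match caveat above still applies):

(dfwd['word'].str.extractall('([aeiou]{2})')[0]
     .value_counts()
     .reindex(vowel_perm, fill_value=0))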