3

I'm iterating through a list of words to find the most frequently used character across words (i.e. in the list ['hello', 'hank'], 'h' counts as appearing twice while 'l' counts as appearing once). A Python list works fine, but I'm also looking into NumPy (dtype arrays?) and Pandas. It looks like NumPy may be the way to go, but are there other packages to consider? How else can I make this function faster?

Code in Question:

from collections import Counter

def mostCommon(guessed, li):    # 'guessed' is unused in this snippet
    count = Counter()
    for word in li:
        for letter in set(word):    # set() so each letter counts once per word
            count[letter] += 1
    return count.most_common()[:10]
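
For example, with the list from above ('guessed' is unused here, so None is passed):

>>> mostCommon(None, ['hello', 'hank'])[0]
('h', 2)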

Thanks.

Chris_Rands
ZtoYi

6 Answers

3

option 1

import numpy as np

def pir1(li):
    sets = [set(s) for s in li]                    # one set of letters per word
    ul = np.array(list(set.union(*sets)))          # all distinct letters
    us = np.apply_along_axis(set, 1, ul[:, None])  # each letter wrapped as a singleton set
    c = (sets >= us).sum(1)                        # supersets: how many words contain each letter
    a = c.argsort()[:-11:-1]                       # indices of the 10 largest counts
    return ul[a]

option 2

from collections import Counter
from itertools import chain

def pir2(li):
    # chain the per-word letter sets together, then count
    return Counter(chain.from_iterable(set(i) for i in li)).most_common(10)
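
A quick sanity check of option 2 on the question's example, where 'h' should come out on top:

>>> pir2(['hello', 'hank'])[0]
('h', 2)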

Assume a list of words li

import pandas as pd
import numpy as np
from string import ascii_lowercase

li = pd.DataFrame(
    np.random.choice(list(ascii_lowercase), (1000, 10))
).sum(1).tolist()

Including Divakar's and OP's functions

def tabulate_occurrences(a):
    chars = np.asarray(a).view('S1')
    valid_chars = chars[chars!='']
    unqchars, count = np.unique(valid_chars, return_counts=1)
    return pd.DataFrame({'char':unqchars, 'count':count})

def topNchars(a, N = 10):
    s = np.core.defchararray.lower(a).view('uint8')
    unq, count = np.unique(s[s!=0], return_counts=1)
    sidx = count.argsort()[-N:][::-1]
    h = unq[sidx]
    return [str(chr(i)) for i in h]

from collections import Counter

def mostCommon(li):
    count = Counter()
    for words in li:
        for letters in set(words):
            count[letters] += 1
    return count.most_common()[:10]

testing

import pandas as pd
import numpy as np
from string import ascii_lowercase
from timeit import timeit

results = pd.DataFrame(
    index=pd.RangeIndex(5, 405, 5, name='No. Words'),
    columns=pd.Index('pir1 pir2 mostCommon topNchars'.split(), name='Method'),
)

np.random.seed([3,1415])
for i in results.index:    
    li = pd.DataFrame(
        np.random.choice(list(ascii_lowercase), (i, 10))
    ).sum(1).tolist()
    for j in results.columns:
        v = timeit(
            '{}(li)'.format(j),
            'from __main__ import {}, li'.format(j),
            number=100
        )
        results.loc[i, j] = v  # set_value was removed in newer pandas

ax = results.plot(title='Time Testing')
ax.set_ylabel('Time of 100 iterations')

[Plot: 'Time Testing', time for 100 iterations vs. number of words, one line per method]

piRSquared
  • So, the final output would be that one max occurring char, right? – Divakar Jan 13 '17 at 19:22
  • @Divakar based on their code, they wanted top 10. Also comment on other answer "what if I wanted next most frequent?" – piRSquared Jan 13 '17 at 19:24
  • @piRSquared I'm trying to time these options as you spit them out! Thanks! Can you sort them in a way that I can get the top 10 chars (ties don't matter)? Thanks. – ZtoYi Jan 13 '17 at 19:24
  • @ZtoYi I've gotta run. Hopefully I was helpful. Good luck. – piRSquared Jan 13 '17 at 19:27
  • Good options, but I don't think (as currently implemented) these fulfil the requirement of only counting each letter once per word – Chris_Rands Jan 13 '17 at 19:37
  • @piRSquared option1 does not take into account that letters that repeat in a word should not be counted twice. – ZtoYi Jan 13 '17 at 19:39
  • @Chris_Rands / ZtoYi yep, I see that. working on it – piRSquared Jan 13 '17 at 19:45
  • @piRSquared Option1 seems to be slower than the original solution I posted, why do you think it is a better implementation? As for options 2/3, I don't know much about pandas/numpy so I'm trying to figure out how to get just the list of chars at the moment, then I'll test it and let you know the speedups! – ZtoYi Jan 13 '17 at 20:03
  • @ZtoYi the old answer said "better". I intentionally left that off for now. I'm continuing to try to optimize these – piRSquared Jan 13 '17 at 20:07
  • @piRSquared Very nice graph, summarizes the results :) Thanks! – ZtoYi Jan 13 '17 at 20:54
3

Here's a NumPy approach using its views concept -

def tabulate_occurrences(a):           # Case sensitive
    chars = np.asarray(a).view('S1')
    valid_chars = chars[chars!='']
    unqchars, count = np.unique(valid_chars, return_counts=1)
    return pd.DataFrame({'char':unqchars, 'count':count})

def topNchars(a, N = 10):               # Case insensitive
    s = np.core.defchararray.lower(a).view('uint8')
    unq, count = np.unique(s[s!=0], return_counts=1)
    sidx = count.argsort()[-N:][::-1]
    h = unq[sidx]
    return [str(chr(i)) for i in h]  # chr replaces Python 2's unichr

Sample run -

In [322]: a = ['er', 'IS' , 'you', 'Is', 'is', 'er', 'IS']

In [323]: tabulate_occurrences(a) # Case sensitive
Out[323]: 
  char  count
0    I      3
1    S      2
2    e      2
3    i      1
4    o      1
5    r      2
6    s      2
7    u      1
8    y      1

In [533]: topNchars(a, 5)         # Case insensitive
Out[533]: ['s', 'i', 'r', 'e', 'y']

In [534]: topNchars(a, 10)        # Case insensitive
Out[534]: ['s', 'i', 'r', 'e', 'y', 'u', 'o']
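
The view trick works because a fixed-width byte-string array can be reinterpreted as an array of single characters, with shorter strings zero-padded. A minimal sketch of just that step (the explicit 'S2' dtype is for illustration only):

import numpy as np

a = np.array(['er', 'IS', 'I'], dtype='S2')  # fixed-width byte strings
a.view('S1')                                 # [b'e' b'r' b'I' b'S' b'I' b''] - note the padding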
Divakar
2

Assuming you only want the most frequent character, where each character counts only once per word:

>>> from itertools import chain
>>> l = ['hello', 'hank']
>>> chars = list(chain.from_iterable([list(set(word)) for word in l]))
>>> max(chars, key=chars.count)
'h'

Using max with list.count can be a lot faster than using Counter due to the C-level implementation.
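
If the next most frequent letters are needed as well (see the comments below), heapq is one option; a minimal sketch, where ties behind 'h' break arbitrarily:

>>> import heapq
>>> top2 = heapq.nlargest(2, set(chars), key=chars.count)  # e.g. ['h', 'e']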

Chris_Rands
  • what if I wanted the next most frequent letter? – ZtoYi Jan 13 '17 at 19:20
  • @ZtoYi Then you may not want to use this approach; although it's still possible, you could use `heapq` or sort the list of `chars`, but both will add quite a bit of overhead – Chris_Rands Jan 13 '17 at 19:25
0

This looks like it is already very fast and runs in O(n). The only real improvement opportunity I see would be to parallelize this process by splitting li into multiple parts, e.g. along the lines of the sketch below.
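
A minimal sketch of that idea, reusing the once-per-word counting from the question; the chunking scheme, pool size, and function names here are illustrative:

from collections import Counter
from multiprocessing import Pool

def count_chunk(words):
    # Count each letter at most once per word, as in the question's function
    c = Counter()
    for word in words:
        c.update(set(word))
    return c

def most_common_parallel(li, n=10, workers=4):
    chunks = [li[i::workers] for i in range(workers)]          # split li into roughly equal parts
    with Pool(workers) as pool:
        total = sum(pool.map(count_chunk, chunks), Counter())  # merge the partial counts
    return total.most_common(n)

if __name__ == '__main__':
    print(most_common_parallel(['hello', 'hank']))             # [('h', 2), ...]

Whether this pays off for such a cheap per-item operation is questionable; process startup and pickling overhead can easily outweigh the counting itself unless li is very large.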

A. Sokol
  • It's an O(n) problem, but I was wondering if using a different object might help the speed. (i.e. this post http://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists) – ZtoYi Jan 13 '17 at 18:57
0

Here is a pure Python solution that reduces each string to its unique characters, joins the results, and counts them (using Divakar's example list):

>>> from collections import Counter
>>> li = ['er', 'IS', 'you', 'Is', 'is', 'er', 'IS']
>>> Counter(e for sl in map(list, map(set, li)) for e in sl)
Counter({'I': 3, 'e': 2, 's': 2, 'S': 2, 'r': 2, 'o': 1, 'i': 1, 'u': 1, 'y': 1})

If you want upper and lower case to be counted as the same letter:

>>> Counter(e for sl in map(list, map(set, [s.lower() for s in li])) for e in sl)
Counter({'i': 4, 's': 4, 'e': 2, 'r': 2, 'o': 1, 'u': 1, 'y': 1})

Now let's time that:

from __future__ import print_function
from collections import Counter
import numpy as np
import pandas as pd

def dawg(li):
    return Counter(e for sl in map(list, map(set, li)) for e in sl)

def nump(a):
    chars = np.asarray(a).view('S1')
    valid_chars = chars[chars!='']
    unqchars, count = np.unique(valid_chars, return_counts=1)
    return pd.DataFrame({'char':unqchars, 'count':count})

if __name__=='__main__':
    import timeit  
    li=['er', 'IS' , 'you', 'Is', 'is', 'er', 'IS'] 
    for f in (dawg, nump):
        print("   ",f.__name__, timeit.timeit("f(li)", setup="from __main__ import f, li", number=100) )  

Results:

dawg 0.00134205818176
nump 0.0347728729248

The pure Python solution is significantly faster in this case; with a list this small, the overhead of building NumPy arrays and a DataFrame dominates the actual counting.

dawg
-1

Just do

from collections import Counter

counter = Counter(''.join(li))
most_common = counter.most_common()

and you're done.
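
Note, though, that ''.join(li) counts repeated letters within a word, while the question counts each letter once per word. A small adjustment (not part of the original answer) restores that behavior:

from collections import Counter
from itertools import chain

li = ['hello', 'hank']
counter = Counter(chain.from_iterable(map(set, li)))  # each letter at most once per word
print(counter.most_common()[0])                       # ('h', 2)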

blue_note