3

I'm iterating through a list of words to find the most frequently used character across words (i.e. in the list ['hello', 'hank'], 'h' counts as appearing twice while 'l' counts as appearing once). A Python list works fine, but I'm also looking into NumPy (dtype arrays?) and Pandas. It looks like NumPy may be the way to go, but are there other packages to consider? How else can I make this function faster?

Code in Question:

from collections import Counter

def mostCommon(guessed, li):    # 'guessed' is unused in this snippet
    count = Counter()
    for word in li:
        for letter in set(word):    # set() so each letter counts once per word
            count[letter] += 1
    return count.most_common()[:10]
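
For example, with the list from above ('guessed' is unused here, so None is passed):

>>> mostCommon(None, ['hello', 'hank'])[0]
('h', 2)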

Thanks.

Chris_Rands
ZtoYi

6 Answers

3

option 1

import numpy as np

def pir1(li):
    sets = [set(s) for s in li]                    # one set of letters per word
    ul = np.array(list(set.union(*sets)))          # all distinct letters
    us = np.apply_along_axis(set, 1, ul[:, None])  # each letter wrapped as a singleton set
    c = (sets >= us).sum(1)                        # supersets: how many words contain each letter
    a = c.argsort()[:-11:-1]                       # indices of the 10 largest counts
    return ul[a]

option 2

from collections import Counter
from itertools import chain

def pir2(li):
    # chain the per-word letter sets together, then count
    return Counter(chain.from_iterable(set(i) for i in li)).most_common(10)
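
A quick sanity check of option 2 on the question's example, where 'h' should come out on top:

>>> pir2(['hello', 'hank'])[0]
('h', 2)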

Assume a list of words li

import pandas as pd
import numpy as np
from string import ascii_lowercase

li = pd.DataFrame(
    np.random.choice(list(ascii_lowercase), (1000, 10))
).sum(1).tolist()

Including Divakar's and OP's functions

def tabulate_occurrences(a):
    chars = np.asarray(a).view('S1')
    valid_chars = chars[chars!='']
    unqchars, count = np.unique(valid_chars, return_counts=1)
    return pd.DataFrame({'char':unqchars, 'count':count})

def topNchars(a, N = 10):
    s = np.core.defchararray.lower(a).view('uint8')
    unq, count = np.unique(s[s!=0], return_counts=1)
    sidx = count.argsort()[-N:][::-1]
    h = unq[sidx]
    return [str(chr(i)) for i in h]

from collections import Counter

def mostCommon(li):
    count = Counter()
    for words in li:
        for letters in set(words):
            count[letters] += 1
    return count.most_common()[:10]

testing

import pandas as pd
import numpy as np
from string import ascii_lowercase
from timeit import timeit

results = pd.DataFrame(
    index=pd.RangeIndex(5, 405, 5, name='No. Words'),
    columns=pd.Index('pir1 pir2 mostCommon topNchars'.split(), name='Method'),
)

np.random.seed([3,1415])
for i in results.index:    
    li = pd.DataFrame(
        np.random.choice(list(ascii_lowercase), (i, 10))
    ).sum(1).tolist()
    for j in results.columns:
        v = timeit(
            '{}(li)'.format(j),
            'from __main__ import {}, li'.format(j),
            number=100
        )
        results.loc[i, j] = v  # set_value was removed in newer pandas

ax = results.plot(title='Time Testing')
ax.set_ylabel('Time of 100 iterations')

[Plot: 'Time Testing', time for 100 iterations vs. number of words, one line per method]

piRSquared
  • So, the final output would be that one max occurring char, right? – Divakar Jan 13 '17 at 19:22
  • @Divakar based on their code, they wanted top 10. Also comment on other answer "what if I wanted next most frequent?" – piRSquared Jan 13 '17 at 19:24
  • @piRSquared I'm trying to time these options as you spit them out! Thanks! Can you sort them in a way that I can get the top 10 chars (ties don't matter)? Thanks. – ZtoYi Jan 13 '17 at 19:24
  • @ZtoYi I've gotta run. Hopefully I was helpful. Good luck. – piRSquared Jan 13 '17 at 19:27
  • Good options, but I don't think (as currently implemented) these fulfil the requirement of only counting each letter once per word – Chris_Rands Jan 13 '17 at 19:37
  • @piRSquared option1 does not take into account that letters that repeat in a word should not be counted twice. – ZtoYi Jan 13 '17 at 19:39
  • @Chris_Rands / ZtoYi yep, I see that. working on it – piRSquared Jan 13 '17 at 19:45
  • @piRSquared Option1 seems to be slower than the original solution I posted, why do you think it is a better implementation? As for options 2/3, I don't know much about pandas/numpy so I'm trying to figure out how to get just the list of chars at the moment, then I'll test it and let you know the speedups! – ZtoYi Jan 13 '17 at 20:03
  • @ZtoYi the old answer said "better". I intentionally left that off for now. I'm continuing to try to optimize these – piRSquared Jan 13 '17 at 20:07
  • @piRSquared Very nice graph, summarizes the results :) Thanks! – ZtoYi Jan 13 '17 at 20:54
3

Here's a NumPy approach using its views concept -

def tabulate_occurrences(a):           # Case sensitive
    chars = np.asarray(a).view('S1')
    valid_chars = chars[chars!='']
    unqchars, count = np.unique(valid_chars, return_counts=1)
    return pd.DataFrame({'char':unqchars, 'count':count})

def topNchars(a, N = 10):               # Case insensitive
    s = np.core.defchararray.lower(a).view('uint8')
    unq, count = np.unique(s[s!=0], return_counts=1)
    sidx = count.argsort()[-N:][::-1]
    h = unq[sidx]
    return [str(chr(i)) for i in h]  # chr replaces Python 2's unichr

Sample run -

In [322]: a = ['er', 'IS' , 'you', 'Is', 'is', 'er', 'IS']

In [323]: tabulate_occurrences(a) # Case sensitive
Out[323]: 
  char  count
0    I      3
1    S      2
2    e      2
3    i      1
4    o      1
5    r      2
6    s      2
7    u      1
8    y      1

In [533]: topNchars(a, 5)         # Case insensitive
Out[533]: ['s', 'i', 'r', 'e', 'y']

In [534]: topNchars(a, 10)        # Case insensitive
Out[534]: ['s', 'i', 'r', 'e', 'y', 'u', 'o']
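
The view trick works because a fixed-width byte-string array can be reinterpreted as an array of single characters, with shorter strings zero-padded. A minimal sketch of just that step (the explicit 'S2' dtype is for illustration only):

import numpy as np

a = np.array(['er', 'IS', 'I'], dtype='S2')  # fixed-width byte strings
a.view('S1')                                 # [b'e' b'r' b'I' b'S' b'I' b''] - note the padding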
Divakar
2

Assuming you only want the most frequent character, where each character counts only once per word:

>>> from itertools import chain
>>> l = ['hello', 'hank']
>>> chars = list(chain.from_iterable([list(set(word)) for word in l]))
>>> max(chars, key=chars.count)
'h'

Using max with list.count can be a lot faster than using Counter due to the C-level implementation.
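
If the next most frequent letters are needed as well (see the comments below), heapq is one option; a minimal sketch, where ties behind 'h' break arbitrarily:

>>> import heapq
>>> top2 = heapq.nlargest(2, set(chars), key=chars.count)  # e.g. ['h', 'e']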

Chris_Rands
  • what if I wanted the next most frequent letter? – ZtoYi Jan 13 '17 at 19:20
  • @ZtoYi Then you may not want to use this approach; although it's still possible, you could use `heapq` or sort the list of `chars`, but both will add quite a bit of overhead – Chris_Rands Jan 13 '17 at 19:25
0

This looks like it is already very fast and runs in O(n). The only real improvement opportunity I see would be to parallelize this process by splitting li into multiple parts, e.g. along the lines of the sketch below.
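
A minimal sketch of that idea, reusing the once-per-word counting from the question; the chunking scheme, pool size, and function names here are illustrative:

from collections import Counter
from multiprocessing import Pool

def count_chunk(words):
    # Count each letter at most once per word, as in the question's function
    c = Counter()
    for word in words:
        c.update(set(word))
    return c

def most_common_parallel(li, n=10, workers=4):
    chunks = [li[i::workers] for i in range(workers)]          # split li into roughly equal parts
    with Pool(workers) as pool:
        total = sum(pool.map(count_chunk, chunks), Counter())  # merge the partial counts
    return total.most_common(n)

if __name__ == '__main__':
    print(most_common_parallel(['hello', 'hank']))             # [('h', 2), ...]

Whether this pays off for such a cheap per-item operation is questionable; process startup and pickling overhead can easily outweigh the counting itself unless li is very large.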

A. Sokol
  • It's an O(n) problem, but I was wondering if using a different object might help the speed. (i.e. this post http://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists) – ZtoYi Jan 13 '17 at 18:57
0

Here is a pure Python solution that reduces each string to its unique characters, joins the results, and counts them (using Divakar's example list):

>>> from collections import Counter
>>> li = ['er', 'IS', 'you', 'Is', 'is', 'er', 'IS']
>>> Counter(e for sl in map(list, map(set, li)) for e in sl)
Counter({'I': 3, 'e': 2, 's': 2, 'S': 2, 'r': 2, 'o': 1, 'i': 1, 'u': 1, 'y': 1})

If you want upper and lower case to be counted as the same letter:

>>> Counter(e for sl in map(list, map(set, [s.lower() for s in li])) for e in sl)
Counter({'i': 4, 's': 4, 'e': 2, 'r': 2, 'o': 1, 'u': 1, 'y': 1})

Now let's time that:

from __future__ import print_function
from collections import Counter
import numpy as np
import pandas as pd

def dawg(li):
    return Counter(e for sl in map(list, map(set, li)) for e in sl)

def nump(a):
    chars = np.asarray(a).view('S1')
    valid_chars = chars[chars!='']
    unqchars, count = np.unique(valid_chars, return_counts=1)
    return pd.DataFrame({'char':unqchars, 'count':count})

if __name__=='__main__':
    import timeit  
    li=['er', 'IS' , 'you', 'Is', 'is', 'er', 'IS'] 
    for f in (dawg, nump):
        print("   ",f.__name__, timeit.timeit("f(li)", setup="from __main__ import f, li", number=100) )  

Results:

dawg 0.00134205818176
nump 0.0347728729248

The pure Python solution is significantly faster in this case; with a list this small, the overhead of building NumPy arrays and a DataFrame dominates the actual counting.

dawg
-1

Just do

from collections import Counter

counter = Counter(''.join(li))
most_common = counter.most_common()

and you're done.
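
Note, though, that ''.join(li) counts repeated letters within a word, while the question counts each letter once per word. A small adjustment (not part of the original answer) restores that behavior:

from collections import Counter
from itertools import chain

li = ['hello', 'hank']
counter = Counter(chain.from_iterable(map(set, li)))  # each letter at most once per word
print(counter.most_common()[0])                       # ('h', 2)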

blue_note