
I'm trying to create a function in Python that, given a list of strings, returns a dict where each key (an index) maps to the most frequent character at that index across all the strings. For example, list1 = ['one', 'two', 'twin', 'who'] should return {0: 't', 1: 'w', 2: 'o', 3: 'n'}; the most frequent character at index 1 across all the strings is 'w'. I found a solution, but with lists containing thousands of strings it takes too long to run. I would like some help decreasing the execution time.

Here is what I tried, but it seems too slow for lists containing thousands of strings:

list1 = ['one', 'two', 'twin', 'who']

width = len(max(list1, key=len))

chars = {}

for i, item in enumerate(zip(*[s.ljust(width) for s in list1])):
    set1 = set(item)
    if ' ' in set1:
        set1.remove(' ')
    chars[i] = max(set1, key=item.count)
print(chars)

2 Answers


Whether something is quick enough is a matter of use case, but this solution takes a couple of seconds to go through the default wordlist available under OS X.

Python's collections.Counter implements a counter object for you, so you don't have to keep track of the counts of multiple possible values yourself.

I've paired it with defaultdict, which initializes a missing key with the result of a factory function - so the first time we see an index, its entry is initialized to a Counter object that we then update.

from collections import defaultdict, Counter


with open("/usr/share/dict/words") as f:
    words = f.read().splitlines()
    letters = defaultdict(Counter)

    for word in words:
        for idx, letter in enumerate(word):
            letters[idx].update((letter, ))

for idx, counter in letters.items():
    print(idx, counter.most_common(1))

As mentioned, whether this is quick enough depends on your use case; it can be made considerably faster if necessary, but it's probably quick enough. For 235,886 words the runtime is:

python3 letterfreq.py  2.67s user 0.04s system 99% cpu 2.734 total

This assumes every word is already lowercase; if not, lowercase it before adding it to your Counter object.
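For instance, a minimal sketch of that normalization step, using a tiny made-up word list (the word list here is just an illustration, not part of the original answer):

```python
from collections import defaultdict, Counter

letters = defaultdict(Counter)

for word in ['One', 'TWO', 'oak']:
    # Lowercase the whole word once before counting, so 'O' and 'o'
    # are merged into the same Counter entry.
    for idx, letter in enumerate(word.lower()):
        letters[idx].update((letter,))

# 'One' and 'oak' both contribute 'o' at index 0, so it wins there.
print(letters[0].most_common(1))
```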

If you want to implement it without the Counter or defaultdict parts of the standard library (which are just helpers that save you from reimplementing the same small pieces of code repeatedly), you can do the same thing manually:

with open("/usr/share/dict/words") as f:
    words = f.read().splitlines()
    letter_positions = {}

    for word in words:
        for idx, letter in enumerate(word):
            if idx not in letter_positions:
                letter_positions[idx] = {}

            if letter not in letter_positions[idx]:
                letter_positions[idx][letter] = 0

            letter_positions[idx][letter] += 1

final_dict = {}

for idx, counts in letter_positions.items():
    most_popular = sorted(counts.items(), key=lambda v: v[1], reverse=True)
    print(idx, most_popular)
    final_dict[idx] = most_popular[0][0]

print(final_dict)

Then pick as many entries as necessary from most_popular when going through the list afterwards.
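As a sketch of that last step, reusing the manual counting approach above on the question's small list: sorting by (-count, letter) instead of just the count also breaks ties alphabetically, picking 'a' over 'f' when their frequencies are equal (the top2 name and the list of words are illustrative assumptions, not from the original answer):

```python
words = ['one', 'two', 'twin', 'who']
letter_positions = {}

for word in words:
    for idx, letter in enumerate(word):
        letter_positions.setdefault(idx, {})
        letter_positions[idx][letter] = letter_positions[idx].get(letter, 0) + 1

# Take the top two letters per index; the (-count, letter) key sorts by
# descending frequency first, then alphabetically among equal counts.
top2 = {
    idx: [letter for letter, _ in
          sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:2]]
    for idx, counts in letter_positions.items()
}

print(top2)  # → {0: ['t', 'o'], 1: ['w', 'h'], 2: ['o', 'e'], 3: ['n']}
```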

Since we're no longer using the defaultdict and Counter abstractions, the running time is now roughly 40% of the previous one:

python3 letterfreq2.py  1.08s user 0.03s system 98% cpu 1.124 total

It's usually a good idea to go through what you're trying to do and formulate a strategy - i.e. "ok, I need to keep track of how many times a letter has appeared in this location... so for that I need some way to keep values for each index... and then for each letter...".

MatsLindh
  • thanks for the answer but I need a solution without importing any libraries – terrier99uk Nov 12 '22 at 11:38
  • Those are part of the standard library, but sure, you can just do the same thing as those functions/objects do manually; I've added an example of that as well. – MatsLindh Nov 12 '22 at 11:45
  • if I try it using the list1 = ['one', 'two', 'twin', 'who'] it doesn't return me a dictionary with {0:'t', 1:'w', 2:'o', 3:'n'} and that's what I need to get returned from the function when I print the dict – terrier99uk Nov 12 '22 at 12:03
  • @terrier99uk In that case, iterate over the sorted dict and assign it to a secondary dict. This outputs a sorted list with the characters sorted in popularity - feel free to do that last part to get the exact format you need; but I've added an example of that as well. – MatsLindh Nov 12 '22 at 12:05
  • @terrier99uk Depending on what you're asking about this for, if this is for a class assignment or similar (or any other reason, really), you need to spend the time to understand _what_ is happening and why something works the way it does. – MatsLindh Nov 12 '22 at 12:11
  • yes I understood every step you wrote. I just wanted to improve my code. just one last question.... most_popular = sorted(counts.items(), key=lambda v: v[1], reverse=True)....here you take the most frequent chars for a given index from a dictionary; if two characters have the same frequency how can I tell the function to get the lowest one alphabetically? for example between 'a' and 'f' with the same frequency I choose 'a' – terrier99uk Nov 12 '22 at 13:45
  • What runtime did you get on the same data with *their* solution? – Kelly Bundy Nov 12 '22 at 14:26
  • @KellyBundy Because you're sorting the dict by the value. `items()` give you a tuple with `(key, value)`, using `key=lambda x: x[1]` means that you use the value as the criteria for the sort. Their solution spent about 2.71-2.73s with the same dataset, so slightly slower than the version that uses Counter and defaultdict. – MatsLindh Nov 12 '22 at 14:48
  • Oh, that was just me not looking up whether giving a class was considered a callable, so I just wrapped it in a lambda while writing my stream of thought. No other reason. – MatsLindh Nov 12 '22 at 14:58
  • Could you measure one more? Their original, but with `key=''.join(item).count`. Strings count much faster than lists (and counting multiple times likely makes the single join worth it). – Kelly Bundy Nov 12 '22 at 15:08
  • @KellyBundy You have all the code available here, feel free to make your own measurements and attempted changes to speed it up; I've only used `time` to time the different solution, so the time includes starting the Python interpreter as well. That way you can also attach a line profiler and do lower level optimizations as you want. – MatsLindh Nov 12 '22 at 17:27
  • Ok, I did, tio.run happens to have such a file. That little change made it about 10x faster, and about twice as fast as your faster solution. – Kelly Bundy Nov 12 '22 at 19:26
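The change Kelly Bundy measured can be sketched by applying it to the question's original code: join each tuple of characters into a string once, then let str.count (which runs in fast C code) do the repeated counting instead of tuple.count. The ~10x figure is their measurement on their dataset, not a guarantee:

```python
list1 = ['one', 'two', 'twin', 'who']

width = len(max(list1, key=len))
chars = {}

for i, item in enumerate(zip(*[s.ljust(width) for s in list1])):
    # Build the column string once; counting substrings in a str is much
    # faster than counting elements in a tuple.
    joined = ''.join(item)
    set1 = set(joined)
    set1.discard(' ')  # drop the ljust() padding
    chars[i] = max(set1, key=joined.count)

print(chars)  # → {0: 't', 1: 'w', 2: 'o', 3: 'n'}
```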

Here are some improvements based on your own algorithm.

First, you can use itertools.zip_longest() instead of zip() to remove the need for ljust() and the width variable:

from itertools import zip_longest

list1 = ['one', 'two', 'twin', 'who']

chars = {}

for i, item in enumerate(zip_longest(*list1)):
    set1 = set(item)
    if None in set1:
        set1.remove(None)
    chars[i] = max(set1, key=item.count)
print(chars)

Then, replace max(set1, key=item.count) with the more efficient Counter(item).most_common(1)[0][0], adding or set1.most_common(2)[1][0] as a fallback for when the most common entry is the None padding value:

from itertools import zip_longest
from collections import Counter

list1 = ['one', 'two', 'twin', 'who']

chars = {}

for i, item in enumerate(zip_longest(*list1)):
    set1 = Counter(item)
    # If the most common entry is the None padding value (which is falsy),
    # fall back to the second most common entry.
    chars[i] = set1.most_common(1)[0][0] or set1.most_common(2)[1][0]
print(chars)

As itertools and collections are built-in Python modules, you can import them directly without pip-installing anything.

Thaiminhpv
  • can you give me another solution without importing any libraries? because I need it without importing any libraries – terrier99uk Nov 12 '22 at 11:59