1

I am trying to analyse the frequency of the letters in a text. As an example I will use a short sentence here, but the approach is meant for huge texts, so it should be efficient.

Well, I have the following text:

test = "quatre jutges dun jutjat mengen fetge dun penjat"

Then I created a function which counts the frequencies

def create_dictionary2(txt):
    # Map each distinct character to its relative frequency in txt
    dictionary = {}
    for x in set(txt):
        dictionary[x] = txt.count(x)/len(txt)
    return dictionary

And then

import numpy as np
import matplotlib.pyplot as plt
test_dict = create_dictionary2(test)
plt.bar(test_dict.keys(), test_dict.values(), width=0.5, color='g')

I obtain this plot: [image]

ISSUES: I want to see all the letters, but some of them are not shown (the output is a "Container object of 15 artists"). How can I expand the histogram? I would also like to sort the histogram, to go from this [image] to this [image].

alienflow

2 Answers

5

For counting we can use a Counter object. Counter can also return its key-value pairs ordered from most to least common:

from collections import Counter

import numpy as np
import matplotlib.pyplot as plt

c = Counter("quatre jutges dun jutjat mengen fetge dun penjat")
plt.bar(*zip(*c.most_common()), width=.5, color='g')
plt.show()

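The most_common method returns a list of key-value tuples, most frequent first. The *zip(*..) transposes that list into one tuple of keys and one tuple of counts, which are unpacked as the first two arguments of plt.bar (see this answer).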
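To make the unpacking step concrete, here is a small sketch using the string "abbbbcc" from the comments. The variable names `pairs`, `letters` and `counts` are only illustrative.

```python
from collections import Counter

c = Counter("abbbbcc")

# most_common() yields (character, count) pairs, highest count first
pairs = c.most_common()

# zip(*pairs) transposes the pairs into one tuple of characters
# and one tuple of counts
letters, counts = zip(*pairs)

print(pairs)
print(letters)
print(counts)
```

So `plt.bar(*zip(*c.most_common()))` is equivalent to calling `plt.bar(letters, counts)`.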

Note: I haven't updated the width or color to match your result plots.
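If the bars still do not appear in the order given, one possible cause (an assumption, not something confirmed in this thread) is that the installed matplotlib version orders string categories on its own. A sketch that avoids relying on that behaviour is to plot against integer positions and set the letters as tick labels. The helper name `plot_sorted_counts` is hypothetical.

```python
from collections import Counter

import matplotlib.pyplot as plt


def plot_sorted_counts(ax, text):
    """Draw one bar per character on ax, most frequent first."""
    pairs = Counter(text).most_common()  # [(char, count), ...], descending
    letters = [ch for ch, _ in pairs]
    counts = [n for _, n in pairs]

    # Numeric x positions fix the bar order regardless of how the
    # plotting library treats string categories
    positions = list(range(len(letters)))
    ax.bar(positions, counts, width=0.5, color='g')
    ax.set_xticks(positions)
    ax.set_xticklabels(letters)
    return letters, counts


fig, ax = plt.subplots()
letters, counts = plot_sorted_counts(
    ax, "quatre jutges dun jutjat mengen fetge dun penjat")
plt.show()
```

Because the x values are plain integers, the order of the bars is exactly the order of the `most_common()` list.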

ikkuh
  • Okay, but this doesn't sort the bars of the histogram. I mean, I obtain the same result.. – alienflow Sep 24 '18 at 07:21
  • Not sure why it is not sorted for you. What is your output for `c.most_common()`? – ikkuh Sep 24 '18 at 07:27
  • Wow this is slick. :-) – sobek Sep 24 '18 at 07:29
  • This is the output of most_common: [('e', 7), (' ', 7), ('t', 6), ('u', 5), ('n', 5), ('j', 4), ('a', 3), ('g', 3), ('d', 2), ('q', 1), ('r', 1), ('s', 1), ('m', 1), ('f', 1), ('p', 1)] – alienflow Sep 24 '18 at 07:35
  • Is `plt.bar(['a', 'b', 'c'], [10, 5, 1])` sorted for you? – ikkuh Sep 24 '18 at 08:11
  • @ikkuh yes plt.bar(['a', 'b', 'c'], [10, 5, 1]) because a b and c appear in order, but c = Counter("abbbbcc") plt.bar(*zip(*c.most_common()), width=.5, color='g') is not – alienflow Sep 24 '18 at 09:01
  • You can check `list(zip(*c.most_common()))` to see that those are also in order. – ikkuh Sep 24 '18 at 10:26
2

Another solution using pandas:

import pandas as pd
import matplotlib.pyplot as plt

test = "quatre jutges dun jutjat mengen fetge dun penjat"

# convert input to list of chars so it is easy to get into pandas 
char_list = list(test)

# create a dataframe where each char is one row
df = pd.DataFrame({'chars': char_list})
# drop all the space characters
df = df[df.chars != ' ']
# add a column for aggregation later
df['num'] = 1
# group rows by character, count the occurrences in each group,
# sort by count and divide by the number of non-space characters
df = df.groupby('chars').sum().sort_values('num', ascending=False) / len(df)

plt.bar(df.index, df.num, width=0.5, color='g')
plt.show()

Result:

[image: bar chart of the letter frequencies, sorted in descending order]
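As a shorter variant of the same idea, here is a sketch that uses `Series.value_counts(normalize=True)`, which returns relative frequencies already sorted in descending order. The bars are drawn at integer positions with the letters as tick labels, which is my own choice here so that the order does not depend on how the plotting library handles string categories.

```python
import pandas as pd
import matplotlib.pyplot as plt

test = "quatre jutges dun jutjat mengen fetge dun penjat"

# one element per character, spaces dropped
chars = pd.Series(list(test))
chars = chars[chars != ' ']

# relative frequency per character, most frequent first
freq = chars.value_counts(normalize=True)

positions = range(len(freq))
plt.bar(positions, freq.values, width=0.5, color='g')
plt.xticks(positions, freq.index)
plt.show()
```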

Edit: I timed my solution and ikkuh's:

Using counter: 10000 loops, best of 3: 21.3 µs per loop

Using pandas groupby: 10 loops, best of 3: 22.1 ms per loop

For this small dataset, Counter is definitely a LOT faster. Maybe I'll time this for a bigger set when I have time.
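For anyone who wants to repeat the comparison on a bigger input, here is a rough sketch using the standard-library `timeit` module. The input size (the test sentence repeated 10,000 times), the number of repetitions and the function names `count_with_counter` and `count_with_pandas` are arbitrary choices for illustration; the numbers will depend on the machine and library versions.

```python
import timeit
from collections import Counter

import pandas as pd

sentence = "quatre jutges dun jutjat mengen fetge dun penjat"
big_text = sentence * 10_000  # hypothetical larger input, 480,000 characters


def count_with_counter(text):
    # character -> number of occurrences, via collections.Counter
    return dict(Counter(text))


def count_with_pandas(text):
    # character -> number of occurrences, via a pandas groupby
    df = pd.DataFrame({'chars': list(text), 'num': 1})
    return df.groupby('chars')['num'].sum().to_dict()


for func in (count_with_counter, count_with_pandas):
    # average over 5 calls on the large input
    seconds = timeit.timeit(lambda: func(big_text), number=5) / 5
    print(f"{func.__name__}: {seconds * 1000:.1f} ms per call")
```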

sobek
  • I have tried your code, but it doesnt appear sorted. Why? – alienflow Sep 24 '18 at 07:22
  • @alienflow I don't know, the code i posted works with the test data you provided. Did you leave something out when copying? What version of pandas do you have installed? – sobek Sep 24 '18 at 07:28
  • I'm using jupyter for python and version 0.22.0 pandas. I copied everything. Won't it be for using jupyter then? – alienflow Sep 24 '18 at 07:34
  • Actually, I have just tried the same in PyCharm and got the same result :( This is a nightmare – alienflow Sep 24 '18 at 07:50
  • @alienflow Is the picture above sorted correctly for you? How do you want the letters sorted? – sobek Sep 24 '18 at 08:29
  • Yes, I want to have it sorted like you, but I tried jupyter lab, pycharm and nothing seems to work, different scripts... It just appears like the image I have attached in the question – alienflow Sep 24 '18 at 08:59
  • I just ran it in jupyter lab and it works for me with pandas 0.20.3 and 0.23.0... – sobek Sep 24 '18 at 09:05
  • It's really weird and I dont know how to solve it :( Could it be something related with the plotting library? – alienflow Sep 24 '18 at 09:07