How to extract the top half of the most frequent values in a list (not a dictionary)

Question

I have a list of values and want to extract the top half of the most frequent ones. My list is in mat_sup. I first used:

mat_sup = np.column_stack(np.unique(mat_sup, return_counts=True))

which gives me the values and their number of appearances. Then I used:

mat_sup = mat_sup[np.core.records.fromarrays([mat_sup[:,1]],names='a').argsort()]

to sort the list by the second column (the number of appearances). Unfortunately, the numbers are stored as text, so the sort does not give the expected result. Any solution, please?

Please add a simple input and the desired output. – I'mahdi Jun 25 '22 at 10:53
Also add the input and the bad output that you want to fix. – I'mahdi Jun 25 '22 at 10:53

Beni Trainor · Accepted Answer · 2022-06-25T12:28:43.213

I've tried reproducing your problem with an example list of words:

words = np.array([
    "zodiacal",
    "zodiacs",
    "zombi",
    "zombie",
    "zombie",
    "zoned",
    "zoned",
    "zones",
    "zoo",
    "zoological",
    "zoological",
    "zoological",
    "zoological",
    "zoologist",
    "zoologist's",
    "zoologists",
])

From there I calculated the word counts:

word_counts = np.column_stack(np.unique(words, return_counts=True))

And finally sorted (so far, the same as your code):

word_counts = word_counts[np.core.records.fromarrays([word_counts[:,1]],names='a').argsort()]

Now, from what I understand, you want the top half of the words that appear the most. This is what I came up with:

# Reverse and get a slice of the top half
word_counts = word_counts[::-1][:len(word_counts) // 2]

This outputs:

[['zoological' '4']
 ['zoned' '2']
 ['zombie' '2']
 ['zoologists' '1']
 ["zoologist's" '1']]
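
Note the quotes around the counts in that output: np.column_stack has to pick one dtype for both columns, so the integer counts are converted to strings. Here is a minimal sketch of what that does to sorting once a count has two digits. The `values` array is made up for illustration and is not from the question.

```python
import numpy as np

# Hypothetical data: "a" appears 12 times, "b" twice, "c" nine times.
values = np.array(["a"] * 12 + ["b"] * 2 + ["c"] * 9)

uniq, counts = np.unique(values, return_counts=True)
stacked = np.column_stack((uniq, counts))

# Both columns now share a single string dtype.
is_text = stacked.dtype.kind == "U"

# Sorting the count column compares characters, not numbers,
# so "12" sorts before "2" and "9".
text_order = np.sort(stacked[:, 1]).tolist()

# Converting back to integers restores the numeric order.
numeric_order = np.sort(stacked[:, 1].astype(int)).tolist()

print(is_text, text_order, numeric_order)
```

With single-digit counts the text order and the numeric order happen to agree, which is why the example list above sorts correctly.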

I hope this helps.

You might want to read this post which is where I got my answer.

EDIT: the solution that I previously wrote only works when occurrences do not exceed 9, because the counts are compared as text. For cases where they do exceed 9, you would do the following:

word_counts = list(word_counts)
word_counts.sort(key=lambda entry: int(entry[1]), reverse=True)
word_counts = np.array(word_counts)[:len(word_counts) // 2]
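
An alternative that may be worth considering is to never stack the counts with the words, so that they stay integers and can be sorted with np.argsort directly. This is only a sketch under that assumption, not the code from the question; the `words` array and the names `order`, `top`, `top_words` and `top_counts` are chosen here for illustration.

```python
import numpy as np

# Hypothetical sample with counts above 9.
words = np.array(["zoo"] * 12 + ["zombie"] * 2 + ["zoned"] * 9 + ["zones"])

# Keep the two results separate: uniq holds strings, counts holds integers.
uniq, counts = np.unique(words, return_counts=True)

# Indices that order the counts from most to least frequent.
# A stable sort on the negated counts keeps ties in alphabetical order.
order = np.argsort(-counts, kind="stable")

# Keep the top half of the distinct values.
top = order[: len(uniq) // 2]
top_words = uniq[top]
top_counts = counts[top]

print(top_words, top_counts)
```

If a two-column array is still needed at the end, np.column_stack((top_words, top_counts)) can be applied after the selection, since the order no longer depends on it.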

edited Jun 25 '22 at 12:28

answered Jun 25 '22 at 11:44

Beni Trainor

That's the way I ran my tests and it was perfectly OK. Your solution and mine both run OK. Unfortunately, when the numbers of occurrences are greater than 10 or 100, the sort treats these numbers as text! If you grow the sample list to 12 "zoological", then my issue appears... – tibibou Jun 25 '22 at 12:08
You're right. I'll have a look. – Beni Trainor Jun 25 '22 at 12:21
I've edited my answer for cases where occurrences exceed 9. I hope it helps. – Beni Trainor Jun 25 '22 at 12:29
Thank you, exactly what I was trying to achieve. Full working code : `word_counts = np.column_stack(np.unique(words, return_counts=True)) word_counts = list(word_counts) word_counts.sort(key=lambda entry: int(entry[1]), reverse=True) word_counts = np.array(word_counts)[:len(word_counts) // 2]` – tibibou Jun 25 '22 at 12:42
Hey @tibibou. If this answered your question could you mark it as "accepted"? This way others can see the answer that worked. – Beni Trainor Jun 26 '22 at 07:15
