Python - Finding most occurring words in a CSV row

Question

I want to find the most occurring substring in a CSV row either by itself, or by using a list of keywords for lookup.

I've found a way to get the top 5 most frequent words in each row of a CSV file with Python using the responses below, but that doesn't serve my purpose. It gives me results like:

[(' Trojan.PowerShell.LNK.Gen.2', 3),
(' Suspicious ZIP!lnk', 2),
(' HEUR:Trojan-Downloader.WinLNK.Powedon.a', 2),
(' TROJ_FR.8D496570', 2),
('Trojan.PowerShell.LNK.Gen.2', 1),
(' Trojan.PowerShell.LNK.Gen.2 (B)', 1),
(' Win32.Trojan-downloader.Powedon.Lrsa', 1),
(' PowerShell.DownLoader.466', 1),
(' malware (ai score=86)', 1),
(' Probably LNKScript', 1),
(' virus.lnk.powershell.a', 1),
(' Troj/LnkPS-A', 1),
(' Trojan.LNK', 1)]

Instead, I want something like 'Trojan', 'Downloader', 'Powershell' ... as the top results.

The matching words can be a substring of a value (cell) in the CSV, or a combination of two or more words. Can someone help fix this, either with a keyword list or without one?

Thanks!

Possible duplicate of [Efficiently calculate word frequency in a string](https://stackoverflow.com/questions/9919604/efficiently-calculate-word-frequency-in-a-string) — BcK, Jun 19 '18 at 05:21
Please print `df.head(5)` and post it here. This does not qualify as a valid example. — cs95, Jun 19 '18 at 05:26

score 0 · Answer 1 · answered Jun 19 '18 at 05:33

Let my_values = ['A', 'B', 'C', 'A', 'Z', 'Z', 'X', 'A', 'X', 'H', 'D', 'A', 'S', 'A', 'Z'] be the list of words you want to rank by frequency.

Now create a dictionary that will store the number of occurrences of each word.

count_dict = {}

Populate the dictionary with the counts:

for i in my_values:
    if count_dict.get(i) is None:  # Not in the dictionary yet, so this is the first occurrence
        count_dict[i] = 1
    else:
        count_dict[i] = count_dict[i] + 1  # Seen before, so increment its count

Now sort the dictionary's items by their counts, in descending order:

import operator

sorted_items = sorted(count_dict.items(), key=operator.itemgetter(1), reverse=True)

Now you have your expected results! The three most frequent values are:

print(sorted_items[:3])

Output:

[('A', 5), ('Z', 3), ('X', 2)]

The two most frequent values are:

print(sorted_items[:2])

Output:

[('A', 5), ('Z', 3)]

and so on.
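
The counting and sorting steps above can also be done with `collections.Counter` from the standard library. A minimal sketch, reusing the same my_values list (the names counts and top_three are only illustrative):

```python
from collections import Counter

my_values = ['A', 'B', 'C', 'A', 'Z', 'Z', 'X', 'A', 'X', 'H', 'D', 'A', 'S', 'A', 'Z']

# Counter tallies each distinct value; most_common(n) returns the n
# highest counts as (value, count) pairs, largest first.
counts = Counter(my_values)
top_three = counts.most_common(3)
print(top_three)  # [('A', 5), ('Z', 3), ('X', 2)]
```

Changing the argument to most_common gives the top 2, top 5, and so on.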

answered Jun 19 '18 at 05:33

Taohidul Islam

`Counter` reinvented unnecessarily. – BcK Jun 19 '18 at 05:41
Will you be more specific please? – Taohidul Islam Jun 19 '18 at 05:42
@TaohidulIslam this gives me " " as the most occurring string, and also doesn't account for substrings in each value. I think this is just considering the whole value of each cell in the CSV row. any fix for that? – harry04 Jun 19 '18 at 06:23
Please provide a list of your actual data. – Taohidul Islam Jun 19 '18 at 06:25
Give data list from which you want to get your expected result. – Taohidul Islam Jun 19 '18 at 06:26
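
For the substring concern raised in the comments, one possible approach is to split each cell into tokens before counting, so that 'Trojan' is counted separately from the rest of 'Trojan.PowerShell.LNK.Gen.2'. This is an assumption about what the asker wants, not something the answer above provides. In the sketch below, the sample row (a few values taken from the question) and the helper name top_tokens are hypothetical:

```python
import re
from collections import Counter

# Hypothetical sample row, built from a few of the values shown in the question.
row = [
    ' Trojan.PowerShell.LNK.Gen.2',
    ' HEUR:Trojan-Downloader.WinLNK.Powedon.a',
    ' Win32.Trojan-downloader.Powedon.Lrsa',
    ' PowerShell.DownLoader.466',
]

def top_tokens(cells, n=3, keywords=None):
    """Count alphabetic tokens across cells, ignoring case.

    If keywords is given, only tokens found in that set are counted.
    """
    counts = Counter()
    for cell in cells:
        # Lowercase the cell, then extract every run of letters as a token.
        tokens = re.findall(r'[a-z]+', cell.lower())
        if keywords is not None:
            tokens = [t for t in tokens if t in keywords]
        counts.update(tokens)
    return counts.most_common(n)

print(top_tokens(row))
print(top_tokens(row, keywords={'trojan', 'downloader', 'powershell'}))
```

This counts whole tokens only. A keyword embedded in a longer token, such as 'lnk' inside 'WinLNK', would need a substring check (for example `keyword in cell.lower()`) instead of token matching.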
