I want to find the most frequently occurring substrings in a CSV row, either directly or by looking them up from a list of keywords.

I've found a way to get the top 5 most frequent words in each row of a CSV file in Python using the responses below, but that doesn't solve my problem. It gives me results like:

[(' Trojan.PowerShell.LNK.Gen.2', 3),
(' Suspicious ZIP!lnk', 2),
(' HEUR:Trojan-Downloader.WinLNK.Powedon.a', 2),
(' TROJ_FR.8D496570', 2),
('Trojan.PowerShell.LNK.Gen.2', 1),
(' Trojan.PowerShell.LNK.Gen.2 (B)', 1),
(' Win32.Trojan-downloader.Powedon.Lrsa', 1),
(' PowerShell.DownLoader.466', 1),
(' malware (ai score=86)', 1),
(' Probably LNKScript', 1),
(' virus.lnk.powershell.a', 1),
(' Troj/LnkPS-A', 1),
(' Trojan.LNK', 1)]

What I want instead is something like 'Trojan', 'Downloader', 'Powershell' ... as the top results.

The matching words can be a substring of a value (cell) in the CSV, or a combination of two or more words. Can someone help fix this, either with a keyword list or without one?

Thanks!

harry04
  • Possible duplicate of [Efficiently calculate word frequency in a string](https://stackoverflow.com/questions/9919604/efficiently-calculate-word-frequency-in-a-string) – BcK Jun 19 '18 at 05:21
  • Please print `df.head(5)` and post it here. This does not qualify as a valid example. – cs95 Jun 19 '18 at 05:26

1 Answer

Let my_values = ['A', 'B', 'C', 'A', 'Z', 'Z', 'X', 'A', 'X', 'H', 'D', 'A', 'S', 'A', 'Z'] be your list of words to count and rank.

Now create a dictionary that will store the number of occurrences of each word.

count_dict={}

Populate the dictionary with the counts:

for i in my_values:
    if count_dict.get(i) is None:  # Not in the dictionary yet, so this is the first occurrence
        count_dict[i] = 1
    else:
        count_dict[i] = count_dict[i] + 1  # Seen before, so increment its count
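
As a side note, the counting loop above does by hand what collections.Counter from the standard library does. A minimal sketch, reusing the same my_values list and the name count_dict so the steps that follow still apply:

```python
from collections import Counter

my_values = ['A', 'B', 'C', 'A', 'Z', 'Z', 'X', 'A', 'X', 'H', 'D', 'A', 'S', 'A', 'Z']

# Counter is a dict subclass that maps each distinct value to its count,
# so it can be used wherever the hand-built dictionary is used.
count_dict = Counter(my_values)

print(count_dict['A'])  # 5
```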

Now sort the dictionary's items by their counts, in descending order (this needs the operator module):

import operator

sorted_items = sorted(count_dict.items(), key=operator.itemgetter(1), reverse=True)

Now you have your expected results! The 3 most frequent values are:

print(sorted_items[:3])

output :

[('A', 5), ('Z', 3), ('X', 2)]

The 2 most frequent values are:

print(sorted_items[:2])

output:

[('A', 5), ('Z', 3)]

and so on.
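
To get results like 'Trojan' or 'Downloader' from the labels in the question, each cell has to be broken into smaller pieces before counting. The sketch below is one possible way to do that, not a definitive solution: the function name top_tokens, its parameters, and the sample labels list (a few values copied from the question's output) are all assumptions made for illustration. Without a keyword list it splits each label on non-alphanumeric characters; with a keyword list it does a case-insensitive substring lookup and counts each keyword once per label that contains it.

```python
import re
from collections import Counter


def top_tokens(labels, keywords=None, n=3):
    """Return the n most frequent tokens (or keywords) across labels.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    counts = Counter()
    for label in labels:
        text = label.lower()
        if keywords is None:
            # Split on runs of anything that is not a letter or digit
            # ('.', '-', ':', '/', spaces, ...) and drop empty pieces.
            counts.update(t for t in re.split(r"[^a-z0-9]+", text) if t)
        else:
            # Substring lookup: count a keyword once per label containing it.
            counts.update(k for k in keywords if k.lower() in text)
    return counts.most_common(n)


# Sample values taken from the output shown in the question.
labels = [
    ' Trojan.PowerShell.LNK.Gen.2',
    ' HEUR:Trojan-Downloader.WinLNK.Powedon.a',
    ' Win32.Trojan-downloader.Powedon.Lrsa',
    ' PowerShell.DownLoader.466',
    ' Trojan.LNK',
]

print(top_tokens(labels, n=2))
# [('trojan', 4), ('downloader', 3)]

print(top_tokens(labels, keywords=['Trojan', 'Downloader', 'PowerShell']))
# [('Trojan', 4), ('Downloader', 3), ('PowerShell', 2)]
```

Splitting without a keyword list also produces short noise tokens such as 'a' or '2', so in practice you may want to filter by minimum length or fall back to the keyword list.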

Taohidul Islam